crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development
https://www.crwlr.software/packages/crawler
MIT License
312 stars, 11 forks

Documentation request #131

Closed: derjochenmeyer closed this issue 5 months ago

derjochenmeyer commented 5 months ago

I want to add response data to the result, as documented here:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::get()
            ->addToResult(['url', 'status', 'headers', 'body'])
    );

From the documentation, I cannot figure out how to add this step to my working code, which looks like this:

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
            ->maxOutputs(5)
    )
    ->addStep(
        Crawler::group()
            ->addStep(
                Html::root()
                    ->extract([
                        'title' => 'h1',
                        'date' => '#date',
                    ])
            )
            ->addToResult(['page'])
            ->addStep(
                Html::metaData()
                    ->only(['keywords', 'publisher'])
            )
            ->addToResult()
    );

Is there a way to add an Http::get() step to this approach? Or is there another solution?

otsch commented 5 months ago

Hey @derjochenmeyer 👋

The Http::crawl() step produces the same output as every other Http step, so you can call addToResult() on it directly:

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
            ->maxOutputs(5)
            ->addToResult(['url', 'status', 'headers', 'body'])
    )
    ->addStep(
        Crawler::group()
            ->addStep(
                Html::root()
                    ->extract([
                        'title' => 'h1',
                        'date' => '#date',
                    ])
            )
            ->addStep(
                Html::metaData()
                    ->only(['keywords', 'publisher'])
            )
            ->addToResult()
    );

Hope that solves your problem. I'll try to add this information to the docs somehow 👍🏻

derjochenmeyer commented 5 months ago

Thanks a million! That solved it.
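
For anyone landing on this thread later, here is a minimal sketch of the accepted answer as a complete script with its imports. It assumes the package was installed with Composer and that HttpCrawler::make()->withBotUserAgent() is available in the installed version. The user agent name 'MyCrawler' and the helper summarizeRow() are hypothetical and not from the thread. summarizeRow() also assumes the result array contains the keys passed to addToResult(). Method names can differ between major versions, so check the docs for the version you have installed.

```php
<?php

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

/**
 * Builds the crawler from the answer above.
 * 'MyCrawler' is a placeholder user agent name (an assumption).
 */
function buildCrawler(): HttpCrawler
{
    $crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

    $crawler->input('https://www.example.com/sitemap.xml')
        ->addStep(
            Http::crawl()
                ->inputIsSitemap()
                ->maxOutputs(5)
                // Response data is added on the loading step itself.
                ->addToResult(['url', 'status', 'headers', 'body'])
        )
        ->addStep(
            Crawler::group()
                ->addStep(
                    Html::root()
                        ->extract([
                            'title' => 'h1',
                            'date' => '#date',
                        ])
                )
                ->addStep(
                    Html::metaData()
                        ->only(['keywords', 'publisher'])
                )
                ->addToResult()
        );

    return $crawler;
}

/**
 * Hypothetical helper: replaces the (potentially large) 'body' entry
 * with its length in bytes so a result row is easier to inspect.
 */
function summarizeRow(array $row): array
{
    $body = (string) ($row['body'] ?? '');
    unset($row['body']);
    $row['bodyLength'] = strlen($body);

    return $row;
}

// Usage (needs `composer require crwlr/crawler` and network access):
//
// require __DIR__ . '/vendor/autoload.php';
//
// foreach (buildCrawler()->run() as $result) {
//     print_r(summarizeRow($result->toArray()));
// }
```

The crawler is wrapped in a function so that the file can be loaded without starting a crawl. The commented usage lines show how the results would be iterated.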