crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development
https://www.crwlr.software/packages/crawler
MIT License
312 stars 11 forks

sub steps #103

Closed · TheCrealm closed this issue 1 year ago

TheCrealm commented 1 year ago

Is there a way to create sub steps for outputs?

I've crawled a list of book series and got this output array:

[
   {
      "title":"a Book series",
      "author":"book series author",
      "volumes":[
         "..list of urls.."
      ]
   },
   {
      "title":"Just another series",
      "author":"best author",
      "volumes":[
         "..list of urls.."
      ]
   }
]

Now I want to make sub-requests to those URLs to get an output array like this:

[
   {
      "title":"a Book series",
      "author":"book series author",
      "volumes":[
         {
            "title":"A book series - part 1",
            "volumeNumber":1,
            "price":2499
         },
         {
            "title":"A book series - part 2",
            "volumeNumber":2,
            "price":2599
         }
      ]
   },
   {
      "title":"Just another series",
      "author":"best author",
      "volumes":[
         {
            "title":"Just another series - the good ones",
            "volumeNumber":1,
            "price":1999
         },
         {
            "title":"Just another series - the bad ones",
            "volumeNumber":2,
            "price":2699
         }
      ]
   }
]

The most practical solution I found is to use a transformer step and invoke a second crawler from there, but that doesn't seem very elegant to me. Is there maybe already a better way to accomplish this?

otsch commented 1 year ago

Hey @TheCrealm.

OK, so I presume the arrays in your examples represent multiple outputs. I tested this with a step like:

use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        yield [
            'title' => 'Something',
            'author' => 'Someone',
            'volumes' => [
                'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/custom-steps',
                'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/compose-results',
                'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/groups',
            ]
        ];

        yield [
            'title' => 'Something else',
            'author' => 'Someone else',
            'volumes' => [
                'https://www.crwlr.software/packages/crawler/v1.0/steps-and-data-flow/custom-steps',
                'https://www.crwlr.software/packages/crawler/v1.0/steps-and-data-flow/compose-results',
                'https://www.crwlr.software/packages/crawler/v1.0/steps-and-data-flow/groups',
            ]
        ];
    }
}

You can get (almost) the result structure you want like this:

$crawler
    ->input('https://www.example.com')
    ->addStep(
        (new MyStep())->addToResult() // The step that creates the mentioned output with a list of URLs as 'volumes'
    )
    ->addStep(
        Http::get()->useInputKey('volumes') // Using the 'volumes' key loads all the URLs and yields each response as a separate output.
    )
    ->addStep(
        Html::root()
            ->extract([...])              // This HTML extract step produces array output
            ->addToResult('volumesData')  // and using `addToResult()` like this, adds those outputs
                                          // with the key `volumesData` to the result.
    );

To explain this a little further: the Result object is initialized when the output data from MyStep is added to the result. When a step later in the chain yields multiple outputs from one input, all of those outputs add their data to that same single Result object (see this visualization in the docs). That is the case for the next Http::get(): the input it is invoked with is one whole output from MyStep, and it produces multiple outputs because it uses the array of volumes. So when the third step repeatedly adds its data to the volumesData property, that property becomes an array in the Result object. Assuming the third step produces outputs like ['foo' => '...', 'bar' => '...'], the final results then look like:

[
    'title' => 'Something',
    'author' => 'Someone',
    'volumes' => [
        'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/custom-steps',
        'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/compose-results',
        'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/groups',
    ],
    'volumesData' => [
        ['foo' => '...', 'bar' => '...'], // extracted from the first URL from volumes.
        ['foo' => '...', 'bar' => '...'], // extracted from the second URL from volumes.
        ['foo' => '...', 'bar' => '...'], // extracted from the third URL from volumes.
    ],
]

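The way many outputs from one parent input all land under a single result key can be sketched in plain PHP, independent of the library. Everything here (the function, URLs, and data) is illustrative, not crwlr API:

```php
<?php

// Stand-in for the Html extract step: yields one output array per URL.
function extractVolumeData(string $url): Generator
{
    yield ['foo' => 'data from ' . $url, 'bar' => '...'];
}

// One Result, initialized from the first step's output.
$result = [
    'title' => 'Something',
    'author' => 'Someone',
    'volumes' => ['https://example.com/vol-1', 'https://example.com/vol-2'],
];

// Every output a later step produces from this parent's data is added
// under the same key, so 'volumesData' ends up being an array of arrays.
foreach ($result['volumes'] as $url) {
    foreach (extractVolumeData($url) as $output) {
        $result['volumesData'][] = $output;
    }
}
```

This only shows the data flow; the library does this bookkeeping for you via addToResult().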
Watch out: you can use addToResult() in different ways: called with no arguments it adds the step's whole output to the result, called with a string it adds the output under that key, and called with an array of keys it picks only those properties from the output.

I know it's not 100% what you wanted, because the result still contains the array of volume URLs separately. I'm thinking about adding a new method like replaceInResult() that you could use instead of addToResult() to solve this problem. What would you think about that?

otsch commented 1 year ago

Ah, and by the way: I just made a little bugfix for an issue I discovered while testing this: https://github.com/crwlrsoft/crawler/releases/tag/v1.1.1 So please upgrade to the latest version before trying this.

TheCrealm commented 1 year ago

Hey @otsch,

Thanks for this comprehensive answer, it works! I had some misunderstanding of how this library works, mostly because I had never worked with PHP Generators before. After your answer and some further research on generators, I now have at least a basic idea of how it all fits together.
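In case it helps others who land here: the key idea is that a function containing yield does not execute when called; it returns a Generator object that produces its values lazily, one per iteration. A minimal, standalone example:

```php
<?php

// A function containing `yield` returns a Generator instead of running
// its body immediately; values are produced one at a time on demand.
function numbers(): Generator
{
    foreach ([1, 2, 3] as $n) {
        yield $n * 10; // pauses here until the next value is requested
    }
}

$collected = [];
foreach (numbers() as $value) {
    $collected[] = $value; // resumes the function body for each value
}
// $collected is now [10, 20, 30]
```

This is why the steps in this library can pass outputs along one by one instead of building big arrays in memory.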

For reference, this is the real world crawler i built :)

$crawler->input('https://altraverse.de/manga/')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.navigation--link')
            ->where(Filter::urlPathMatches('^\/manga\/[[:alnum:]]'))
    )
    ->addStep(Http::get()->keepInputData('url')->outputKey('response')->addLaterToResult(['url']))
    ->addStep(
        Html::root()->extract([
            'title' => Dom::cssSelector('.hero--headline')->text(),
            'description' => Dom::cssSelector('.teaser--text-long')->first()->text(),
            'volumeUrls' => Dom::cssSelector('.product--title')->link()
        ])
            // Some pages do not have volumes as it's an upcoming series! null not allowed
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $outputData['volumeUrls'] = $outputData['volumeUrls'] ?? [];
                return $outputData;
            })
            ->useInputKey('response')->addToResult(['title', 'description'])
    )
    ->addStep(Http::get()->useInputKey('volumeUrls'))
    ->addStep(
        Html::root()->extract([
            'title' => Dom::cssSelector('.product--title')->first()->text(),
            'entry-keys' => Dom::cssSelector('.entry--label')->text(),
            'entry-values' => Dom::cssSelector('.entry--content')->text()
        ])
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $merged = [];
                foreach ($outputData['entry-keys'] as $i => $key) {
                    $merged[$key] = $outputData['entry-values'][$i];
                }

                $outputData['metadata'] = $merged;
                unset($outputData['entry-keys'], $outputData['entry-values']);
                return $outputData;
            })
            ->addToResult('volumes')
    );
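As an aside, the key/value merge in the last refiner can also be written with PHP's built-in array_combine(), assuming both node lists always have the same length (the sample labels and values below are made up):

```php
<?php

// array_combine() pairs each key with the value at the same index,
// which is exactly what the foreach loop in the refiner does.
$entryKeys = ['ISBN', 'Pages'];
$entryValues = ['978-3-16-148410-0', '192'];

$metadata = array_combine($entryKeys, $entryValues);
// ['ISBN' => '978-3-16-148410-0', 'Pages' => '192']
```

Note that array_combine() fails if the two arrays differ in length, so the explicit loop is the safer choice if a label can ever lack a value.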

otsch commented 1 year ago

Hey, first of all, thanks for sponsoring! 🫶

I totally understand that. I actually never used Generators before writing this library either. In fact, I started out without using Generators for the steps, but I soon found that they are a great fit in the crawler/scraper context to keep memory usage as low as possible.

Nice! It's always great to see how people actually use the library. It's a smart solution to add only title and description to the result in the fourth step, so you don't have the volumeUrls in the final result 👍🏻 And I also really like the refiner that builds the metadata in the last step! I was already thinking about a solution for the Html steps to support dynamic output keys taken from the page via CSS selectors. I'll add this in one of the next versions.

Some minor improvements that you could probably make:

- You can add the URL to the result directly in the link step (addToResult('url')) instead of carrying it through the Http step with keepInputData('url'), outputKey('response') and addLaterToResult(['url']).
- When you only need an element's text content, you can pass the CSS selector as a plain string, like 'title' => '.hero--headline', instead of Dom::cssSelector('.hero--headline')->text().
- Without the outputKey('response') mapping, the useInputKey('response') call on the extract step isn't needed anymore.

So, the first four steps would look like this:

$crawler
    ->input('https://altraverse.de/manga/')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.navigation--link')
            ->where(Filter::urlPathMatches('^\/manga\/[[:alnum:]]'))
            ->addToResult('url')
    )
    ->addStep(Http::get())
    ->addStep(
        Html::root()->extract([
            'title' => '.hero--headline',
            'description' => Dom::cssSelector('.teaser--text-long')->first()->text(),
            'volumeUrls' => Dom::cssSelector('.product--title')->link()
        ])
            // Some pages do not have volumes as it's an upcoming series! null not allowed
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $outputData['volumeUrls'] = $outputData['volumeUrls'] ?? [];
                return $outputData;
            })
            ->addToResult(['title', 'description'])
    )

And another tip: as the definition of a crawling procedure grows, readability can suffer. I don't know if it suits your taste, but when a procedure grows bigger, I like to put it in a class and build the steps in methods with descriptive names, like:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\StepInterface;

class MyCrawlingProcedure
{
    private HttpCrawler $crawler;

    public function __construct()
    {
        $this->crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

        $this->crawler
            ->input('https://altraverse.de/manga/')
            ->addStep(Http::get())
            ->addStep($this->getLinksFromListPage())
            ->addStep(Http::get())
            ->addStep($this->getDataFromDetailPage())
            ->addStep(Http::get()->useInputKey('volumeUrls'))
            ->addStep($this->getDataFromVolumePage());
    }

    public function run(): Generator
    {
        return $this->crawler->run();
    }

    private function getLinksFromListPage(): StepInterface
    {
        return Html::getLinks('.navigation--link')
            ->where(Filter::urlPathMatches('^\/manga\/[[:alnum:]]'))
            ->addLaterToResult('url');
    }

    private function getDataFromDetailPage(): StepInterface
    {
        return Html::root()->extract([
            'title' => '.hero--headline',
            'description' => Dom::cssSelector('.teaser--text-long')->first()->text(),
            'volumeUrls' => Dom::cssSelector('.product--title')->link()
        ])
            // Some pages do not have volumes as it's an upcoming series! null not allowed
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $outputData['volumeUrls'] = $outputData['volumeUrls'] ?? [];
                return $outputData;
            })
            ->addToResult(['title', 'description']);
    }

    private function getDataFromVolumePage(): StepInterface
    {
        return Html::root()->extract([
            'title' => Dom::cssSelector('.product--title')->first()->text(),
            'entry-keys' => '.entry--label',
            'entry-values' => '.entry--content',
        ])
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $merged = [];
                foreach ($outputData['entry-keys'] as $i => $key) {
                    $merged[$key] = $outputData['entry-values'][$i];
                }

                $outputData['metadata'] = $merged;
                unset($outputData['entry-keys'], $outputData['entry-values']);
                return $outputData;
            })
            ->addToResult('volumes');
    }
}