crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development
https://www.crwlr.software/packages/crawler
MIT License
312 stars 11 forks

sub steps #103

Closed · TheCrealm closed this issue 1 year ago

TheCrealm commented 1 year ago

Is there a way to create sub steps for outputs?

I've crawled a list of book series and got this output array:

[
   {
      "title":"a Book series",
      "author":"book series author",
      "volumes":[
         "..list of urls.."
      ]
   },
   {
      "title":"Just another series",
      "author":"best author",
      "volumes":[
         "..list of urls.."
      ]
   }
]

Now I want to make sub-requests to those URLs to get an output array like this:

[
   {
      "title":"a Book series",
      "author":"book series author",
      "volumes":[
         {
            "title":"A book series - part 1",
            "volumeNumber":1,
            "price":2499
         },
         {
            "title":"A book series - part 2",
            "volumeNumber":2,
            "price":2599
         }
      ]
   },
   {
      "title":"Just another series",
      "author":"best author",
      "volumes":[
         {
            "title":"Just another series - the good ones",
            "volumeNumber":1,
            "price":1999
         },
         {
            "title":"Just another series - the bad ones",
            "volumeNumber":2,
            "price":2699
         }
      ]
   }
]

The most practical solution I found is to use a transformer step and invoke a second crawler from there, but that doesn't seem very elegant to me. Is there maybe already a better way to accomplish this?

otsch commented 1 year ago

Hey @TheCrealm.

OK, so I presume the arrays in your examples represent multiple outputs. I tested this with a step like:

use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        yield [
            'title' => 'Something',
            'author' => 'Someone',
            'volumes' => [
                'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/custom-steps',
                'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/compose-results',
                'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/groups',
            ]
        ];

        yield [
            'title' => 'Something else',
            'author' => 'Someone else',
            'volumes' => [
                'https://www.crwlr.software/packages/crawler/v1.0/steps-and-data-flow/custom-steps',
                'https://www.crwlr.software/packages/crawler/v1.0/steps-and-data-flow/compose-results',
                'https://www.crwlr.software/packages/crawler/v1.0/steps-and-data-flow/groups',
            ]
        ];
    }
}

You can get (almost) the result structure you want like this:

$crawler
    ->input('https://www.example.com')
    ->addStep(
        (new MyStep())->addToResult() // The step that creates the mentioned output with a list of URLs as 'volumes'
    )
    ->addStep(
        Http::get()->useInputKey('volumes') // Using the 'volumes' key loads all the URLs and yields each response as a separate output.
    )
    ->addStep(
        Html::root()
            ->extract([...])              // This HTML extract step produces array output
            ->addToResult('volumesData')  // and using `addToResult()` like this, adds those outputs
                                          // with the key `volumesData` to the result.
    );

To explain this a little further: the Result object is initialized when the output data from MyStep is added to the result. When a step later in the chain yields multiple outputs from one input, all of those outputs add their data to that same single Result object (see this visualization in the docs). That is the case for the next Http::get(): the input it is invoked with is one whole output from MyStep, and it produces multiple outputs because it uses the array of volumes. So when the third step repeatedly adds its data to the volumesData property, that property becomes an array in the Result object. Assuming the third step produces outputs like ['foo' => '...', 'bar' => '...'], the final results then look like:

[
    'title' => 'Something',
    'author' => 'Someone',
    'volumes' => [
        'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/custom-steps',
        'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/compose-results',
        'https://www.crwlr.software/packages/crawler/v1.1/steps-and-data-flow/groups',
    ],
    'volumesData' => [
        ['foo' => '...', 'bar' => '...'], // extracted from the first URL from volumes.
        ['foo' => '...', 'bar' => '...'], // extracted from the second URL from volumes.
        ['foo' => '...', 'bar' => '...'], // extracted from the third URL from volumes.
    ],
]

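The way many outputs from one parent input all land under a single result key can be sketched in plain PHP, independent of the library. Everything here (the function, URLs, and data) is illustrative, not crwlr API:

```php
<?php

// Stand-in for the Html extract step: yields one output array per URL.
function extractVolumeData(string $url): Generator
{
    yield ['foo' => 'data from ' . $url, 'bar' => '...'];
}

// One Result, initialized from the first step's output.
$result = [
    'title' => 'Something',
    'author' => 'Someone',
    'volumes' => ['https://example.com/vol-1', 'https://example.com/vol-2'],
];

// Every output a later step produces from this parent's data is added
// under the same key, so 'volumesData' ends up being an array of arrays.
foreach ($result['volumes'] as $url) {
    foreach (extractVolumeData($url) as $output) {
        $result['volumesData'][] = $output;
    }
}
```

This only shows the data flow; the library does this bookkeeping for you via addToResult().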
Watch out: you can use addToResult() in different ways: called with no arguments it adds the step's whole output to the result, called with a string it adds the output under that key, and called with an array of keys it picks only those properties from the output.

I know it's not 100% what you wanted, because the result still contains the array of volume URLs separately. I'm thinking about adding a new method like replaceInResult() that you could use instead of addToResult() to solve this problem. What would you think about that?

otsch commented 1 year ago

Ah, and by the way: I just made a little bugfix for an issue I discovered while testing this: https://github.com/crwlrsoft/crawler/releases/tag/v1.1.1 So please upgrade to the latest version before trying this.

TheCrealm commented 1 year ago

Hey @otsch,

Thanks for this comprehensive answer, it works! I had some misunderstanding of how this library works, mostly because I had never worked with PHP Generators before. After your answer and some further research on generators, I now have at least a basic idea of how it all fits together.
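In case it helps others who land here: the key idea is that a function containing yield does not execute when called; it returns a Generator object that produces its values lazily, one per iteration. A minimal, standalone example:

```php
<?php

// A function containing `yield` returns a Generator instead of running
// its body immediately; values are produced one at a time on demand.
function numbers(): Generator
{
    foreach ([1, 2, 3] as $n) {
        yield $n * 10; // pauses here until the next value is requested
    }
}

$collected = [];
foreach (numbers() as $value) {
    $collected[] = $value; // resumes the function body for each value
}
// $collected is now [10, 20, 30]
```

This is why the steps in this library can pass outputs along one by one instead of building big arrays in memory.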

For reference, this is the real world crawler i built :)

$crawler->input('https://altraverse.de/manga/')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.navigation--link')
            ->where(Filter::urlPathMatches('^\/manga\/[[:alnum:]]'))
    )
    ->addStep(Http::get()->keepInputData('url')->outputKey('response')->addLaterToResult(['url']))
    ->addStep(
        Html::root()->extract([
            'title' => Dom::cssSelector('.hero--headline')->text(),
            'description' => Dom::cssSelector('.teaser--text-long')->first()->text(),
            'volumeUrls' => Dom::cssSelector('.product--title')->link()
        ])
            // Some pages do not have volumes as it's an upcoming series! null not allowed
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $outputData['volumeUrls'] = $outputData['volumeUrls'] ?? [];
                return $outputData;
            })
            ->useInputKey('response')->addToResult(['title', 'description'])
    )
    ->addStep(Http::get()->useInputKey('volumeUrls'))
    ->addStep(
        Html::root()->extract([
            'title' => Dom::cssSelector('.product--title')->first()->text(),
            'entry-keys' => Dom::cssSelector('.entry--label')->text(),
            'entry-values' => Dom::cssSelector('.entry--content')->text()
        ])
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $merged = [];
                foreach ($outputData['entry-keys'] as $i => $key) {
                    $merged[$key] = $outputData['entry-values'][$i];
                }

                $outputData['metadata'] = $merged;
                unset($outputData['entry-keys'], $outputData['entry-values']);
                return $outputData;
            })
            ->addToResult('volumes')
    );
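As an aside, the key/value merge in the last refiner can also be written with PHP's built-in array_combine(), assuming both node lists always have the same length (the sample labels and values below are made up):

```php
<?php

// array_combine() pairs each key with the value at the same index,
// which is exactly what the foreach loop in the refiner does.
$entryKeys = ['ISBN', 'Pages'];
$entryValues = ['978-3-16-148410-0', '192'];

$metadata = array_combine($entryKeys, $entryValues);
// ['ISBN' => '978-3-16-148410-0', 'Pages' => '192']
```

Note that array_combine() fails if the two arrays differ in length, so the explicit loop is the safer choice if a label can ever lack a value.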

otsch commented 1 year ago

Hey, first of all, thanks for sponsoring! 🫶

I totally understand that. I actually never used Generators before writing this library either. In fact, I started out without using Generators for the steps, but I soon found that they are a great fit in the crawler/scraper context to keep memory usage as low as possible.

Nice! It's always great to see how people actually use the library. It's a smart solution to add only title and description to the result in the fourth step, so you don't have the volumeUrls in the final result 👍🏻 And I also really like the refiner that builds the metadata in the last step! I was already thinking about a solution for the Html steps to support dynamic output keys taken from the page via CSS selectors. I'll add this in one of the next versions.

Some minor improvements that you could probably make:

- You can add the URL to the result directly in the link step (addToResult('url')) instead of carrying it through the Http step with keepInputData('url'), outputKey('response') and addLaterToResult(['url']).
- When you only need an element's text content, you can pass the CSS selector as a plain string, like 'title' => '.hero--headline', instead of Dom::cssSelector('.hero--headline')->text().
- Without the outputKey('response') mapping, the useInputKey('response') call on the extract step isn't needed anymore.

So, the first four steps would look like this:

$crawler
    ->input('https://altraverse.de/manga/')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.navigation--link')
            ->where(Filter::urlPathMatches('^\/manga\/[[:alnum:]]'))
            ->addToResult('url')
    )
    ->addStep(Http::get())
    ->addStep(
        Html::root()->extract([
            'title' => '.hero--headline',
            'description' => Dom::cssSelector('.teaser--text-long')->first()->text(),
            'volumeUrls' => Dom::cssSelector('.product--title')->link()
        ])
            // Some pages do not have volumes as it's an upcoming series! null not allowed
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $outputData['volumeUrls'] = $outputData['volumeUrls'] ?? [];
                return $outputData;
            })
            ->addToResult(['title', 'description'])
    )

And another tip: as the definition of a crawling procedure grows, readability can suffer. I don't know if it suits your taste, but when a procedure grows bigger, I like to put it in a class and build the steps in methods with descriptive names, like:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\StepInterface;

class MyCrawlingProcedure
{
    private HttpCrawler $crawler;

    public function __construct()
    {
        $this->crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

        $this->crawler
            ->input('https://altraverse.de/manga/')
            ->addStep(Http::get())
            ->addStep($this->getLinksFromListPage())
            ->addStep(Http::get())
            ->addStep($this->getDataFromDetailPage())
            ->addStep(Http::get()->useInputKey('volumeUrls'))
            ->addStep($this->getDataFromVolumePage());
    }

    public function run(): Generator
    {
        return $this->crawler->run();
    }

    private function getLinksFromListPage(): StepInterface
    {
        return Html::getLinks('.navigation--link')
            ->where(Filter::urlPathMatches('^\/manga\/[[:alnum:]]'))
            ->addLaterToResult('url');
    }

    private function getDataFromDetailPage(): StepInterface
    {
        return Html::root()->extract([
            'title' => '.hero--headline',
            'description' => Dom::cssSelector('.teaser--text-long')->first()->text(),
            'volumeUrls' => Dom::cssSelector('.product--title')->link()
        ])
            // Some pages do not have volumes as it's an upcoming series! null not allowed
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $outputData['volumeUrls'] = $outputData['volumeUrls'] ?? [];
                return $outputData;
            })
            ->addToResult(['title', 'description']);
    }

    private function getDataFromVolumePage(): StepInterface
    {
        return Html::root()->extract([
            'title' => Dom::cssSelector('.product--title')->first()->text(),
            'entry-keys' => '.entry--label',
            'entry-values' => '.entry--content',
        ])
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $merged = [];
                foreach ($outputData['entry-keys'] as $i => $key) {
                    $merged[$key] = $outputData['entry-values'][$i];
                }

                $outputData['metadata'] = $merged;
                unset($outputData['entry-keys'], $outputData['entry-values']);
                return $outputData;
            })
            ->addToResult('volumes');
    }
}