crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development
https://www.crwlr.software/packages/crawler
MIT License

Process URLs from sitemap in chunks #133

Closed derjochenmeyer closed 5 months ago

derjochenmeyer commented 5 months ago

Is there a way to process the URLs from the sitemap in chunks of 500? With a large sitemap and a lot of HTML to extract, the script runs out of memory.

I was expecting runAndTraverse() to store the results after fetching each URL, but the script only writes all the results once all URLs have been fetched.

$crawler->setStore(new MyStore());

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap())
    ->addStep(Http::get())
    ->addStep(
        [...]
    );

$crawler->runAndTraverse();
otsch commented 5 months ago

The library actually uses Generators to work as memory-efficiently as possible, and it passes each response on to the following steps before loading the next page. A small problem is that, at the end, it has to wait for all outputs that are children of one and the same previous output to be processed completely before storing the result. That's because all those child outputs can contribute to the same single parent result object.
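
To make that a bit more concrete, here is a toy sketch (plain PHP, not the library's actual internals) of why a generator-based pipeline can keep memory low per child output and yet only hand the combined result to the store once all children of one parent have been processed:

<?php

// Toy illustration only - not crwlr/crawler internals. One parent output
// (e.g. a sitemap response) yields many child outputs (page URLs) lazily via
// a Generator, so memory stays low while iterating. But because every child
// can still contribute data to the same parent result, that result can only
// be handed to the store after the last child has been consumed.

$storeResult = function (array $result): void {
    echo 'storing result with ' . count($result) . " child outputs\n";
};

$childOutputs = function (array $childUrls): \Generator {
    foreach ($childUrls as $url) {
        yield ['url' => $url, 'data' => 'extracted data for ' . $url];
    }
};

$parentResult = [];

foreach ($childOutputs(['https://www.example.com/a', 'https://www.example.com/b']) as $child) {
    $parentResult[] = $child; // each child is processed one by one and added to the parent result
}

$storeResult($parentResult); // only now is the parent result complete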

If you picture the whole data flow as a tree, like the one visualized at the bottom of this page, this becomes easier to see.

Could you maybe share your actual crawler code with me? Maybe I can help you more specifically then. You can also share it privately if you want.

derjochenmeyer commented 5 months ago

Thank you for visualizing.

And sure, I can share the code (not refactored; learning / work in progress).

Here is the script I run.

<?php

ini_set('memory_limit', '5G');

require_once 'vendor/autoload.php';
require_once 'MyCrawler.php';
require_once 'MyStore.php';

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

MyCrawler::setMemoryLimit('5G');

$crawler = new MyCrawler();

$crawler->monitorMemoryUsage();

// TODO: Find a better place for dbConfig
$dbConfig = [
    'host' => 'H',
    'dbname' => 'N',
    'user' => 'U',
    'password' => 'P'
];

$crawler->setStore(new MyStore($dbConfig));

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap())
    ->addStep(Http::get())
    ->addStep(
        Crawler::group()
            ->addStep(
                Html::root()
                    ->extract([
                        'title' => 'h1.title',
                        'article' => Dom::cssSelector('.article')->html(),
                    ])
            )
            ->addToResult()
            ->addStep(
                Html::metaData()
                    ->only(['keywords', 'author'])
            )
            ->addToResult()
    );

$crawler->runAndTraverse();

This is what MyStore looks like:

<?php

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class MyStore extends Store
{
    protected ?PDO $pdo = null;
    protected string $tableName;
    protected array $dbConfig;

    public function __construct(array $dbConfig)
    {
        $this->dbConfig = $dbConfig;
        $this->tableName = $this->generateTableName();
        $this->createTable();
    }

    protected function connect(): void
    {
        if ($this->pdo === null) {
            $this->pdo = new PDO(
                'mysql:host=' . $this->dbConfig['host'] . ';dbname=' . $this->dbConfig['dbname'],
                $this->dbConfig['user'],
                $this->dbConfig['password'],
                [
                    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
                    PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
                ]
            );
        }
    }

    protected function disconnect(): void
    {
        $this->pdo = null;
    }

    protected function generateTableName(): string
    {
        // Create a unique table name for every crawler run
        return 'articles_' . date("Ymd_His");
    }

    protected function createTable(): void
    {
        $this->connect();

        $query = "CREATE TABLE IF NOT EXISTS {$this->tableName} (
              id        int NOT NULL AUTO_INCREMENT,
              title     varchar(256) NOT NULL DEFAULT '',
              author    varchar(64) NOT NULL DEFAULT '',
              keywords  varchar(128) NOT NULL DEFAULT '',
              article   text,
              PRIMARY KEY(id)
          ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
        ";

        $this->pdo->exec($query);

        $this->disconnect();
    }

    public function store(Result $result): void
    {
        $this->connect();

        $query = "INSERT INTO {$this->tableName} (
                    title,
                    author,
                    keywords,
                    article
                  ) VALUES (
                    :title,
                    :author,
                    :keywords,
                    :article
                  )";

        // Prepare a statement for execution
        $statement = $this->pdo->prepare($query);
        $statement->bindValue(':title', $result->get('title'));
        $statement->bindValue(':author', $result->get('author'));
        $statement->bindValue(':keywords', $result->get('keywords'));
        $statement->bindValue(':article', $result->get('article'));

        $statement->execute();

        $this->disconnect();

        gc_collect_cycles();
    }
}

MyCrawler looks like this:

<?php

require_once 'vendor/autoload.php';

use Crwlr\Crawler\Cache\FileCache;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Utils\Microseconds;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    public function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $cache = new FileCache(__DIR__ . '/filecache');
        $cache->ttl(43200);

        $loader
            ->setCache($cache)
            ->retryCachedErrorResponses();

        $loader->throttle()
            ->waitBetween(
                Microseconds::fromSeconds(0.1),
                Microseconds::fromSeconds(0.2)
            );

        return $loader;
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent(
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0'
        );
    }
}
otsch commented 5 months ago

OK, I just ran this crawler on a news site sitemap containing about 740 articles. The memory usage started at around 6 MB and went up to 9 MB at the end. I can confirm that the store is never called until the end, and in the case of this crawler I would have guessed it should already be called while running. I'll investigate this, but I'll probably not find the time to do it this week.

Until then: does this mean the site you're crawling must really be quite huge? Like hundreds of thousands of pages? Or is there maybe an issue with changing the memory limit in your environment? Sometimes PHP doesn't allow you to change the memory limit programmatically. Have you checked whether setting the limit actually worked? Like:

Crawler::setMemoryLimit('5G');

var_dump('Actual memory limit is: ' . Crawler::getMemoryLimit());
derjochenmeyer commented 5 months ago

Thank you for the support!

The var_dump returns "Actual memory limit is: 5G".

Running the crawler, the script starts at a memory usage of 1578144 and is Killed somewhere above 2250 URLs (between 2250 and 2500) at a memory usage of 548062280. In that case MyStore creates the DB table but never writes any entries.

Pulling the documents from the FileCache, the crawler runs super fast and writes the entries to the DB in about a minute when using ->maxOutputs(2250), like this:

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap())
    ->addStep(Http::get()->maxOutputs(2250))
    ->addStep(
        Crawler::group()
        [...]
    );

So it's not a massive set of URLs. The HTML, however, sometimes contains SVGs and base64-encoded images.

otsch commented 5 months ago

🤔 But the reported memory usage comes from PHP's memory_get_usage(), which returns the value in bytes. If I'm not wrong, 548062280 is just a little above 500 MB. Does it really fail because it runs out of memory?
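
A quick back-of-the-envelope conversion of that number, since memory_get_usage() reports bytes:

var_dump(548062280 / 1024 / 1024); // ≈ 522.67, so roughly 523 MB - far below the 5G limit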

derjochenmeyer commented 5 months ago

Good question.

Running the script without 'article' => Dom::cssSelector('.article')->html() works fine. It only gets killed, at varying URLs but at roughly the same memory usage, with 'article' => Dom::cssSelector('.article')->html() in the result set, no matter whether the result is used for storage or not.

Is there a way to debug why exactly the script is killed?

However, I was still expecting the database to be filled while the crawler is running, not at the end of the run.

otsch commented 5 months ago

Depends on your setup. Are you just running a simple .php file from the command line, like php my-crawler.php? If yes, doesn't it print any error message when it dies? If not, you can try changing error_reporting() to E_ALL, setting an error handler, or registering a shutdown function.
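
A minimal sketch of that kind of debugging setup for a plain CLI script: surface all errors and dump the last error plus peak memory usage in a shutdown function (note that nothing will be printed if the operating system force-kills the process, since a shutdown function can't run on SIGKILL):

<?php

// Debugging sketch for a plain CLI run: report everything and log the last
// error plus peak memory usage when the script shuts down.
error_reporting(E_ALL);
ini_set('display_errors', '1');

register_shutdown_function(function (): void {
    $lastError = error_get_last();

    if ($lastError !== null) {
        fwrite(STDERR, 'Last error: ' . print_r($lastError, true));
    }

    fwrite(STDERR, 'Peak memory usage: ' . memory_get_peak_usage(true) . " bytes\n");
});

// ... build and run the crawler as before ...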

As mentioned, I'll investigate why the store isn't called earlier, but I won't have time for this before next week.

derjochenmeyer commented 5 months ago

It seems runAndTraverse() works as expected if inputs() is used ($crawler->inputs($urls)), passing an array of URLs from the sitemap. Using the input() approach in combination with Sitemap::getUrlsFromSitemap() (as in the code I shared above), the results are only written at the very end of the crawler run.
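
For reference, a minimal sketch of that workaround combined with the chunking idea from the original question. It assumes a flat <urlset> sitemap (no sitemap index), parses it with SimpleXML outside of the crawler, and runs a fresh MyCrawler per chunk of 500 URLs while reusing one MyStore so all chunks end up in the same table:

<?php

require_once 'vendor/autoload.php';
require_once 'MyCrawler.php';
require_once 'MyStore.php';

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$dbConfig = ['host' => 'H', 'dbname' => 'N', 'user' => 'U', 'password' => 'P'];

// Extract the page URLs from the sitemap (assumption: a flat <urlset> sitemap).
$sitemap = simplexml_load_string(file_get_contents('https://www.example.com/sitemap.xml'));

$urls = [];

foreach ($sitemap->url as $urlElement) {
    $urls[] = (string) $urlElement->loc;
}

// One store instance for all chunks, so every chunk writes to the same table.
$store = new MyStore($dbConfig);

// Process the URLs in chunks of 500, with a fresh crawler instance per chunk.
foreach (array_chunk($urls, 500) as $chunk) {
    $crawler = new MyCrawler();

    $crawler->setStore($store);

    $crawler->inputs($chunk);

    $crawler
        ->addStep(Http::get())
        ->addStep(
            Crawler::group()
                ->addStep(
                    Html::root()
                        ->extract([
                            'title' => 'h1.title',
                            'article' => Dom::cssSelector('.article')->html(),
                        ])
                )
                ->addToResult()
                ->addStep(
                    Html::metaData()
                        ->only(['keywords', 'author'])
                )
                ->addToResult()
        );

    $crawler->runAndTraverse();
}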

otsch commented 5 months ago

So, I now found the time to investigate and fix this. I just tagged v1.6.0, which contains a fix for your issue. With 1.6 your crawler should call the store early, no matter whether you're using $crawler->inputs() with an array of URLs from the sitemap, or $crawler->input() with the sitemap URL and Sitemap::getUrlsFromSitemap() as the first step. Thanks for the report.

Hope this helps @derjochenmeyer? Btw. starring the repo is appreciated 😅 and there's also the option to sponsor me 😅 https://github.com/sponsors/otsch