Closed derjochenmeyer closed 5 months ago
The library actually uses generators to work as memory-efficiently as possible, passing each response on to the following steps before loading the next page. One small caveat: at the end, it has to wait until all outputs that are children of one and the same previous output have been processed completely before storing the result. That's because all of those child outputs can contribute to the same single parent result object.
If you picture the whole data flow like the tree visualized at the bottom of this page, we can simplify this a bit more:
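To illustrate the streaming behavior described above, here is a plain-PHP sketch (not the library's actual internals, all names are made up): with chained generators, each page's data flows through every step before the next page is loaded.

```php
<?php

// Plain-PHP sketch (not the library's actual code) of a generator pipeline:
// each item flows through every step before the next item is produced.

function loadPages(): Generator
{
    foreach (['page-1', 'page-2', 'page-3'] as $page) {
        yield $page; // in the real crawler this would be an HTTP response
    }
}

function extractTitles(Generator $responses): Generator
{
    foreach ($responses as $response) {
        yield 'title of ' . $response; // stand-in for an extraction step
    }
}

foreach (extractTitles(loadPages()) as $result) {
    echo $result, PHP_EOL; // each result arrives before the next page loads
}
```

Because a generator only produces its next value when asked, memory usage stays flat no matter how many pages flow through the pipeline.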
Could you maybe share your actual crawler code with me? Maybe I can help you more specifically then. You can also share it privately if you want.
Thank you for visualizing.
And sure, I can share the code (not refactored, learning / work in progress).
Here is the script I run.
```php
<?php

ini_set('memory_limit', '5G');

require_once 'vendor/autoload.php';
require_once 'MyCrawler.php';
require_once 'MyStore.php';

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

MyCrawler::setMemoryLimit('5G');

$crawler = new MyCrawler();
$crawler->monitorMemoryUsage();

// TODO: Find a better place for dbConfig
$dbConfig = [
    'host' => 'H',
    'dbname' => 'N',
    'user' => 'U',
    'password' => 'P',
];

$crawler->setStore(new MyStore($dbConfig));

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap())
    ->addStep(Http::get())
    ->addStep(
        Crawler::group()
            ->addStep(
                Html::root()
                    ->extract([
                        'title' => 'h1.title',
                        'article' => Dom::cssSelector('.article')->html(),
                    ])
            )
            ->addToResult()
            ->addStep(
                Html::metaData()->only(['keywords', 'author'])
            )
            ->addToResult()
    );

$crawler->runAndTraverse();
```
This is what MyStore looks like:
```php
<?php

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class MyStore extends Store
{
    protected ?PDO $pdo = null;

    protected string $tableName;

    protected array $dbConfig;

    public function __construct(array $dbConfig)
    {
        $this->dbConfig = $dbConfig;
        $this->tableName = $this->generateTableName();
        $this->createTable();
    }

    protected function connect(): void
    {
        if ($this->pdo === null) {
            $this->pdo = new PDO(
                'mysql:host=' . $this->dbConfig['host'] . ';dbname=' . $this->dbConfig['dbname'],
                $this->dbConfig['user'],
                $this->dbConfig['password'],
                [
                    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
                    PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
                ]
            );
        }
    }

    protected function disconnect(): void
    {
        $this->pdo = null;
    }

    protected function generateTableName(): string
    {
        // Create a unique table name for every crawler run
        return 'articles_' . date('Ymd_His');
    }

    protected function createTable(): void
    {
        $this->connect();

        $query = "CREATE TABLE IF NOT EXISTS {$this->tableName} (
            id int NOT NULL AUTO_INCREMENT,
            title varchar(256) NOT NULL DEFAULT '',
            author varchar(64) NOT NULL DEFAULT '',
            keywords varchar(128) NOT NULL DEFAULT '',
            article text,
            PRIMARY KEY(id)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;";

        $this->pdo->exec($query);
        $this->disconnect();
    }

    public function store(Result $result): void
    {
        $this->connect();

        $query = "INSERT INTO {$this->tableName} (
            title,
            author,
            keywords,
            article
        ) VALUES (
            :title,
            :author,
            :keywords,
            :article
        )";

        // Prepare a statement for execution
        $statement = $this->pdo->prepare($query);
        $statement->bindValue(':title', $result->get('title'));
        $statement->bindValue(':author', $result->get('author'));
        $statement->bindValue(':keywords', $result->get('keywords'));
        $statement->bindValue(':article', $result->get('article'));
        $statement->execute();

        $this->disconnect();
        gc_collect_cycles();
    }
}
```
MyCrawler looks like this:
```php
<?php

require_once 'vendor/autoload.php';

use Crwlr\Crawler\Cache\FileCache;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Utils\Microseconds;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    public function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $cache = new FileCache(__DIR__ . '/filecache');
        $cache->ttl(43200); // cache responses for 12 hours

        $loader
            ->setCache($cache)
            ->retryCachedErrorResponses();

        $loader->throttle()
            ->waitBetween(
                Microseconds::fromSeconds(0.1),
                Microseconds::fromSeconds(0.2)
            );

        return $loader;
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent(
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0'
        );
    }
}
```
OK, I just ran this crawler on a news site sitemap containing about 740 articles. The memory usage started at around 6 MB and went up to 9 MB at the end. I can confirm that the store is never called until the end, and in the case of this crawler I would have guessed it should already be called while running. I'll investigate this, but I probably won't find the time to do it this week.
Until then: So, this means the site you're crawling must really be quite huge? Like hundreds of thousands of pages? Or is there maybe an issue with changing the memory limit in your environment? Sometimes PHP doesn't allow you to change the memory limit programmatically. Have you checked if setting the limit actually worked? Like:
```php
Crawler::setMemoryLimit('5G');
var_dump('Actual memory limit is: ' . Crawler::getMemoryLimit());
```
Thank you for the support!
The `var_dump` returns "Actual memory limit is: 5G".

Running the crawler, the script starts at a memory usage of 1578144 and is "Killed" somewhere above 2250 URLs (between 2250 and 2500 URLs) at a memory usage of 548062280. In that case MyStore creates the db table but fails to write any entries.
Pulling the documents from the FileCache, the crawler runs super fast and writes the entries to the db in about a minute using `->maxOutputs(2250)`, like this:
```php
$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap())
    ->addStep(Http::get()->maxOutputs(2250))
    ->addStep(
        Crawler::group()
        [...]
    );
```
So it's not a massive set of URLs. The HTML, however, sometimes contains some SVG and base64-encoded images.
🤔 But the current memory usage comes from PHP's `memory_get_usage()`, and it returns the value in bytes. If I'm not wrong, 548062280 is just a little above 500 MB. Does it really fail because it runs out of memory?
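As a quick sanity check (plain arithmetic, nothing library-specific), the reported byte count converts like this:

```php
<?php

// Convert the reported memory_get_usage() byte count to megabytes.
$bytes = 548062280;
echo round($bytes / (1024 ** 2), 1), " MB\n"; // roughly 522.7 MB
```

That is indeed only a little above 500 MB, far below the configured 5G limit.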
Good question.
Running the script without `'article' => Dom::cssSelector('.article')->html()` works fine. And with `'article' => Dom::cssSelector('.article')->html()` in the result set, it (only) gets killed at varying URLs at roughly the same memory usage, no matter if the result is used for storage or not.

Is there a way to debug why exactly the script is killed?
However, I was expecting the database to get filled while the crawler is running, not at the end of the run.
Depends on your setup. Are you just running a simple .php file from the command line, like `php my-crawler.php`? If yes, doesn't it print any error message when it dies? If not, you can try to change `error_reporting()` to `E_ALL`, set an error handler, or register a shutdown function.
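A sketch of that suggestion (the handler and messages are my own, not part of the library), to be placed at the top of the crawler script:

```php
<?php

// Sketch: surface fatal errors and peak memory when a CLI script dies.
// Names and messages are illustrative, not part of the crwlr library.

error_reporting(E_ALL);
ini_set('display_errors', '1');

register_shutdown_function(function (): void {
    $error = error_get_last();

    if ($error !== null) {
        fwrite(STDERR, sprintf(
            "Last error: %s in %s on line %d\n",
            $error['message'],
            $error['file'],
            $error['line']
        ));
    }

    fwrite(STDERR, 'Peak memory: ' . memory_get_peak_usage(true) . " bytes\n");
});
```

One caveat: if the process is terminated by the operating system's OOM killer with SIGKILL (the shell printing "Killed" is a typical hint), no PHP handler gets a chance to run, so `dmesg` or the system log may be the only place the reason shows up.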
As mentioned I'll investigate why the store isn't called earlier, but I won't have time for this before next week.
It seems `runAndTraverse()` works as expected if `inputs()` is used (`$crawler->inputs($urls)`), passing an array of URLs from the sitemap, while with the `input()` approach in combination with `Sitemap::getUrlsFromSitemap()` (as in the code I shared above), the results are written only at the very end of the crawler run.
So, I now found the time to investigate and fix this. I just tagged v1.6.0 containing a fix for your issue. With 1.6, your crawler should call the store early, no matter if you're using `$crawler->inputs()` with an array of URLs from the sitemap, or `$crawler->input()` with the sitemap URL and `Sitemap::getUrlsFromSitemap()` as the first step.
Thanks for the report.
Hope this helps @derjochenmeyer? Btw. starring the repo is appreciated 😅 and there's also the option to sponsor me 😅 https://github.com/sponsors/otsch
Is there a way to process the URLs from the sitemap in chunks of 500 URLs? With a large sitemap and a lot of HTML to extract, the script runs out of memory.
I was expecting that `runAndTraverse()` would store the results after fetching each URL, but the script runs and writes all the results after fetching all URLs.
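One possible workaround sketch (my own idea, not an official library feature): collect the sitemap URLs first, split them with `array_chunk()`, and run a fresh crawler per chunk. The helper below is plain PHP; the loop is commented out because it depends on the `MyCrawler`/`MyStore` classes from this thread and on `$allUrls` having been gathered beforehand.

```php
<?php

// Sketch: process sitemap URLs in chunks of 500. Assumes $allUrls has
// already been collected from the sitemap; the helper itself is plain PHP.

function urlChunks(array $urls, int $chunkSize = 500): array
{
    return array_chunk($urls, $chunkSize);
}

// Hypothetical usage with the classes from this thread:
// foreach (urlChunks($allUrls) as $chunk) {
//     $crawler = new MyCrawler();
//     $crawler->setStore(new MyStore($dbConfig));
//     $crawler->inputs($chunk)
//         ->addStep(Http::get());
//         // ...extraction steps as in the script above...
//     $crawler->runAndTraverse();
//     unset($crawler); // release the crawler between chunks
// }
```

Running a new crawler instance per chunk gives PHP a chance to free everything the previous chunk accumulated before the next one starts.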