Octopus is a small PHP tool that crawls the URLs listed in a sitemap, using the ReactPHP library to load the URLs asynchronously. Both plain text files and XML sitemaps are supported.
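To illustrate the two supported input formats, here is a minimal sketch that collects URLs from either an XML sitemap or a plain text file. It is not Octopus's actual parsing code: the function name extractUrls is hypothetical, and the plain text format is assumed to hold one URL per line.

```php
<?php
/**
 * Hypothetical helper, not part of Octopus: returns the URLs found in
 * either an XML sitemap or a plain text list.
 * Assumption: plain text input holds one URL per line.
 */
function extractUrls(string $content): array
{
    $content = trim($content);

    // Treat anything that starts with '<' as an XML sitemap.
    if ($content !== '' && $content[0] === '<') {
        $xml = simplexml_load_string($content);
        if ($xml === false) {
            return []; // Not well-formed XML
        }

        $urls = [];
        // Sitemap protocol layout: <urlset><url><loc>...</loc></url></urlset>
        // (sitemap index files are not handled in this sketch)
        foreach ($xml->url as $url) {
            $urls[] = trim((string) $url->loc);
        }

        return $urls;
    }

    // Plain text: split on any line ending and drop empty lines.
    $lines = preg_split('/\R/', $content);

    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// Usage with a plain text list
print_r(extractUrls("http://www.domain.ext/\nhttp://www.domain.ext/about\n"));
```

Checking the first character is a deliberately simple way to tell the formats apart; a real implementation might rely on the file extension or the response content type instead.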
Crawl the URLs in a sitemap with verbose logging (-vvv):
php application.php http://www.domain.ext/sitemap.xml -vvv
Use 15 concurrent connections instead of the default 5:
php application.php http://www.domain.ext/sitemap.xml --concurrency 15 -vvv
Use an HTTP GET request instead of the default HTTP HEAD. Note that HTTP HEAD requests transfer less data, since the response carries no body:
php application.php http://www.domain.ext/sitemap.xml --requestType GET -vvv
Use a timeout of 3 seconds instead of the default 10 seconds:
php application.php http://www.domain.ext/sitemap.xml --timeout 3 -vvv
Use a specific user agent instead of the default Octopus/1.0, for example to simulate a search engine crawling a sitemap:
php application.php http://www.domain.ext/sitemap.xml --userAgent 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' -vvv
Use the TablePresenter to display intermediate results instead of the default EchoPresenter:
php application.php http://www.domain.ext/sitemap.xml --presenter Octopus\\Presenter\\TablePresenter -vvv
You can easily integrate sitemap crawling into your own application; see the Config class for all available configuration options. If required, you can inject a PSR-3 logger for logging purposes.
use Octopus\Config;
use Octopus\Processor;
$config = new Config();
$config->concurrency = 2;
$config->targetFile = 'https://www.domain.ext/sitemap.xml';
$config->additionalResponseHeadersToCount = array(
'CF-Cache-Status', //Useful to check CloudFlare edge server cache status
);
$config->requestHeaders = array(
'User-Agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', //Simulate Google's webcrawler
);
$processor = new Processor($config, $this->logger); //A PSR3 Logger can be injected if required
$processor->run();
$this->logger->info('Statistics: ' . print_r($processor->result->getStatusCodes(), true));
$this->logger->info('Applied concurrency: ' . $config->concurrency);
$this->logger->info('Total amount of processed data: ' . $processor->result->getTotalData());
$this->logger->info('Failed to load #URLs: ' . count($processor->result->getBrokenUrls()));
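When embedding the crawler in your own application, you may want to reduce the collected statistics to a single pass/fail figure, for example in a deployment check. The sketch below is a hypothetical helper, not part of Octopus. It assumes, without confirmation from the library itself, that getStatusCodes() returns an associative array mapping each HTTP status code to the number of responses received with that code.

```php
<?php
/**
 * Hypothetical helper, not part of Octopus.
 * Assumption: $statusCodes maps HTTP status code => number of responses.
 */
function summarizeStatusCodes(array $statusCodes): array
{
    $total = array_sum($statusCodes);
    $failed = 0;

    foreach ($statusCodes as $code => $count) {
        // Count client (4xx) and server (5xx) errors as failures.
        if ((int) $code >= 400) {
            $failed += $count;
        }
    }

    return [
        'total' => $total,
        'failed' => $failed,
        'failureRate' => $total > 0 ? $failed / $total : 0.0,
    ];
}

// Usage with sample data; in an application the input would come from
// $processor->result->getStatusCodes(), if it has the assumed shape.
$summary = summarizeStatusCodes([200 => 95, 301 => 3, 404 => 2]);
printf(
    "%d of %d responses failed (%.1f%%)\n",
    $summary['failed'],
    $summary['total'],
    $summary['failureRate'] * 100
);
```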
Currently, Octopus is mainly an experimental / educational tool. Advanced use cases in HTTP response handling might not be supported.
Before running the test suite, clone this repository and install all dependencies using Composer:
$ composer install
Then go to the project root and run the tests:
$ make test