Try the RobotsTxtParser demo online on live domains.
Parsing is carried out according to the Google and Yandex robots.txt specifications.
Install the latest version with
composer require bopoda/robots-txt-parser
Run the PHPUnit tests with
php vendor/bin/phpunit
You can start the parser by fetching the contents of a robots.txt file from a website:
$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
var_dump($parser->getRules());
Or simply pass the contents of the file as input (i.e. when the content is already cached):
$parser = new RobotsTxtParser("
User-Agent: *
Disallow: /ajax
Disallow: /search
Clean-param: param1 /path/file.php
User-agent: Yahoo
Disallow: /
Host: example.com
Host: example2.com
");
var_dump($parser->getRules());
This will output:
array(2) {
  ["*"]=>
  array(3) {
    ["disallow"]=>
    array(2) {
      [0]=>
      string(5) "/ajax"
      [1]=>
      string(7) "/search"
    }
    ["clean-param"]=>
    array(1) {
      [0]=>
      string(21) "param1 /path/file.php"
    }
    ["host"]=>
    string(11) "example.com"
  }
  ["yahoo"]=>
  array(1) {
    ["disallow"]=>
    array(1) {
      [0]=>
      string(1) "/"
    }
  }
}
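Since getRules() returns a plain array keyed by (lowercased) user agent, as shown above, you can inspect a specific rule set directly. A minimal sketch based on that output:

$rules = $parser->getRules();
// Rules are keyed by user agent, e.g. "*" and "yahoo"
$disallowForAll = $rules['*']['disallow']; // ["/ajax", "/search"]
$hostForAll = $rules['*']['host'];         // "example.com"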
To validate a URL, use the RobotsTxtValidator class:
$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());
$url = '/';
$userAgent = 'MyAwesomeBot';
if ($validator->isUrlAllow($url, $userAgent)) {
// Crawl the site URL and do nice stuff
}
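For example, using inline rules like those in the earlier snippet, a path disallowed for the wildcard agent should be rejected. This is a minimal sketch; the expected results assume that an agent not listed explicitly (such as MyAwesomeBot) falls back to the * rules, per the robots.txt convention:

$parser = new RobotsTxtParser("
User-Agent: *
Disallow: /ajax
Disallow: /search
");
$validator = new RobotsTxtValidator($parser->getRules());

var_dump($validator->isUrlAllow('/ajax', 'MyAwesomeBot'));  // expected: bool(false)
var_dump($validator->isUrlAllow('/index', 'MyAwesomeBot')); // expected: bool(true)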
Feel free to create a PR in this repository. Please follow PSR coding style.
See the list of contributors who participated in this project.
Please use version 2.0+, which follows the same parsing rules but performs significantly better.