-
```
User-agent: *
Disallow: /
```
```
// With "Disallow: /" every path should be disallowed:
$this->assertTrue($parser->isDisallowed("&&1@|"));
$this->assertFalse($parser->isAllowed('+£€@@1¤'));
```
The two tests above fail; paths are allowed according to …
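For comparison, a quick check against Python's stdlib `urllib.robotparser`, which reports both paths as disallowed under the same rules (the stdlib parser is used here only as a reference behaviour, not as the parser under test):

```python
import urllib.robotparser

# Reference behaviour: with "Disallow: /" every path is off limits,
# no matter which characters it contains.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])
print(rp.can_fetch("*", "/&&1@|"))    # False
print(rp.can_fetch("*", "/+£€@@1¤"))  # False
```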
-
Reppy doesn't work past Python 3.8 (seomoz/reppy#122, seomoz/reppy#132), which means our robots.txt parser isn't working (#81).
Python 3.8 also reaches end-of-life next year so this needs to happen …
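In the meantime, a minimal sketch of what a dependency-free fallback could look like, using the stdlib's `urllib.robotparser` (this is just one candidate, not the chosen replacement; the `"mybot"` agent name is a placeholder):

```python
import urllib.robotparser

# Stdlib-only fallback: no reppy dependency, works on current Python versions.
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("mybot", "https://example.com/some/path"):
    ...  # fetch the page
```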
-
I had a robots.txt file to process that included the following line, which caused a fatal error:
```
User-agent:
```
No user agent was specified, and the robots.txt parser errored when checking a U…
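A hedged sketch of how directive parsing could tolerate an empty value instead of raising (`parse_directive` is a hypothetical helper, not the project's actual parser):

```python
def parse_directive(line):
    # Split on the first colon and strip inline comments; return None for
    # lines with no value (e.g. a bare "User-agent:") instead of raising.
    field, sep, value = line.partition(":")
    if not sep:
        return None
    field = field.strip().lower()
    value = value.split("#", 1)[0].strip()
    if not field or not value:
        return None
    return field, value

print(parse_directive("User-agent:"))        # None, no fatal error
print(parse_directive("User-agent: mybot"))  # ('user-agent', 'mybot')
```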
-
I found an example [https://booking.com/robots.txt](https://booking.com/robots.txt) where sitemaps are marked as **Disallowed**
```
Sitemap: https://www.booking.com/sitembk-index-https.xml
Use…
```
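One way to surface this kind of inconsistency would be to check each declared sitemap URL against the same file's rules. A minimal sketch with the stdlib parser (`site_maps()` needs Python 3.8+; checking against the generic `*` agent is an assumption):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://www.booking.com/robots.txt")
rp.read()
for sitemap in rp.site_maps() or []:
    # Flag sitemap URLs that the same robots.txt disallows for everyone.
    if not rp.can_fetch("*", sitemap):
        print("sitemap is itself disallowed:", sitemap)
```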
-
# Bug report
### Bug description:
https://github.com/python/cpython/blob/3.12/Lib/urllib/robotparser.py#L227
`self.path == "*"` will never be `true` because of this line:
https://github.com/python…
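A minimal reproduction of the dead branch, assuming the `== "*"` comparison was meant to support a bare wildcard rule:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: *"])
# The rule's path is percent-encoded to "%2A" before the comparison in
# applies_to(), so the `self.path == "*"` branch can never match and the
# wildcard rule is silently ignored:
print(rp.can_fetch("*", "https://example.com/anything"))  # True
```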
-
Hello,
are there any chances to split the current module into two separate modules? One for robots.txt and one for sitemap.xml?
I co-maintain the [crawler4j](https://github.com/yasserg/crawle…
-
See [https://yandex.com/support/webmaster/controlling-robot/robots-txt.xml#clean-param].
Not sure it is part of the standard spec, but it seems to be used; see for example [http://fishki.net/robots.txt].
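For reference, Yandex documents the syntax as `Clean-param: p0[&p1...] [path]`: the listed query parameters are insignificant for URLs under the given path prefix. A rough sketch of applying one such rule during URL normalization (`clean_url` is a hypothetical helper, not an existing API):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def clean_url(url, params, path_prefix="/"):
    # Drop the query parameters named in a Clean-param rule when the
    # URL's path falls under the rule's path prefix.
    parts = urlparse(url)
    if not parts.path.startswith(path_prefix):
        return url
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in params]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Clean-param: ref /some_dir/
print(clean_url("http://example.com/some_dir/page?ref=site&id=42",
                {"ref"}, "/some_dir/"))
# http://example.com/some_dir/page?id=42
```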
-
The [Crawl-Delay](http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive) directive in robots.txt looks useful. If it is present, the delay suggested there looks like a good way to ad…
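Python's stdlib parser already exposes this directive via `crawl_delay()`; a minimal sketch of pacing requests with it (the `"mybot"` agent name and the 1-second fallback are assumptions):

```python
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# Honour Crawl-delay if present, else fall back to a default politeness delay.
delay = rp.crawl_delay("mybot") or 1.0
time.sleep(delay)
```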
-
CommonCrawl has released a dataset containing robots.txt files: [http://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/].
This could be used to test our parsing code.
CC @sebastian-na…
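If anyone wants to try this, a rough sketch assuming the third-party `warcio` package and a locally downloaded archive from the dataset (the file name is a placeholder):

```python
from warcio.archiveiterator import ArchiveIterator

with open("robotstxt.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            body = record.content_stream().read()
            # feed `body` to the parser under test here
```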
-
It would be nice to output a clean and valid robots.txt.
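A hedged sketch of what such a serializer could look like (`render_robots` and its input shape are assumptions, not an existing API):

```python
def render_robots(groups, sitemaps=()):
    # groups: mapping of user-agent -> list of (directive, value) pairs
    lines = []
    for agent, rules in groups.items():
        lines.append(f"User-agent: {agent}")
        lines.extend(f"{directive}: {value}" for directive, value in rules)
        lines.append("")  # blank line between groups
    lines.extend(f"Sitemap: {url}" for url in sitemaps)
    return "\n".join(lines).strip() + "\n"

print(render_robots({"*": [("Disallow", "/private/")]},
                    ["https://example.com/sitemap.xml"]))
# User-agent: *
# Disallow: /private/
#
# Sitemap: https://example.com/sitemap.xml
```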