commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0
323 stars 35 forks

Crawl-delay in robots.txt should not shrink delay configured by fetcher.server.delay #24

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 6 years ago

The news crawler is configured to be polite, with a guaranteed fetch delay of a few seconds. However, some robots.txt files define a crawl-delay below one second, which then overrides the configured delay. The crawler-commons robots.txt parser would allow a delay as short as 1 ms; in practice I've seen a crawl-delay of 200 ms. To keep control, the longer configured delay should take precedence.

Note: Yandex's robots.txt specs allow fractional values for crawl-delay. Examples: bin.ua, vladnews.ru, gov.uk.
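The intended precedence rule can be sketched as follows. This is a minimal illustration, not the actual StormCrawler code; the class and method names are hypothetical, and the 5-second default is only an assumed value for `fetcher.server.delay`:

```java
public class FetchDelayPolicy {

    /**
     * Hypothetical helper illustrating the fix: a robots.txt crawl-delay
     * may lengthen, but never shorten, the configured fetcher.server.delay.
     * A negative robotsCrawlDelayMs means no crawl-delay was declared.
     */
    public static long effectiveDelayMs(long configuredDelayMs, long robotsCrawlDelayMs) {
        if (robotsCrawlDelayMs < 0) {
            // no Crawl-delay directive in robots.txt: keep the configured delay
            return configuredDelayMs;
        }
        // take the maximum so a short crawl-delay (e.g. 200 ms) cannot
        // undercut the politeness delay configured for the fetcher
        return Math.max(configuredDelayMs, robotsCrawlDelayMs);
    }

    public static void main(String[] args) {
        // robots.txt asks for 200 ms, config says 5 s: keep the 5 s delay
        System.out.println(effectiveDelayMs(5000, 200));
        // robots.txt asks for 30 s: honor the longer delay
        System.out.println(effectiveDelayMs(5000, 30000));
    }
}
```

With this rule, a crawl-delay of 200 ms (or even 1 ms) no longer shrinks the guaranteed delay, while sites requesting a longer delay are still respected.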

sebastian-nagel commented 5 years ago

Included with upgrade to StormCrawler 1.12.1 in 2e36397.