apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Fetcher: optionally slow down fetching from hosts with repeated exceptions #1106

Open jnioche opened 12 months ago

jnioche commented 12 months ago

See NUTCH-2946

For every fetch queue, the fetcher keeps a counter of the "exceptions" observed when fetching from the host (or domain or IP, respectively) bound to that queue.

As an improvement to increase the politeness of the crawler, the counter value could be used to dynamically increase the fetch delay for hosts where requests fail repeatedly with exceptions or with HTTP status codes mapped to ProtocolStatus.EXCEPTION (HTTP 403 Forbidden, 429 Too Many Requests, 5xx server errors, etc.). Of course, this should be optional. The aim is to reduce the load on such hosts before the configured maximum number of exceptions (property fetcher.max.exceptions.per.queue) is reached.
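
A minimal sketch of what such a backoff could look like, assuming an exponential policy; the method and parameter names below are hypothetical and not part of the existing FetcherBolt:

```java
// Hypothetical helper, not existing StormCrawler code: scale the per-queue
// fetch delay with the number of exceptions observed so far for that queue.
private long backoffDelay(long baseDelayMs, int exceptionCount, long maxDelayMs) {
    if (exceptionCount <= 0) {
        return baseDelayMs;
    }
    // double the delay for every exception seen, capped at maxDelayMs
    long delay = baseDelayMs << Math.min(exceptionCount, 16);
    return Math.min(delay, maxDelayMs);
}
```

The cap keeps a flaky host from blocking its queue indefinitely, and fetcher.max.exceptions.per.queue would still apply as the hard limit.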

jnioche commented 11 months ago

https://github.com/apache/nutch/pull/728

jnioche commented 9 months ago

Instead of delaying, which would increase latency, trigger timeouts and fail the tuples, it would be better to assume a fetch error for the URLs in the queue and push them straight to the status stream. An even better approach would be to have #867 and send data at the queue level, so that URLs from that queue are held back for a while. URLFrontier would be a good match for that.
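
As a rough sketch of that alternative, assuming the usual StormCrawler convention of emitting (url, metadata, status) on the status stream; the method and its argument are hypothetical:

```java
// Sketch only, not existing code. Package names assume the org.apache.stormcrawler
// namespace; Constants.StatusStreamName and Status.FETCH_ERROR are existing
// StormCrawler identifiers, while drainQueue and its argument are hypothetical.
import org.apache.storm.task.OutputCollector;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.stormcrawler.Constants;
import org.apache.stormcrawler.Metadata;
import org.apache.stormcrawler.persistence.Status;

import java.util.List;

// Inside the FetcherBolt: mark the remaining tuples of a failing queue as
// FETCH_ERROR and send them to the status stream instead of fetching them.
private void drainQueue(List<Tuple> queuedTuples, OutputCollector collector) {
    for (Tuple t : queuedTuples) {
        String url = t.getStringByField("url");
        Metadata metadata = (Metadata) t.getValueByField("metadata");
        collector.emit(Constants.StatusStreamName, t,
                new Values(url, metadata, Status.FETCH_ERROR));
        collector.ack(t);
    }
}
```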