apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Adapting rules for parsing robots.txt file #1042

Closed. michaeldinzinger closed this issue 1 year ago.

michaeldinzinger commented 1 year ago

Hello all, while crawling we ran into a politeness issue, and we suspect the cause was a connection timeout when trying to fetch the robots.txt. As a consequence, the other webpages for this host were apparently crawled without any restriction, just as if the robots.txt had returned a 404.

As far as I can see, the logic for parsing the robots.txt file is implemented as follows:

            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if ((code == 403) && (!allowForbidden)) {
                robotRules = FORBID_ALL_RULES; // use forbid all
            } else if (code >= 500) {
                cacheRule = false;
                robotRules = EMPTY_RULES;
            } else robotRules = EMPTY_RULES; // use default rules

(in HttpRobotRulesParser.java, lines 168-177)

A more suitable logic would be the one described here: https://support.google.com/webmasters/answer/9679690#robots_details, which differentiates between the error cases.

Please tell me your thoughts on this

rzo1 commented 1 year ago

Sounds valid to apply FORBID_ALL_RULES if we encounter "429 Too Many Requests" or an HTTP 5xx.

However, I can also think of use cases in which you would still want to apply EMPTY_RULES in such a case (or just stop being polite altogether) ;-)

Maybe we can adjust the default here and add a configuration option to just ignore it (similar to http.robots.403.allow)?

michaeldinzinger commented 1 year ago

Maybe we can adjust the default here and add a configuration option to just ignore it (similar to http.robots.403.allow)?

Thank you, sounds good :) Maybe something like http.robots.connectionerror.skip or http.robots.5xx.allow, defaulting to false. The code could then look like this:

            robotRules = FORBID_ALL_RULES; // forbid all by default
            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if (code == 403 && allowForbidden) {
                robotRules = EMPTY_RULES; // allow all
            } else if (code >= 500) {
                cacheRule = false; // don't cache the result for server errors
                if (allow5xx) {
                    robotRules = EMPTY_RULES; // allow all
                }
            }
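
To complement the snippet above, here is a minimal, self-contained sketch of how such a flag could be read from the topology configuration. The key name http.robots.5xx.allow comes from the proposal above; the helper class and method below are assumptions for illustration, not the actual HttpRobotRulesParser or StormCrawler configuration API.

    // Sketch only: hypothetical helper mirroring how a boolean option such as
    // http.robots.403.allow could be read from the configuration map.
    import java.util.Map;

    public class RobotsConfigSketch {

        static boolean getBoolean(Map<String, Object> conf, String key, boolean defaultValue) {
            Object value = conf.get(key);
            return (value instanceof Boolean) ? (Boolean) value : defaultValue;
        }

        public static void main(String[] args) {
            Map<String, Object> conf = Map.of("http.robots.403.allow", true);
            boolean allowForbidden = getBoolean(conf, "http.robots.403.allow", false);
            // proposed flag, defaults to false so a 5xx keeps meaning FORBID_ALL_RULES
            boolean allow5xx = getBoolean(conf, "http.robots.5xx.allow", false);
            System.out.println("allowForbidden=" + allowForbidden + ", allow5xx=" + allow5xx);
        }
    }
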
sebastian-nagel commented 1 year ago

The recently published RFC 9309 also requires treating a failure to fetch the robots.txt with an HTTP status of 500-599 as "complete disallow" (see the "Unreachable Status" section). However, if the 5xx status of the robots.txt is observed over a longer period of time, crawlers may assume that there is no robots.txt (i.e. EMPTY_RULES).
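
For illustration, a minimal self-contained sketch of that rule; the 30-day threshold and all names below are assumptions, not an existing StormCrawler or crawler-commons API.

    // Sketch: a robots.txt answering 5xx is treated as complete disallow, unless
    // the errors have already been observed for a longer period (assumed here to
    // be 30 days), in which case the crawler may assume there is no robots.txt.
    import java.time.Duration;
    import java.time.Instant;

    public class UnreachableRobotsSketch {
        static final Duration LONG_OBSERVATION = Duration.ofDays(30); // assumption

        /** true = forbid all, false = behave as if there were no robots.txt */
        static boolean forbidAll(int status, Instant firstErrorSeen, Instant now) {
            if (status < 500) {
                return false; // not the "unreachable" case
            }
            return Duration.between(firstErrorSeen, now).compareTo(LONG_OBSERVATION) < 0;
        }

        public static void main(String[] args) {
            Instant now = Instant.now();
            System.out.println(forbidAll(503, now.minus(Duration.ofDays(2)), now));  // true
            System.out.println(forbidAll(503, now.minus(Duration.ofDays(40)), now)); // false
        }
    }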

Nutch handles 5xx failures and, after a few retries to fetch the robots.txt, suspends crawling content from the given site. See NUTCH-2573 and apache/nutch#724. Since fetch queues are implemented similarly in Nutch and StormCrawler, this mechanism could be ported to StormCrawler. Eventually, it'd be better not to just drop the URLs/tuples but to update nextFetchDate (by adding 1 hour or 1 day) to avoid the spout releasing the same URLs again and again into the topology.
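
A rough sketch of that deferral idea; the names and the retry budget are hypothetical, not the Nutch or StormCrawler implementation.

    // Sketch: when a site's robots.txt keeps returning 5xx, postpone the affected
    // URLs by bumping their nextFetchDate instead of dropping the tuples, so the
    // spout stops re-emitting them immediately.
    import java.time.Instant;
    import java.time.temporal.ChronoUnit;

    public class RobotsDeferralSketch {
        static final int MAX_ROBOTS_RETRIES = 3; // assumption: Nutch-style retry cap
        static final long DEFER_HOURS = 24;      // "1 hour or 1 day" from the comment above

        /** Returns a postponed nextFetchDate once the retry budget is exhausted, null otherwise. */
        static Instant postpone(int robots5xxRetries, Instant now) {
            if (robots5xxRetries < MAX_ROBOTS_RETRIES) {
                return null; // keep retrying the robots.txt for now
            }
            return now.plus(DEFER_HOURS, ChronoUnit.HOURS);
        }

        public static void main(String[] args) {
            System.out.println(postpone(1, Instant.now())); // null: still retrying
            System.out.println(postpone(3, Instant.now())); // deferred by 24 hours
        }
    }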

jnioche commented 1 year ago

Thanks for this discussion people!

Eventually, it'd be better not to just drop the URLs/tuples but to update nextFetchDate (by adding 1 hour or 1 day) to avoid the spout releasing the same URLs again and again into the topology.

This could be done for an entire host with the mechanism suggested in #867. I have started working on it for the OpenSearch backend in branch 990, but it's still early days.

sebastian-nagel commented 1 year ago

the mechanism suggested in #867

Nice!

Just as a note: when running the Common Crawl crawls, temporarily suspending fetching from sites with a robots.txt 5xx HTTP status saved a lot of work responding to complaints from webmasters (sent automatically as abuse reports to AWS). This was in combination with a general slow-down (exponential backoff) on HTTP 5xx, 403 Forbidden and 429 Too Many Requests (see NUTCH-2946).
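
A minimal sketch of such an exponential backoff; the base delay and the cap are assumptions, NUTCH-2946 has the actual implementation.

    // Sketch: the per-host delay grows exponentially with the number of
    // consecutive error responses (5xx, 403, 429) and is capped at a maximum;
    // the counter would be reset once a fetch succeeds.
    public class BackoffSketch {
        static final long BASE_DELAY_MS = 5_000;     // assumption
        static final long MAX_DELAY_MS = 3_600_000;  // assumption: cap at 1 hour

        static long backoffDelay(int consecutiveErrors) {
            long delay = BASE_DELAY_MS << Math.min(consecutiveErrors, 16); // bounded shift
            return Math.min(delay, MAX_DELAY_MS);
        }

        public static void main(String[] args) {
            for (int errors = 0; errors <= 5; errors++) {
                System.out.println(errors + " error(s) -> " + backoffDelay(errors) + " ms");
            }
        }
    }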

michaeldinzinger commented 1 year ago

The recently published RFC 9309 also requires treating a failure to fetch the robots.txt with an HTTP status of 500-599 as "complete disallow" (see the "Unreachable Status" section). However, if the 5xx status of the robots.txt is observed over a longer period of time, crawlers may assume that there is no robots.txt (i.e. EMPTY_RULES).

Thank you, very interesting :) As far as I understand, a possible modification would be to adapt the current handling of HTTP 429 and 5xx and, concretely, set FORBID_ALL_RULES as the default instead of EMPTY_RULES. This is necessary to meet the requirements of the recently published RFC 9309.

A long-term solution would be to also add the parameters and the underlying mechanism (#867) to retry fetching the robots.txt a few times (in the case of HTTP 503 and maybe also 429) before settling for FORBID_ALL_RULES. Either way, a host would only be temporarily suspended from crawling, because StormCrawler will try to fetch the robots.txt again as soon as it is no longer in the error_cache. As an add-on for the long-term solution, RFC 9309 would even allow bypassing the suspension of a host caused by a 5xx error on the robots.txt after getting the same server error for e.g. 30 days, but I don't see how this could easily be implemented. So maybe it's better to just settle for the short-term solution for now?
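
As a sketch of that temporary suspension, assuming a simple TTL-based cache; the class and the TTL value below are hypothetical, not StormCrawler's actual robots/error cache.

    // Sketch: a FORBID_ALL decision caused by a 5xx is only cached for a limited
    // time, so the host is retried automatically once the entry expires.
    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ErrorCacheSketch {
        static final Duration ERROR_TTL = Duration.ofHours(1); // assumption

        private final Map<String, Instant> expiry = new ConcurrentHashMap<>();

        void markForbiddenOn5xx(String host) {
            expiry.put(host, Instant.now().plus(ERROR_TTL));
        }

        /** true while the cached 5xx-based FORBID_ALL entry is still valid */
        boolean isSuspended(String host) {
            Instant until = expiry.get(host);
            if (until == null) return false;
            if (Instant.now().isAfter(until)) {
                expiry.remove(host); // expired: allow a fresh robots.txt fetch
                return false;
            }
            return true;
        }

        public static void main(String[] args) {
            ErrorCacheSketch cache = new ErrorCacheSketch();
            cache.markForbiddenOn5xx("example.com");
            System.out.println(cache.isSuspended("example.com")); // true until the TTL expires
        }
    }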