Closed by jnioche 6 months ago
This should be done in (or pushed to) upstream Nutch, since we have no custom modifications in urlfilter-fast.
Not sure how to proceed with storing this in S3 and processing it with Nutch. The seeds_init script creates a file server/fast-urlfilter.d/fast-urlfilter-http-403-excludes.txt which is then merged into the single rule file at the beginning of the crawl.sh script (which calls bootstrap-nutch.sh, which in turn calls fast-urlfilter.sh). So the content of the single rule file does not consist exclusively of resources tracked in Git, and it is likely to change with every crawl.
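Roughly, the assembly amounts to something like this (a simplified sketch, not the actual script contents; the wildcard and the output path are illustrative):

```sh
# Sketch of what seeds_init + fast-urlfilter.sh produce together
# (file names and output path are illustrative).

# seeds_init writes the dynamic excludes:
#   server/fast-urlfilter.d/fast-urlfilter-http-403-excludes.txt

# fast-urlfilter.sh then folds everything under fast-urlfilter.d/
# (Git-tracked rules plus the generated excludes) into the single rule file:
cat server/fast-urlfilter.d/*.txt > conf/fast-urlfilter.txt
```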
We could get the fast-urlfilter.sh script to automatically push to the S3 bucket and overwrite whichever file is already there.
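A minimal sketch of that step, assuming the AWS CLI is available (bucket and key are placeholders):

```sh
# After regenerating the single rule file, replace the copy in S3;
# aws s3 cp overwrites the existing object by default.
aws s3 cp conf/fast-urlfilter.txt s3://<bucket>/fast-urlfilter.txt
```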
The problem is that if someone wanted to add a rule while the crawler is running, the generated file would not contain the HTTP 403 excludes. It also assumes that the same file is valid for the news crawl as well.
@sebastian-nagel I can't remember how Nutch works here: can you have more than one instance of a filter (I don't think you can)? If so, we could have another instance of the URL filter just for those 403 excludes, reading the resource file in the jar, while the existing fast URL filter relies on the content of S3.
No, there's exactly one instance of every plugin. One solution could be to enable filtering in Generator2 (remove the -nofilter flag) and use the extended rule set there, including the "dynamic" rules. In Fetcher we use the automatically updated rules from S3. If some URLs escape the dynamic rules via fetcher redirects, that wouldn't be a big deal. And if the generator rules are not entirely up to date, the URLs would still be skipped later, when the fetcher queues the fetch lists.
Sounds like a good plan. Rewriting nutch-site.xml between generation and fetching would not really be practical. Can we override the value of urlfilter.fast.file on the command line in crawl.sh and point it to the S3 location there?
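For illustration, a hypothetical crawl.sh excerpt of what that override could look like, assuming the Nutch jobs accept Hadoop -D property overrides via ToolRunner and that (after the change below) urlfilter-fast can read its rules from an s3a:// path; bucket, paths and job arguments are placeholders:

```sh
# Hypothetical sketch; bucket, paths and job arguments are placeholders.
RULES=s3a://<bucket>/fast-urlfilter.txt

# Generator2 with filtering enabled (the -nofilter flag removed),
# using the extended rule set that includes the dynamic 403 excludes:
bin/nutch org.apache.nutch.crawl.Generator2 \
  -D urlfilter.fast.file=$RULES "$CRAWLDB" "$SEGMENTS_DIR"

# Fetcher picks up the latest rules from S3 when each fetch job starts:
bin/nutch fetch \
  -D urlfilter.fast.file=$RULES "$SEGMENT" -threads "$NUM_THREADS"
```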
Implemented in NUTCH-3017, merged into the cc branch.
Thank you both for fixing this, it will allow us to quickly add nocrawl rules for complainers!
See https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/Generator2.java#L565 for an example of how to load the resource.
This will simplify the maintenance of the crawl: there is no more need to recompile the jar file to include changes to the filters. The most up-to-date version of the file will be picked up within at most 3 hours, when the next fetch job starts.
Another benefit is that the same filter file will be used by both the News crawl and the main one.
https://github.com/commoncrawl/news-crawl-production/pull/1