commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Nutch-compatible implementation of FastURLFilter + use it in PreFilterBolt #59

Closed by jnioche 9 months ago

jnioche commented 9 months ago

The FastURLFilter implementation currently in StormCrawler (SC) does not use the same rule file format as the one in Nutch. To keep the filtering logic as close as possible to Nutch when running in production at CommonCrawl, we will add a Nutch-compatible implementation that can also refresh its rules and load them from S3. By default, it will load the rules file from the resources in the jar.
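To illustrate the idea, here is a minimal sketch of a filter that reads host-scoped deny rules from a pluggable source and can be refreshed at runtime. It is not the implementation proposed in this issue: the class name `ReloadableHostFilter`, the `Supplier<Reader>` source, and the method names are all hypothetical, and only a simplified subset of Nutch-style rules (`Host` blocks followed by `DenyPath` regular expressions) is handled. Matching from the start of the path is a choice made for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import java.util.regex.Pattern;

// Hypothetical sketch: not the StormCrawler or Nutch class.
class ReloadableHostFilter {

    // Where the rules come from (jar resource, S3 object, local file, ...).
    private final Supplier<Reader> source;

    // Swapped atomically on refresh so readers never see a half-built map.
    private volatile Map<String, List<Pattern>> rules = Collections.emptyMap();

    ReloadableHostFilter(Supplier<Reader> source) {
        this.source = source;
        refresh();
    }

    // Re-reads the rules from the source and replaces the current set.
    void refresh() {
        Map<String, List<Pattern>> fresh = new HashMap<>();
        List<Pattern> current = null;
        try (BufferedReader in = new BufferedReader(source.get())) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                String[] parts = line.split("\\s+", 2);
                if (parts.length < 2) {
                    continue; // ignore malformed lines in this sketch
                }
                if (parts[0].equals("Host")) {
                    // Start (or continue) the rule block for this host.
                    current = fresh.computeIfAbsent(parts[1], k -> new ArrayList<>());
                } else if (parts[0].equals("DenyPath") && current != null) {
                    current.add(Pattern.compile(parts[1]));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        rules = fresh;
    }

    // Returns false if any deny rule for the host matches the start of the path.
    boolean accept(String host, String path) {
        List<Pattern> denied = rules.getOrDefault(host, Collections.emptyList());
        for (Pattern p : denied) {
            if (p.matcher(path).lookingAt()) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, the default behaviour described above would correspond to a supplier that opens a classpath resource through `Class.getResourceAsStream`, while an S3-backed supplier could be swapped in without touching the parsing or matching code. Calling `refresh()` periodically, for example from a scheduled task, would pick up rule changes without restarting the topology.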