ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Dupe spotter user-defined list of expressions / separation of default dupe spotter expressions #197

Open acrois opened 2 years ago

acrois commented 2 years ago

process_body(body, url) in dupespotter.py

As an end user, I would like to be able to modify the dupe spotter expression list and update it at runtime like other configuration options. I would also like to be able to split the defaults into more intentional sets for specific website types.

Right now the list is hard-coded in dupespotter.py. It may require more thought as to how to expose the list of expressions and keep it up to date (for example, writing to a file to update it).

What happens when the list is updated but an invalid expression is found? I think skipping the line and printing out an error should be sufficient.
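
To make the skip-and-report behavior concrete, here is a minimal sketch of loading user-supplied expressions one per line. It is not grab-site code: `load_expressions` and `strip_matches` are hypothetical names, the example expressions are made up, and it works on `str` for simplicity without checking what `process_body` actually receives.

```python
import re
import sys


def load_expressions(lines):
    """Compile one regular expression per non-blank line.

    Hypothetical helper, not part of grab-site. A line that fails to
    compile is reported on stderr and skipped; the rest are kept.
    """
    compiled = []
    for lineno, line in enumerate(lines, start=1):
        expr = line.strip()
        if not expr:
            continue  # ignore blank lines
        try:
            compiled.append(re.compile(expr))
        except re.error as exc:
            print(
                f"line {lineno}: skipping invalid expression {expr!r}: {exc}",
                file=sys.stderr,
            )
    return compiled


def strip_matches(body, expressions):
    """Remove every match of each compiled expression from body (a str)."""
    for pattern in expressions:
        body = pattern.sub("", body)
    return body


# Made-up expressions for illustration; the second one is deliberately invalid.
example_lines = [
    r"csrf_token=[0-9a-f]+",
    r"(unclosed group",
    r"generated in \d+\.\d+s",
]
expressions = load_expressions(example_lines)
cleaned = strip_matches("page generated in 0.12s csrf_token=ab12", expressions)
```

Because a bad line only costs that one expression, the same function could be called again whenever the file changes without risking the whole list.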

ivan commented 2 years ago

Yeah, it would be nice to be able to customize dupespotter. But because most users won't, it probably makes sense to fix it in grab-site itself for more of the websites that anyone would care to crawl.

Also note that it was written a while ago and the site-specific parts of it are probably out of date.