ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Bulk ignore handling #446

Open JustAnotherArchivist opened 4 years ago

JustAnotherArchivist commented 4 years ago

I just realised that a feature we've been talking about for years in #archivebot still isn't filed here: bulk ignore handling.

The issue at hand is that wpull is fairly slow at handling ignores, at least in the context of large ArchiveBot (AB) jobs with tens to hundreds of millions of URLs. For example, job 8ln624q16o9eghqd8rl6x7lq7 has processed only around 24 million URLs (virtually all ignored) in roughly 12 days. This is because every queue entry has to be checked out from the database, processed, and checked back in; the first and last steps each also involve an SQLite transaction and a sync to disk, which makes the whole process very inefficient. (See also ArchiveTeam/wpull#427.)

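A more efficient solution would be to run a database query directly, such as UPDATE queued_urls SET status = 'skipped' WHERE <url matches ignores> AND status IN ('todo', 'error'). That way, all matching entries are marked as skipped in one pass instead of being checked out and back in one at a time.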
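As a rough illustration of the query above, here is a minimal sketch using Python's sqlite3 module. It is not wpull's actual code: the table layout (a queued_urls table with plain url and status columns), the helper name bulk_ignore, and the SQL function name matches_ignores are all assumptions made for the example, and wpull's real schema may well differ. The <url matches ignores> condition is implemented by registering a Python function with SQLite, and patterns are assumed to match anywhere in the URL.

```python
import re
import sqlite3


def bulk_ignore(conn, ignore_patterns):
    """Mark all pending URLs matching any ignore pattern as skipped.

    Hypothetical helper: assumes a table queued_urls(url, status).
    Returns the number of rows that were updated.
    """
    compiled = [re.compile(p) for p in ignore_patterns]

    def matches_ignores(url):
        # Assumption: a pattern may match anywhere in the URL (re.search).
        return int(any(p.search(url) for p in compiled))

    # Expose the Python matcher to SQL as a one-argument function.
    conn.create_function("matches_ignores", 1, matches_ignores)

    # One transaction for the whole batch instead of one per queue entry.
    with conn:
        cur = conn.execute(
            "UPDATE queued_urls SET status = 'skipped' "
            "WHERE status IN ('todo', 'error') AND matches_ignores(url)"
        )
    return cur.rowcount


def make_demo_db():
    """Build a small in-memory queue for trying out bulk_ignore."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE queued_urls (url TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO queued_urls VALUES (?, ?)",
        [
            ("http://example.com/page", "todo"),
            ("http://example.com/calendar?d=1", "todo"),
            ("http://example.com/calendar?d=2", "error"),
            ("http://example.com/calendar?d=3", "done"),
        ],
    )
    conn.commit()
    return conn
```

Calling bulk_ignore(make_demo_db(), [r"/calendar\?"]) would update only the two calendar URLs that are still in todo or error; the one already marked done is left alone. The matcher still runs once per pending row, but the per-entry transactions and disk syncs described above are replaced by a single commit.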

Advantages:

Challenges:

Notes: