ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Regarding Always applied igsets and Global igsets #568

Closed: Flashfire42 closed this issue 7 months ago

Flashfire42 commented 7 months ago

Would it be wise to roll the badvideos igset and some of the blogs igset into the global ignores? They are the two most commonly applied igsets, and surely some of their patterns could be moved into the global igset.

JustAnotherArchivist commented 7 months ago

Every ignore pattern adds computational overhead because it needs to be checked for every URL. So the global igset should generally be kept as small as possible (and I think could use some cleanup).
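To illustrate the overhead argument, here is a minimal sketch that counts how many regex checks a URL list costs under a small and a larger pattern set. It is not ArchiveBot's actual matching code; the pattern lists, `is_ignored`, and `total_checks` are hypothetical names made up for this example.

```python
import re

# Hypothetical patterns for illustration only; these are NOT ArchiveBot's real igsets.
GLOBAL_PATTERNS = [r"/login\?", r"[?&]sessionid="]
EXTRA_PATTERNS = [r"/video/embed/", r"[?&]replytocom=", r"/feed/$"]


def compile_all(patterns):
    """Compile each pattern string once up front."""
    return [re.compile(p) for p in patterns]


def is_ignored(url, compiled):
    """Return (ignored, checks), where checks is the number of regex searches run."""
    checks = 0
    for rx in compiled:
        checks += 1
        if rx.search(url):
            return True, checks
    return False, checks


def total_checks(urls, compiled):
    """Sum the regex searches needed to classify every URL."""
    return sum(is_ignored(u, compiled)[1] for u in urls)


# URLs that match none of the patterns, so every pattern is tried for each one.
urls = ["https://example.com/page/%d" % i for i in range(1000)]
small = compile_all(GLOBAL_PATTERNS)
large = compile_all(GLOBAL_PATTERNS + EXTRA_PATTERNS)

print(total_checks(urls, small))  # 2 patterns x 1000 URLs
print(total_checks(urls, large))  # 5 patterns x 1000 URLs
```

In this sketch, a URL that matches nothing pays for every pattern in the set, so the cost for non-matching URLs grows linearly with the number of patterns.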

Some quick statistics for the first 9 months of this year from my log files: 73237 jobs, 38872 (53.1 %) of them recursive. The top ignore sets are 10698 (14.6 %) badvideos, 8572 (11.7 %) blogs, 3616 (4.9 %) notweets, 1996 (2.7 %) forums.

In other words, over 85 % of jobs never get igsets beyond global. Some of them might retrieve URLs that would get ignored, but still, the vast majority would just be slowed down by adding these igsets.
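The percentages in the statistics above can be reproduced from the raw counts. This is only a sketch of the arithmetic; the counts are copied from the comment, and the `share` helper is a hypothetical name.

```python
# Counts quoted in the comment above (first 9 months of the year).
TOTAL_JOBS = 73237
COUNTS = {
    "recursive": 38872,
    "badvideos": 10698,
    "blogs": 8572,
    "notweets": 3616,
    "forums": 1996,
}


def share(count, total=TOTAL_JOBS):
    """Percentage of all jobs, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(n) for name, n in COUNTS.items()}
for name, pct in shares.items():
    print(f"{name}: {COUNTS[name]} jobs ({pct} %)")
```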

This is not a fix for your lazy job submission practice.