ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Create new ignoreset for excluding any kind of non-visible tracking code or analytics code #197

Open Asparagirl opened 8 years ago

Asparagirl commented 8 years ago

Crawls of big websites can be slowed down by the sheer amount of tracking and analytics code crud on every page. It would be nice to have an optional ignoreset we can call that just ignores all of it, since it's not usually visible anyway. So this ignoreset would be kind of like Ghostery or uBlock Origin, but for ArchiveBot.

For example:

Place to look for more examples:

Asparagirl commented 8 years ago

Add in an EU cookie pop-up remover, too?

https://github.com/r4vi/block-the-eu-cookie-shit-list/blob/master/filterlist.txt

Asparagirl commented 8 years ago

Oh, and the various AddThis and ShareThis buttons on websites. We're already blocking some of them, but there are new ones popping up we don't screen for yet. Will update this when I find some more concrete examples.

hannahwhy commented 8 years ago

I'm not sure about this.

  1. This sort of stuff is a lot of requests, and it slows down grabs. But that's a problem that can be fixed by speeding up ArchiveBot; see e.g. #182.
  2. We have been grabbing ads, tracking cookies, etc. because it's all part of the page. (It's also fun-depressing to watch a page bloat through the Wayback Machine, and there is something satisfying about click fraud. Assuming ad networks are silly enough to count ArchiveBot crawls as impressions.)
  3. #182 etc. aren't likely to happen for a while, and I guess there is not really much harm in having these sorts of things in a notracker ignore set (I would definitely not want them in global). I wouldn't want notracker to become a reflex, though.
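
To make the idea of an opt-in notracker ignore set concrete, here is a minimal Python sketch of how such a set could work: a list of URL regexes that is applied only when a job asks for it. The patterns, the `NOTRACKER_PATTERNS` name, and the helper functions are all hypothetical and chosen for illustration; this is not ArchiveBot's actual ignore-set format or matching code.

```python
import re

# Hypothetical patterns, for illustration only. A real notracker set would
# need to be curated from actual crawls and existing blocklists.
NOTRACKER_PATTERNS = [
    r"^https?://(www\.)?google-analytics\.com/",
    r"^https?://([^/]+\.)?addthis\.com/",
    r"^https?://([^/]+\.)?sharethis\.com/",
]


def compile_ignoreset(patterns):
    """Compile each regex string once so it can be reused for every URL."""
    return [re.compile(p) for p in patterns]


def should_ignore(url, compiled):
    """Return True if any pattern in the ignore set matches the URL."""
    return any(rx.search(url) for rx in compiled)


def filter_urls(urls, compiled):
    """Keep only the URLs that no pattern in the ignore set matches."""
    return [u for u in urls if not should_ignore(u, compiled)]


notracker = compile_ignoreset(NOTRACKER_PATTERNS)
```

Keeping the set separate from any always-on list means a job only skips these URLs when the set is explicitly applied, which matches the concern above about not putting them in global.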