ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Block certain cookies or cookie values #447

Open JustAnotherArchivist opened 4 years ago

JustAnotherArchivist commented 4 years ago

Some cookies or cookie values have a negative effect on archival. For example, many classic forum software packages let the user choose between different view modes (linear, threaded, hybrid), styles, or languages, but to get a representative archive, we'd only want the default presentation. These preferences are usually stored in cookies (not in the session information, but in actual separate cookies). There should be a way to block certain cookies entirely (i.e. they're never stored or sent back on later requests) or to prevent certain cookie values from being set (i.e. if a server tries to set the cookie to such a value, that's ignored).

The most flexible solution would be to have pairs of a name pattern and a value pattern; if both match a cookie sent by the server, that cookie gets ignored. For cookies we want to ignore entirely, the value pattern could then just be ^ or an empty pattern (which could also be optimised to bypass the regex engine entirely, of course), but the same mechanism would also allow for pretty much any restriction on the values.
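A minimal sketch of the name/value pattern pairs described above, assuming Python's `re` module with search semantics (so `^` matches any value). The names `compile_rules`, `is_blocked`, and `filter_cookies` are hypothetical and not part of ArchiveBot or wpull; the `lang` cookie is an invented example, while `bb_threadedmode` is the cookie named in this issue.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical types for illustration only; nothing here is ArchiveBot's API.
Rule = Tuple[Pattern[str], Optional[Pattern[str]]]
Cookie = Tuple[str, str]


def compile_rules(raw_rules: List[Tuple[str, str]]) -> List[Rule]:
    """Compile (name pattern, value pattern) pairs.

    An empty value pattern is stored as None so that the regex engine
    can be skipped for cookies that are blocked regardless of value.
    """
    compiled = []
    for name_pattern, value_pattern in raw_rules:
        value_re = re.compile(value_pattern) if value_pattern else None
        compiled.append((re.compile(name_pattern), value_re))
    return compiled


def is_blocked(name: str, value: str, rules: List[Rule]) -> bool:
    """Return True if any rule matches both the cookie name and its value."""
    for name_re, value_re in rules:
        if name_re.search(name) and (value_re is None or value_re.search(value)):
            return True
    return False


def filter_cookies(cookies: List[Cookie], rules: List[Rule]) -> List[Cookie]:
    """Drop every server-sent cookie that matches a block rule."""
    return [(n, v) for n, v in cookies if not is_blocked(n, v, rules)]


rules = compile_rules([
    (r"^bb_threadedmode$", ""),   # block this cookie entirely
    (r"^lang$", r"^(?!en$)"),     # assumed example: block any value except "en"
])

received = [("bb_threadedmode", "1"), ("lang", "de"), ("sessionid", "abc")]
kept = filter_cookies(received, rules)
```

With these rules, only the `sessionid` cookie would be kept from `received`; a `lang` cookie set to `en` would pass through, while any other `lang` value would be ignored.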

The block list would be stored on the control node and retrieved by or pushed to the pipeline when a job is launched, similar to URL ignores, except that it would not change while the job is running.

Example: bb_threadedmode (e.g. job f0i5kb7nl4ltumlaj2wrnptrk)

manu-cyber commented 3 months ago

Another example: vBulletin (e.g. job 86ox20zlr2p59av0w6sau3zzu)