ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Enable HTTP compression #520

Open JustAnotherArchivist opened 2 years ago

JustAnotherArchivist commented 2 years ago

AB currently doesn't make use of wpull's --http-compression option, so it doesn't send an Accept-Encoding header. Occasionally, there are websites which hate that. For example, https://www.cresta-awards.com/ sends an empty response body when compression isn't enabled, and https://www-ssrl.slac.stanford.edu/~swebb/ simply kills the connection.
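To illustrate what the missing header changes on the wire, here is a minimal Python sketch using only the standard library. It is not how wpull or ArchiveBot implement this; `build_request` and `decode_body` are hypothetical helper names made up for this example. It builds a request that advertises gzip and deflate support, and undoes whichever Content-Encoding the server then reports.

```python
import gzip
import urllib.request
import zlib


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that advertises gzip and deflate support."""
    req = urllib.request.Request(url)
    req.add_header("Accept-Encoding", "gzip, deflate")
    return req


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the Content-Encoding the server applied, if any."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # "deflate" is nominally a zlib-wrapped stream.
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the body untouched.
    return body
```

With a live connection, the response body and its Content-Encoding header would be passed to `decode_body`; sending the same request without the Accept-Encoding header is what triggers the empty body or dropped connection on sites like the two above.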

Since browsers send Accept-Encoding: gzip, deflate (and these days usually br for Brotli as well) on essentially all requests, it should probably be safe to enable this globally. It might cause a very small increase in WARC size, because web servers are unlikely to always compress data at the highest compression level (as wpull does when writing WARCs), and working with compressed data inside compressed WARCs is slightly annoying, but those are minor downsides.
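The size concern can be checked with a rough sketch like the one below. It assumes the server gzips at level 6 (a common default, but an assumption here) and models the WARC writer as a second gzip pass at level 9; `warc_sizes` is a hypothetical helper, and real WARC records also carry headers that this ignores.

```python
import gzip


def warc_sizes(payload: bytes, server_level: int = 6) -> tuple[int, int]:
    """Return (plain, precompressed) sizes after the WARC-level gzip pass.

    plain:         body transferred uncompressed, then gzipped at level 9.
    precompressed: body gzipped by the server at server_level (assumed),
                   then gzipped again at level 9 when written to the WARC.
    """
    plain = gzip.compress(payload, compresslevel=9)
    transferred = gzip.compress(payload, compresslevel=server_level)
    precompressed = gzip.compress(transferred, compresslevel=9)
    return len(plain), len(precompressed)


if __name__ == "__main__":
    sample = b"<p>hello archive</p>\n" * 5000
    plain_size, precompressed_size = warc_sizes(sample)
    print(f"uncompressed transfer in WARC: {plain_size} bytes")
    print(f"compressed transfer in WARC:   {precompressed_size} bytes")
```

Running this on representative page bodies gives a feel for how large the overhead is in practice; the numbers depend heavily on the content and on the server's actual compression settings.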