ArchiveTeam / terroroftinytown

URLTeam's second generation of URL shortener archiving tools
http://urlte.am
MIT License
69 stars, 15 forks

Also create WARC files, besides the XZ files for the URL shorteners, to make the archived URL shorteners available through the Wayback Machine #1

Open Arkiver2 opened 9 years ago

Arkiver2 commented 9 years ago

It would be very useful if warc.gz files were also made for the URL shorteners we are archiving. People are probably more likely to look up a (shortened) URL in the Wayback Machine than to search through the .xz files for the shortener they want.

whs commented 9 years ago

Rough design idea:

Which WARC library should I use? IA's warc seems to be incompatible with Python 3.

chfoo commented 9 years ago

If you want to record to WARC files easily, you'll need an agent that supports recording HTTP traffic accurately to WARC files. Example agents include Heritrix, Wget, and Wpull, but these are web crawlers.

If you can get the raw HTTP requests and responses from Python Requests, then you can try to build a WARC file yourself. I wrote a WARC library called Warcat, which supports Python 3. I also wrote Wpull, which runs under Python 3, and maybe you can take code from it.
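
To make the "build a WARC file yourself" option concrete, here is a minimal sketch that uses only the Python 3 standard library. It is not the Warcat or Wpull API. The function names `build_warc_record` and `to_warc_gz` and the example URLs are made up for illustration, and the sketch assumes you already hold the raw HTTP message as bytes. It writes only bare `request`/`response` records, so a real archive would likely also want a `warcinfo` record and the matching request record.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(warc_type, target_uri, http_bytes, date=None):
    """Wrap raw HTTP bytes in one uncompressed WARC/1.0 record (hypothetical helper)."""
    if warc_type not in ("request", "response"):
        raise ValueError("warc_type must be 'request' or 'response'")
    # WARC-Date is expressed in UTC with a trailing "Z".
    date = (date or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", "<urn:uuid:{}>".format(uuid.uuid4())),
        ("WARC-Date", date.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http;msgtype={}".format(warc_type)),
        # Content-Length counts only the record block (the HTTP message).
        ("Content-Length", str(len(http_bytes))),
    ]
    head = "WARC/1.0\r\n" + "".join(
        "{}: {}\r\n".format(name, value) for name, value in headers
    )
    # Header, blank line, block, then two CRLFs that terminate the record.
    return head.encode("utf-8") + b"\r\n" + http_bytes + b"\r\n\r\n"


def to_warc_gz(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


# Example: a redirect as a URL shortener might return it (made-up URLs).
response = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Location: http://long.example/some/page\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
record = build_warc_record("response", "http://short.example/abc", response)
archive = to_warc_gz([record])
```

Writing `archive` to a file opened in binary mode would give a `.warc.gz`. Compressing one record per gzip member is the usual convention, because it lets readers seek to a single record without decompressing the whole file.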