Open andrewferguson opened 8 years ago
Ah nice! I wasn't aware of WARC.
I'm not entirely sure it's appropriate, as right now I'm only recording the messages and not the entire HTML contents. I could modify the crawler to save the entire HTTP response for each page, but I found it more interesting to just extract the messages and re-present them in a nicer format.
Perhaps I can simply create a different output (`dump_warc` instead of `dump_site`), where the contents of the WARC would be my rendered pages. It won't be a faithful reproduction of the site, but it will contain all the messages. Or would it be more appropriate to just upload all the JSON into WARC? Thoughts?
Links for my own reference:
Uploading all of the JSON from the API would be my preference, as this would be a faithful representation of the pages accessed.
Also, I'm not sure if you saw the conversation earlier today on the ArchiveTeam IRC, but PurpleSymphony (who has archived quite a large number of groups over the last few months) suggested that MongoDB may not be suitable for storing group data at that scale due to its lack of compression. So it may be useful to have an option to output WARCs for those who will be archiving a large number of groups.
I'm not sure if the Python library you listed has support for downloading web pages as WARCs (it can read from and write to existing WARC files, but I'm not so sure about creating new ones). Wget can, but I understand if you don't want an external dependency on it.
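For what it's worth, creating a record from scratch doesn't strictly need a library: a WARC/1.0 record is just a small header block plus the payload. Here's a minimal sketch using only the standard library, which could write each JSON API response as a WARC `resource` record (the URL and payload are placeholders, not from the actual crawler):

```python
# Minimal sketch: append one uncompressed WARC/1.0 "resource" record
# to a binary stream, using only the standard library. The header names
# are real WARC fields; the example URI/payload are made up.
import io
import uuid
from datetime import datetime, timezone

def write_warc_record(out, target_uri, payload, content_type="application/json"):
    """Write a single WARC resource record holding `payload` (a str)."""
    body = payload.encode("utf-8")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(body))),
    ]
    out.write(b"WARC/1.0\r\n")
    for name, value in headers:
        out.write(("%s: %s\r\n" % (name, value)).encode("utf-8"))
    out.write(b"\r\n")       # end of header block
    out.write(body)
    out.write(b"\r\n\r\n")   # record terminator

buf = io.BytesIO()
write_warc_record(buf, "https://groups.example.com/api/message/1",
                  '{"id": 1, "text": "hello"}')
record = buf.getvalue()
```

In practice you'd gzip each record individually (`.warc.gz` is a series of gzip members, one per record), but the structure above is the whole format.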
The addition of WARC support would be really useful for archiving purposes, where Groups data is to be added to an archiving tool such as the Wayback Machine. This could be achieved fairly easily through an external call to wget, or there are several Python libraries that support WARC.
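For the external-call route, wget can already write a WARC alongside its normal mirror via `--warc-file` (a command-line fragment for illustration; the URL and header value are placeholders):

```shell
# Crawl a group and record every request/response pair as WARC records.
# --warc-file takes a prefix; wget appends .warc.gz (gzipped by default).
wget --recursive --level=2 \
     --page-requisites \
     --warc-file=group-archive \
     --warc-header="operator: example" \
     "https://groups.example.com/group/some-group/"
```

The resulting `group-archive.warc.gz` is directly ingestible by Wayback-style replay tools.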