ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Consider an option to generate WACZ files after a crawl is done for better replay with ReplayWeb.page #179

Open ikreymer opened 3 years ago

ikreymer commented 3 years ago

It would be helpful for folks using grab-site and then replaying via replayweb.page to have grab-site generate a WACZ file after the crawl is done. (This workflow is mentioned in webrecorder/replayweb.page#6)

WACZ (https://github.com/webrecorder/wacz-format) provides a way to package the WARC, CDX and an optional page list into a single file (a zip file) such that it can be loaded quickly for replay.
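Since a WACZ is an ordinary zip file, its contents can be inspected with standard tools. The following is only a rough sketch using Python's `zipfile` module. The directory names in the demo archive (`archive/`, `indexes/`, `pages/`, plus a top-level `datapackage.json`) are an assumption about the layout and should be checked against the spec linked above; `summarize_wacz` and `build_demo_wacz` are hypothetical helper names.

```python
import io
import zipfile
from collections import defaultdict


def summarize_wacz(source):
    """Group the member names of a WACZ (a plain zip file) by top-level directory.

    `source` may be a path or a file-like object. Members that sit at the
    root of the zip are grouped under ".".
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(source) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, sep, _rest = name.partition("/")
            groups[top if sep else "."].append(name)
    return dict(groups)


def build_demo_wacz():
    """Build a tiny in-memory zip that mimics the ASSUMED WACZ layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("archive/data.warc.gz", b"")   # the WARC data
        zf.writestr("indexes/index.cdx.gz", b"")   # the CDX index
        zf.writestr("pages/pages.jsonl", b"")      # the optional page list
        zf.writestr("datapackage.json", b"{}")     # package metadata
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for folder, members in summarize_wacz(build_demo_wacz()).items():
        print(folder, members)
```

Pointing `summarize_wacz` at a real `.wacz` path should show whether the WARC, index and page list all made it into the package.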

The Python wacz library (https://pypi.org/project/wacz) can be used to create the WACZ package (source: https://github.com/webrecorder/wacz-format/tree/main/py-wacz).

I think grab-site should just be able to call the create command from: https://github.com/webrecorder/wacz-format/blob/main/py-wacz/wacz/main.py#L19

It might make sense to pass in a page list, and there is an experimental option to do full-text extraction on pages as well.
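As a rough illustration of what a post-crawl step might look like, here is a minimal sketch that shells out to the `wacz` command-line tool rather than importing the library. Several things here are assumptions, not confirmed behaviour: the flags `-o` (output), `-p` (page list) and `-t` (text extraction) should be verified with `wacz create --help` for the installed version, the crawl directory is assumed to contain `*.warc.gz` files, and `build_wacz_command` and `package_crawl` are hypothetical names that do not exist in grab-site.

```python
import subprocess
from pathlib import Path


def build_wacz_command(warc_paths, output, pages=None, extract_text=False):
    """Build the argv for `wacz create`.

    ASSUMPTION: the flag names (-o, -p, -t) and positional WARC inputs
    match the installed wacz version; check `wacz create --help`.
    """
    cmd = ["wacz", "create", "-o", str(output)]
    if pages is not None:
        cmd += ["-p", str(pages)]      # optional page list
    if extract_text:
        cmd.append("-t")               # experimental full-text extraction
    cmd += [str(p) for p in warc_paths]
    return cmd


def package_crawl(crawl_dir, output, pages=None, extract_text=False):
    """Package every *.warc.gz in a finished crawl directory into one WACZ."""
    warcs = sorted(Path(crawl_dir).glob("*.warc.gz"))
    if not warcs:
        raise FileNotFoundError(f"no WARC files found in {crawl_dir}")
    # check=True raises CalledProcessError if wacz exits with an error.
    subprocess.run(
        build_wacz_command(warcs, output, pages, extract_text), check=True
    )
    return Path(output)
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to test the arguments without having `wacz` installed, and to adjust the flags if the CLI changes.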

The library is still new, so we can definitely make any changes needed to support integration!

ivan commented 3 years ago

grab-site currently doesn't really have anyone developing it (I just try to keep the install steps working), but I have no objections to the addition of WACZ support.