Consider an option to generate WACZ files after a crawl is done for better replay with ReplayWeb.page

It would be helpful for folks using grab-site and then replaying via replayweb.page to have grab-site generate a WACZ file after the crawl is done. (This workflow is mentioned in webrecorder/replayweb.page#6)

WACZ (https://github.com/webrecorder/wacz-format) provides a way to package the WARC, CDX and an optional page list into a single file (a zip file) such that it can be loaded quickly for replay.
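
To illustrate the packaging described above, here is a minimal sketch that lists what is inside a WACZ using only the Python standard library. It relies on a WACZ being an ordinary zip file. The function name `summarize_wacz` is hypothetical, and the directory names mentioned in the comments (`archive/`, `indexes/`, `pages/`) reflect my reading of the format spec rather than anything grab-site produces today.

```python
import zipfile
from collections import defaultdict


def summarize_wacz(path_or_file):
    """Group the members of a WACZ (zip) file by top-level directory.

    Hypothetical helper for illustration only. A typical WACZ is expected
    to hold WARCs under archive/, the CDX index under indexes/, and the
    optional page list under pages/ (assumption based on the spec).
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(path_or_file) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, sep, _rest = name.partition("/")
            # Files at the root of the zip are grouped under "."
            groups[top if sep else "."].append(name)
    return dict(groups)
```

Calling `summarize_wacz("crawl.wacz")` on a package (the filename is just an example) returns a dict keyed by top-level directory, which is a quick way to check that the WARC, index and page list all ended up in the file.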

The Python wacz library (https://pypi.org/project/wacz) can be used to create the WACZ package. Its source lives at https://github.com/webrecorder/wacz-format/tree/main/py-wacz.

I think grab-site should just be able to call the create command from: https://github.com/webrecorder/wacz-format/blob/main/py-wacz/wacz/main.py#L19

It might make sense to pass in a page list, and there is an experimental option to do full-text extraction on pages as well.
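
As a rough sketch of what a post-crawl step could look like, the helper below only builds the argument list for the `wacz create` command line tool; it does not run anything. The function name `build_wacz_command` and its parameters are hypothetical. The flags (`-o` for the output file, `-p` for a page list, `-t` for text extraction) are what I believe the CLI accepts, but the library is new and they may differ between versions, so please check `wacz create --help` first.

```python
def build_wacz_command(warc_paths, output, pages=None, text=False):
    """Build an argv list for `wacz create` (hypothetical helper).

    Assumed flags, to be verified against the installed wacz version:
      -o  output .wacz path
      -p  optional page list file
      -t  enable the experimental full-text extraction
    """
    cmd = ["wacz", "create", "-o", str(output)]
    if pages is not None:
        cmd += ["-p", str(pages)]
    if text:
        cmd.append("-t")
    # The WARC files to package are passed as positional arguments.
    cmd += [str(p) for p in warc_paths]
    return cmd
```

grab-site could then hand the result to `subprocess.run(cmd, check=True)` once the crawl has finished and the WARCs are closed. Shelling out keeps the integration loose; calling the Python function directly would avoid the extra process but ties grab-site to the library's internal API.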

The library is still new, so we can definitely make any changes needed to support integration!
