bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

browsertrix crawler #63

Closed edsu closed 1 year ago

edsu commented 1 year ago

The browsertrix-crawler utility is a browser-based crawler that can crawl one or more pages. browsertrix-crawler creates archives in the WACZ format, which is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc.). A WACZ can be replayed using the ReplayWeb.page web component, or unzipped to get the original WARC data (the ISO standard format used by the Internet Archive Wayback Machine).
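Since a WACZ is just a ZIP, the standard library is enough to peek inside one. A small sketch (the `demo.wacz` file and its contents are fabricated here purely to demonstrate; per the WACZ spec, the WARC records live under `archive/` alongside a `datapackage.json` manifest):

```python
import zipfile

def list_warcs(wacz_path):
    """List the WARC files packed inside a WACZ (which is a plain ZIP)."""
    with zipfile.ZipFile(wacz_path) as z:
        return [n for n in z.namelist()
                if n.startswith("archive/") and n.endswith(".warc.gz")]

# Build a minimal stand-in WACZ just for this demo.
with zipfile.ZipFile("demo.wacz", "w") as z:
    z.writestr("archive/data.warc.gz", b"")
    z.writestr("datapackage.json", "{}")

print(list_warcs("demo.wacz"))  # ['archive/data.warc.gz']
```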

This PR adds browsertrix-crawler to the archiver classes that already take screenshots. The WACZ is uploaded to storage and then added to a new WACZ column in the spreadsheet. A ReplayWebPage column is also added, which displays the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.), in the client-side ReplayWeb.page application. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not installed, an error message is logged and archiving continues as normal.
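The "log and carry on" behaviour could be sketched like this (a minimal illustration, not the PR's actual code; the function name is made up, and it only checks that the `docker` CLI is on PATH):

```python
import logging
import shutil

def docker_available() -> bool:
    """Return True if the docker CLI can be found on PATH."""
    return shutil.which("docker") is not None

# If Docker is missing, log the problem and skip the WACZ step
# rather than failing the whole archive run.
if not docker_available():
    logging.warning("docker not found; skipping browsertrix-crawler capture")
```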

If you would like browsertrix-crawler to use a browser profile to access logged-in web pages (useful for platforms like Instagram), you can add a browsertrix / profile stanza to your config.yaml file.
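Something along these lines (the exact key names and the profile path are illustrative guesses, not taken from the PR; browsertrix-crawler profiles are tarballs created with its `create-login-profile` command):

```yaml
# config.yaml (hypothetical sketch of the browsertrix / profile stanza)
browsertrix:
  profile: secrets/instagram-profile.tar.gz
```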

djhmateer commented 1 year ago

Wow - this looks exciting! I look forward to trying this out. Thanks @edsu. I'm a contributor to this project too :-)