bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

wacz file system fix #151

Open djhmateer opened 1 month ago

djhmateer commented 1 month ago

I run the auto-archiver from source, using docker for the wacz_enricher.

I've found that if I have 2 consecutive items to archive, the second one will throw an exception as soon as any filesystem call is made from Python after the first wacz_enricher run,

e.g. when a Telethon archiver is called (as it reads a .session file).

# on the second item any filesystem call throws an exception, e.g. this fails with "can't find file"
os.getcwd()
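As a hedged illustration only: one well-known way for `os.getcwd()` to raise `FileNotFoundError` is for the process's current working directory to be removed while the process is still inside it. Whether that is what happens after the first wacz_enricher run is an assumption, not something confirmed in this report. The helper name `cwd_is_valid` and the scratch directory below are hypothetical.

```python
import os
import tempfile


def cwd_is_valid() -> bool:
    """Return True if the current working directory can still be resolved."""
    try:
        os.getcwd()
        return True
    except FileNotFoundError:
        return False


# Reproduce the symptom with a throwaway directory (not the project's tmp dir).
original = os.getcwd()
scratch = tempfile.mkdtemp()
os.chdir(scratch)
os.rmdir(scratch)            # remove the directory the process is standing in
broken = not cwd_is_valid()  # os.getcwd() can no longer resolve the directory
os.chdir(original)           # restore a valid working directory
```

A check like this before the second item is processed could help confirm or rule out that explanation.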

My solution is to give docker a volume directory to write to that sits outside the directory the python script is called from.

https://github.com/djhmateer/auto-archiver/blob/836fbd7733d46ea14fa9615fbda691ad6234f1f6/src/auto_archiver/enrichers/wacz_enricher.py#L105

# old way
# eg /home/dave/auto-archiver/tmpa22nvh69
tmp_dir = ArchivingContext.get_tmp_dir()

# new tmp directory
linux_tmp_dir = '/home/dave/aatmp'

so the docker command it runs changes as follows:

# old way
docker run --rm -v /home/dave/auto-archiver/tmpa22nvh69:/crawls/ webrecorder/browsertrix-crawler crawl --url https://t.me/baznews9/10690 --scopeType page --generateWACZ --text --screenshot fullPage --collection e4422338 --id e4422338 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 200 --timeout 200 --postLoadDelay 20 --profile /crawls/profile.tar.gz

# new way
docker run --rm -v /home/dave/aatmp:/crawls/ webrecorder/browsertrix-crawler crawl --url https://t.me/baznews9/10690 --scopeType page --generateWACZ --text --screenshot fullPage --collection e4422338 --id e4422338 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 200 --timeout 200 --postLoadDelay 20 --profile /crawls/profile.tar.gz

This allows me to run the wacz_enricher on all archived links.
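A minimal sketch of the workaround in Python, under stated assumptions: the function name `build_crawl_command`, the `tmp_base` parameter, and the per-item subdirectory created with `tempfile.mkdtemp` are hypothetical and not taken from the linked commit, which uses a fixed `/home/dave/aatmp` path. Only a subset of the crawler flags from the commands above is included, and the command is built but not executed.

```python
import os
import tempfile


def build_crawl_command(url: str, collection: str, tmp_base: str):
    """Build a docker command whose /crawls/ volume lives under tmp_base.

    tmp_base is meant to be a directory outside the one the Python script
    is called from (e.g. /home/dave/aatmp in the report above).
    """
    os.makedirs(tmp_base, exist_ok=True)
    # Assumption: one fresh subdirectory per item, so runs do not collide.
    crawl_dir = tempfile.mkdtemp(dir=tmp_base)
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{crawl_dir}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--scopeType", "page",
        "--generateWACZ",
        "--collection", collection,
        "--id", collection,
    ]
    return crawl_dir, cmd


# Demo base directory under the system temp dir (hypothetical location).
base = os.path.join(tempfile.gettempdir(), "aatmp_demo")
crawl_dir, cmd = build_crawl_command("https://t.me/baznews9/10690", "e4422338", base)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine with docker installed. Keeping the volume directory away from the script's own directory mirrors the change described above.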