bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License
489 stars 53 forks

Add Browsertrix support when using docker image #66

Closed msramalho closed 1 year ago

msramalho commented 1 year ago

Issue: browsertrix-crawler is executed via docker (docker run ...) and it uses volumes to

  1. pass the profile.tar.gz file
  2. save the results of its execution

If the auto-archiver is itself running inside docker, we have a docker-in-docker situation, which can be problematic. One workaround is to share the host machine's docker daemon with the auto-archiver container via /var/run/docker.sock, for example:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v $PWD/secrets:/app/secrets -e SHARED_PATH=$PWD/secrets/crawls aa --config secrets/config-docker.yaml

However, doing this means that the -v volumes passed to docker run -v ... browsertrix-crawler are resolved against the host filesystem (and not the filesystem of the docker container running the archiver). The profile file and the results of extraction are therefore host-path dependent, which adds a layer of complexity.
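To make the host-path dependency concrete, here is a minimal Python sketch of the translation the archiver would have to do before launching the crawler through the shared daemon. It is only an illustration: the helper names (to_host_path, crawler_volume_args) are hypothetical, and it assumes that /app/secrets/crawls inside the container corresponds to the host directory given in SHARED_PATH, as in the command above.

```python
import os
from pathlib import PurePosixPath
from typing import List


def to_host_path(container_path: str, container_root: str, host_root: str) -> str:
    """Map a path inside the archiver container to the matching host path.

    Hypothetical helper: assumes container_root is a bind mount of host_root.
    Raises ValueError if container_path is not located under container_root.
    """
    relative = PurePosixPath(container_path).relative_to(container_root)
    return str(PurePosixPath(host_root) / relative)


def crawler_volume_args(container_dir: str, container_root: str, host_root: str) -> List[str]:
    """Build the -v argument for the crawler container.

    The docker daemon runs on the host, so the left-hand side of the
    volume mapping has to be a host path, not a path seen by the archiver.
    """
    host_dir = to_host_path(container_dir, container_root, host_root)
    return ["-v", f"{host_dir}:/crawls/"]


if __name__ == "__main__":
    # SHARED_PATH is the variable from the docker run example; the fallback is made up.
    host_root = os.environ.get("SHARED_PATH", "/home/user/secrets/crawls")
    print(crawler_volume_args("/app/secrets/crawls/job-1", "/app/secrets/crawls", host_root))
```

Every path handed to the crawler needs this mapping, which is the extra complexity described above.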

Additionally, using /var/run/docker.sock is generally undesirable for security, as it gives a lot of permissions to the code in the container.

Challenge: can we find a secure and easy-to-use setup (both via docker and outside docker) for browsertrix-crawler? Would that be a docker-compose with 2 services communicating? A new service that responds to browsertrix-crawl requests?

edsu commented 1 year ago

Is there a Docker setup for auto-archiver? One option might be to modify that so that browsertrix-crawler is installed into that container? Then auto-archiver would need to know whether to run it directly or via a docker command?
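A rough Python sketch of the "run it directly or via a docker command" decision could look like the following. It is an assumption-laden illustration, not auto-archiver's actual code: build_crawl_command is a hypothetical name, the check relies on the crawl executable being on PATH when browsertrix-crawler is installed in the same container, and the flags shown are the ones used in browsertrix-crawler's basic usage examples.

```python
import shutil
from typing import Callable, List, Optional


def build_crawl_command(
    url: str,
    collection: str,
    host_crawls_dir: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the command used to crawl `url` (hypothetical helper).

    If a `crawl` executable is on PATH (e.g. browsertrix-crawler is installed
    in the same image), call it directly. Otherwise fall back to `docker run`,
    which needs a host directory to mount at /crawls/.
    """
    crawl_args = ["crawl", "--url", url, "--generateWACZ", "--collection", collection]
    if which("crawl"):
        return crawl_args
    if host_crawls_dir is None:
        raise ValueError("host_crawls_dir is required when running via docker")
    return [
        "docker", "run", "--rm",
        "-v", f"{host_crawls_dir}:/crawls/",
        "webrecorder/browsertrix-crawler",
    ] + crawl_args


if __name__ == "__main__":
    # Prints whichever variant applies on this machine.
    print(build_crawl_command("https://example.com", "demo", host_crawls_dir="/tmp/crawls"))
```

The which parameter is injectable only so that both branches can be exercised without docker or the crawler installed.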

djhmateer commented 1 year ago

If we do go for a Docker version of auto-archiver (is it there already :-)?), this shell script server build setup may help: https://github.com/djhmateer/auto-archiver/blob/main/infra/server-build.sh

I see you are using similar dependencies! https://github.com/webrecorder/browsertrix-browser-base/blob/main/Dockerfile

I've been tripped up in the past by edge cases and dependencies in Docker, e.g. we need to test FFMPEG and the Gecko driver. I love Docker, but it's not a panacea. Here be dragons etc. Memory usage / CPU usage etc. I'm running auto-archiver on an 8GB VM very well (4GB is fine for 99% of the work).

Good idea on installing/layering browsertrix-crawler into an auto-archiver container!

edsu commented 1 year ago

Yeah, I'm not necessarily arguing in favor of running auto-archiver in Docker (although it would simplify installation?). I was just responding to @msramalho saying:

If the auto-archiver is running inside docker, we have a docker-in-docker situation ...

Thanks for the shell script @djhmateer!

msramalho commented 1 year ago

I just published a WIP branch called dockerize with the work towards that, but I still want to do some refactoring. Essentially, you can do docker build . -t auto-archiver to build a local image and docker run --rm -v $PWD/secrets:/app/secrets aa --config secrets/config.yaml to run it. Do note that I moved all the configuration files and session files into /secrets.

My next steps involve making the code structure more manageable, so there might be heavy refactoring, but the core challenge of this issue remains. I think the approach @edsu mentioned of installing BC in the docker image is the way to go. Happy to hear more on how to achieve that (would a straight-up copy-paste of BC's Dockerfile work?).

msramalho commented 1 year ago

and yes, the goal here is to simplify installation and make it easier to deploy/maintain

edsu commented 1 year ago

@msramalho one approach here, instead of copy/pasting the BC Dockerfile, would be to extend the image and layer the auto-archiver into it?

So that would mean changing this line to something like:

FROM webrecorder/browsertrix-crawler:latest

and use the Python environment that BC also needs to run webrecorder/pywb? I wonder if @ikreymer has any advice here?

msramalho commented 1 year ago

Interesting idea, definitely worth testing. I'll wait for Ilya's input too, wondering on possible conflicts or limitations of extending the image :thinking:

loganwilliams commented 1 year ago

Fixed in #74.