Closed msramalho closed 1 year ago
Is there a Docker setup for auto-archiver? One option might be to modify that so that browsertrix-crawler is installed into that container? Then auto-archiver would need to know whether to run it directly or via a docker command?
Yeah, I'm not necessarily arguing in favor of running auto-archiver in Docker (although it would simplify installation?) I was just responding to @msramalho saying:
If the auto-archiver is running inside docker, we have a docker-in-docker situation ...
Thanks for the shell script @djhmateer!
I just published a WIP branch called `dockerize` with the work towards that, but I still want to do some refactoring. Essentially, you can do `docker build . -t auto-archiver` to build a local image and `docker run --rm -v $PWD/secrets:/app/secrets aa --config secrets/config.yaml` to run it. Do note that I moved all the configuration files and session files into `/secrets`.
My next steps involve making the code structure more manageable, so there might be heavy refactoring, but the core challenge of this issue remains. I think the approach @edsu mentioned of installing browsertrix-crawler (BC) in the docker image is the way to go; happy to hear more on how to achieve that (would a straight-up copy-paste of BC's Dockerfile work?). And yes, the goal here is to simplify installation and make it easier to deploy/maintain.
@msramalho one approach here, instead of copy/pasting the BC Dockerfile, would be to extend the image and layer the auto-archiver into it?
So that would mean changing this line to something like:

```dockerfile
FROM webrecorder/browsertrix-crawler:latest
```

and using the Python environment that BC also needs to run webrecorder/pywb? I wonder if @ikreymer has any advice here?
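A minimal sketch of what extending the image could look like (untested; the `requirements.txt` layout and the entrypoint module name are assumptions, and this presumes the base image's Python environment includes pip):

```dockerfile
# Build on top of the crawler image so browsertrix-crawler, its browser,
# and the Python environment used by pywb are already present.
FROM webrecorder/browsertrix-crawler:latest

WORKDIR /app

# Layer auto-archiver's Python dependencies on top of the base image.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the auto-archiver source itself.
COPY . .

# Hypothetical entrypoint; the actual module/CLI name may differ.
ENTRYPOINT ["python", "-m", "auto_archiver"]
```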
Interesting idea, definitely worth testing. I'll wait for Ilya's input too, wondering on possible conflicts or limitations of extending the image :thinking:
Fixed in #74.
Issue: browsertrix-crawler is executed via docker (`docker run ...`) and it uses volumes to share the profile and the results of extraction.

If the auto-archiver is running inside docker, we have a docker-in-docker situation and that can be nefarious. One workaround is to share the daemon of the host machine with the auto-archiver docker container by mounting `/var/run/docker.sock`. However, doing this means that the `-v` volumes passed when doing `docker run -v ... browsertrix-crawler` will only be shared with the host (and not the docker container running the archiver), meaning the profile file and the results of extraction are host-path dependent, which adds a layer of complexity. Additionally, using `/var/run/docker.sock` is generally undesirable for security, as it gives a lot of permissions to the code in the container.

Challenge: can we find a secure and easy-to-use setup (both via docker and outside docker) for browsertrix-crawler? Would that be a docker-compose with 2 services communicating? A new service that responds to browsertrix-crawl requests?