DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Spike: Determine work necessary to get rid of Docker in Docker #4915

Open stxue1 opened 6 months ago

stxue1 commented 6 months ago

Docker in Docker was originally set up for Mesos, but most batch systems nowadays are not as compatible with our current Docker-in-Docker configuration; there shouldn't be much reason to run the workflow inside a container that is itself nested inside another container. This is causing issues for toil-wdl-runner and toil-cwl-runner whenever they try to run an image: if the appliance container isn't given the right permissions, the runners will fail. For example, Funnel wants to run all Docker images as read-only with no permissions granted for user namespaces, which means Singularity cannot run. I think this will cause more issues in the future with batch system plugins that have specific requirements for running Docker containers. The whole idea of DinD also seems to be generally discouraged/frowned upon.

@adamnovak How hard do you think it would be to get rid of DinD, and do you think it would be a good idea to implement this?

Issue is synchronized with this Jira Story. Issue Number: TOIL-1562

adamnovak commented 5 months ago

We didn't set up Docker in Docker for Mesos, as far as I know. On Mesos we use the Docker client in the Toil container, but the actual invoked container runs as a sibling, on the host's Docker daemon. We make sure that paths like /var/lib/toil and /tmp are mounted into the Toil container from the host at the same paths they have on the host, so that when Toil runs docker run -v /tmp/somefile:/some/mount/point, the /tmp/somefile it sees from inside its own container is the same one that the Docker daemon it is talking to will see and mount.
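
To make the sibling-container pattern concrete, here's a minimal sketch using the docker Python SDK (docker-py) that Toil already depends on. It assumes the Toil appliance was started with the host's /var/run/docker.sock, /var/lib/toil, and /tmp bind-mounted at the same paths; the image name and file paths below are purely illustrative.

```python
import docker

# docker.from_env() talks to whatever daemon the environment points at; inside
# the appliance that is the host's daemon, reached through the bind-mounted
# /var/run/docker.sock.
client = docker.from_env()

# Because /tmp inside the appliance is the host's /tmp (mounted at the same
# path), this path means the same thing to us and to the host's daemon.
shared_path = "/tmp/somefile"  # illustrative

# The started container is a sibling of the appliance, not a child inside it.
logs = client.containers.run(
    "ubuntu:22.04",                                    # illustrative task image
    ["cat", "/some/mount/point"],
    volumes={shared_path: {"bind": "/some/mount/point", "mode": "ro"}},
    remove=True,
)
print(logs.decode())
```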

One approach to getting rid of Docker-in-Docker would be to use this sibling-container approach in places other than Mesos. That would be fairly straightforward where it can work, but it can only work when the Toil container can be given direct access to the host's Docker daemon, and it depends on having these paths mounted from the host into the Toil container. I don't think many Funnel setups will let you run a TES task with direct access to the host's Docker daemon. It also makes resource accounting and limits difficult: it's no use limiting the memory usage of the Toil container if all the memory is actually used by a sibling Docker container that it started.

Another approach to getting rid of Docker-in-Docker would be to get rid of the Toil container entirely, and only run the container that the workflow is asking to run. This is significantly more challenging; instead of being able to distribute a Toil docker container with all Toil's dependencies installed, we'd need to find a way to inject enough Toil machinery to e.g. download inputs from and upload outputs to a Google job store into basically arbitrary Docker containers. Right now we inject some stuff into containers via the command to be run, as does MiniWDL. But it would be challenging to, for example, install Python 3.12 and Toil itself into a (read-only, on TES) Docker container just by tacking stuff on to the front of a Bash script.
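
To illustrate what "injecting stuff via the command to be run" looks like when pushed to its limit, here's a hedged sketch of wrapping a task's command in a bootstrap prefix. Everything in it is hypothetical (the bundle URL, paths, and helper names), and it presumes the image has a shell, network access, and somewhere writable, which is exactly what we can't rely on in a read-only TES container.

```python
# Hypothetical sketch: wrap a task's command so a Toil bootstrap step runs
# first. This only works if the container image has a POSIX shell, somewhere
# writable, and a way to download things -- none of which is guaranteed.
import shlex

BOOTSTRAP = (
    # Hypothetical self-contained Toil bundle; producing and fetching a whole
    # portable Python + Toil install like this is exactly the hard part.
    "mkdir -p /tmp/.toil-bootstrap && "
    "curl -fsSL https://example.org/toil-portable.tar.gz "  # placeholder URL
    "| tar -xz -C /tmp/.toil-bootstrap && "
    "export PATH=/tmp/.toil-bootstrap/bin:$PATH"
)

def wrap_task_command(task_command: str) -> list:
    """Return a command list that bootstraps Toil before the real task command."""
    return ["/bin/sh", "-c", BOOTSTRAP + " && " + task_command]

print(wrap_task_command("toil-run-job " + shlex.quote("job-id-123")))  # hypothetical entry point
```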

So we'd probably need to do some batch-system-specific work to get a portable installation of Toil, along with its Python binary and any libraries it links against, mounted into the task's container. For Kubernetes we might need something like a sidecar/initialization container in the pod to set this up in an empty directory that's then mounted into the actual job container, or else a way to interface with Kubernetes's storage subsystem to let us show files to containers at the Kubernetes level. For TES we'd somehow need to get all this stuff into TES's storage and then mount it. (Can TES's storage even mount directories?)
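
For the Kubernetes side, one possible shape, sketched here with the kubernetes Python client that the Kubernetes batch system already uses, is an init container that stages a portable Toil install into an emptyDir volume which the task container then mounts. The images, paths, and commands below are placeholders, not anything Toil ships today.

```python
# Sketch of a pod where an init container stages a portable Toil install into
# an emptyDir volume, and the task container mounts it read-only.
from kubernetes import client as k8s

toil_volume = k8s.V1Volume(
    name="toil-install",
    empty_dir=k8s.V1EmptyDirVolumeSource(),
)

init_container = k8s.V1Container(
    name="stage-toil",
    image="quay.io/example/toil-bootstrap:latest",     # hypothetical staging image
    command=["/bin/sh", "-c", "cp -r /opt/toil/. /staged/"],
    volume_mounts=[k8s.V1VolumeMount(name="toil-install", mount_path="/staged")],
)

task_container = k8s.V1Container(
    name="user-task",
    image="ubuntu:22.04",                               # the workflow's own image
    command=["/toil/bin/python3", "/toil/run_job.py"],  # hypothetical entry point
    volume_mounts=[k8s.V1VolumeMount(name="toil-install", mount_path="/toil", read_only=True)],
)

pod = k8s.V1Pod(
    metadata=k8s.V1ObjectMeta(name="toil-task-example"),
    spec=k8s.V1PodSpec(
        restart_policy="Never",
        init_containers=[init_container],
        containers=[task_container],
        volumes=[toil_volume],
    ),
)
```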

Another approach might be to use Docker primitives to get Toil into the task's container: fetch the task's image from its Docker registry, create a new image manifest that is based on it but adds a layer containing Toil itself, and then actually run that on Kubernetes or TES or whatever. Then the problem becomes doing the fancy Docker image surgery, and managing to publish the result as something the Docker daemon can fetch. Either the workflow would need live access to something like a Quay account to upload to, or the leader would have to run a Docker registry at a sufficiently public location that the TES server (or whatever) could fetch from it.
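
As a rough sketch of the image-surgery idea (using a plain Docker build rather than direct manifest manipulation), here's what deriving and publishing a "task image plus Toil layer" could look like with docker-py. The registry, tag, and the existence of a portable toil-portable/ bundle are all assumptions.

```python
import io
import tarfile

import docker

client = docker.from_env()

task_image = "ubuntu:22.04"                              # the image the workflow asked for
repo = "registry.example.org/toil/task-with-toil"        # hypothetical registry the cluster can pull from
tag = "abc123"

# A derived image that layers a (hypothetical) portable Toil bundle on top of
# the task's own image.
dockerfile = f"""
FROM {task_image}
COPY toil-portable/ /opt/toil/
ENV PATH=/opt/toil/bin:$PATH
""".encode()

# Build context as an in-memory tarball: the Dockerfile plus the assumed
# toil-portable/ directory sitting next to this script.
context = io.BytesIO()
with tarfile.open(fileobj=context, mode="w") as tar:
    info = tarfile.TarInfo("Dockerfile")
    info.size = len(dockerfile)
    tar.addfile(info, io.BytesIO(dockerfile))
    tar.add("toil-portable", arcname="toil-portable")    # assumed to exist locally
context.seek(0)

image, _build_logs = client.images.build(fileobj=context, custom_context=True, tag=f"{repo}:{tag}")

# Publishing is the hard part: this needs credentials, and the registry has to
# be reachable from wherever the TES server or Kubernetes nodes pull images.
client.images.push(repo, tag=tag)
```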

I can see this working as some combination of a small script injected into the task container and something like https://pypi.org/project/cx-Freeze/, to let us bootstrap from that script, plus the ability to mount a single file, up to a working Toil install. I can also see us having to commit some crimes, like trying to inject a static ELF binary of an HTTPS client via bash -c, because some container we need to work in uses musl instead of glibc and we can't rely on being able to write Toil anywhere that some arbitrary TES server can read.

I'm not sure there's going to be less magic involved if we manage to do this. But it's definitely worth exploring; maybe it will turn out to be easy actually.

adamnovak commented 5 months ago

For TES, this is basically our version of Snakemake's https://github.com/snakemake/snakemake-executor-plugin-tes/issues/3. And we'd have to solve a lot of the same problems they do, with respect to using idiomatic TES where we can versus getting our own machinery in there to do things ourselves where we can't.