WATonomous / run-gha-on-slurm

Run GitHub Actions on Slurm

Run docker in docker #1

Closed: alexboden closed this issue 2 months ago

alexboden commented 3 months ago

Since CI jobs need access to docker and the runner is itself a docker image we need to run docker in docker.

alexboden commented 3 months ago

I'm able to run dind on the login node but not in a Slurm job. I suspect this is due to the lack of cgroups inside a Slurm job: running docker info in a Slurm job prints the warning WARNING: Running in rootless-mode without cgroups. Systemd is required to enable cgroups in rootless-mode.

Output in a Slurm job:

alexboden@thor-slurm1:~$ docker run --privileged -d --name dind-test docker:dind
7569d46d300eea0db8d73f1f69a77aeafeb0798d26c062a994ac7c4f221f7baf
alexboden@thor-slurm1:~$ docker logs dind-test
Certificate request self-signature ok
subject=CN=docker:dind server
/certs/server/cert.pem: OK
Certificate request self-signature ok
subject=CN=docker:dind client
/certs/client/cert.pem: OK
cat: can't open '/proc/net/ip6_tables_names': No such file or directory
cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
mount: permission denied (are you root?)
Could not mount /sys/kernel/security.
AppArmor detection and --privileged mode might break.
mkdir: can't create directory '/sys/fs/cgroup/init': Permission denied
alexboden@thor-slurm1:~$ 

Login Node:

alexboden@derek3-ubuntu2:~$ docker run --privileged -d --name dind-test docker:dind
930ab3b247185513e82d4868af1de9a99c27127dd2b8592da21a835bff64c957
alexboden@derek3-ubuntu2:~$ docker logs dind-test
Certificate request self-signature ok
subject=CN=docker:dind server
/certs/server/cert.pem: OK
Certificate request self-signature ok
subject=CN=docker:dind client
/certs/client/cert.pem: OK
cat: can't open '/proc/net/ip6_tables_names': No such file or directory
cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
mount: permission denied (are you root?)
Could not mount /sys/kernel/security.
AppArmor detection and --privileged mode might break.
time="2024-06-06T00:20:56.123906070Z" level=info msg="Starting up"
time="2024-06-06T00:20:56.146438805Z" level=info msg="containerd not running, starting managed containerd"
time="2024-06-06T00:20:56.147487414Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=64
time="2024-06-06T00:20:56.167240887Z" level=info msg="starting containerd" revision=926c9586fe4a6236699318391cd44976a98e31f1 version=v1.7.15
time="2024-06-06T00:20:56.183103199Z" level=info msg="loading plugin \"io.containerd.event.v1.exchange\"..." type=io.containerd.event.v1
time="2024-06-06T00:20:56.183133560Z" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
time="2024-06-06T00:20:56.183384633Z" level=info msg="loading plugin \"io.containerd.warning.v1.deprecations\"..." type=io.containerd.warning.v1
time="2024-06-06T00:20:56.183420006Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.183471868Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." error="no scratch file generator: skip plugin" type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.183492148Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.183503045Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2024-06-06T00:20:56.183512138Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.183561536Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.183879927Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.186577734Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"ip: can't find device 'aufs'\\nmodprobe: can't change directory to '/lib/modules': No such file or directory\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.186605976Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.187458362Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2024-06-06T00:20:56.187489819Z" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
time="2024-06-06T00:20:56.187625207Z" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
time="2024-06-06T00:20:56.187671763Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2024-06-06T00:20:56.187916693Z" level=info msg="metadata content store policy set" policy=shared
time="2024-06-06T00:20:56.215703508Z" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
time="2024-06-06T00:20:56.215903459Z" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
time="2024-06-06T00:20:56.216048241Z" level=info msg="loading plugin \"io.containerd.lease.v1.manager\"..." type=io.containerd.lease.v1
time="2024-06-06T00:20:56.216150449Z" level=info msg="loading plugin \"io.containerd.streaming.v1.manager\"..." type=io.containerd.streaming.v1
time="2024-06-06T00:20:56.216187695Z" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
time="2024-06-06T00:20:56.216322002Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
time="2024-06-06T00:20:56.216551837Z" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
time="2024-06-06T00:20:56.216675383Z" level=info msg="loading plugin \"io.containerd.runtime.v2.shim\"..." type=io.containerd.runtime.v2
time="2024-06-06T00:20:56.216693677Z" level=info msg="loading plugin \"io.containerd.sandbox.store.v1.local\"..." type=io.containerd.sandbox.store.v1
time="2024-06-06T00:20:56.216704810Z" level=info msg="loading plugin \"io.containerd.sandbox.controller.v1.local\"..." type=io.containerd.sandbox.controller.v1
time="2024-06-06T00:20:56.216715637Z" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216725207Z" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216736028Z" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216747279Z" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216758549Z" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216769350Z" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216778882Z" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216787934Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
time="2024-06-06T00:20:56.216809864Z" level=info msg="loading plugin \"io.containerd.grpc.v1.containers\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216821363Z" level=info msg="loading plugin \"io.containerd.grpc.v1.content\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216831584Z" level=info msg="loading plugin \"io.containerd.grpc.v1.diff\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216841062Z" level=info msg="loading plugin \"io.containerd.grpc.v1.events\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216850880Z" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216861397Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216871266Z" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216881075Z" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216891360Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandbox-controllers\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216910825Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandboxes\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216920073Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216929416Z" level=info msg="loading plugin \"io.containerd.grpc.v1.streaming\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216938800Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216951135Z" level=info msg="loading plugin \"io.containerd.transfer.v1.local\"..." type=io.containerd.transfer.v1
time="2024-06-06T00:20:56.216974773Z" level=info msg="loading plugin \"io.containerd.grpc.v1.transfer\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216987633Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.216997684Z" level=info msg="loading plugin \"io.containerd.internal.v1.restart\"..." type=io.containerd.internal.v1
time="2024-06-06T00:20:56.217058879Z" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
time="2024-06-06T00:20:56.217076789Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
time="2024-06-06T00:20:56.217085405Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
time="2024-06-06T00:20:56.217093074Z" level=info msg="skipping tracing processor initialization (no tracing plugin)" error="no OpenTelemetry endpoint: skip plugin"
time="2024-06-06T00:20:56.217342675Z" level=info msg="loading plugin \"io.containerd.grpc.v1.healthcheck\"..." type=io.containerd.grpc.v1
time="2024-06-06T00:20:56.217363188Z" level=info msg="loading plugin \"io.containerd.nri.v1.nri\"..." type=io.containerd.nri.v1
time="2024-06-06T00:20:56.217374973Z" level=info msg="NRI interface is disabled by configuration."
time="2024-06-06T00:20:56.217597656Z" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
time="2024-06-06T00:20:56.217677398Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
time="2024-06-06T00:20:56.217741219Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
time="2024-06-06T00:20:56.217756825Z" level=info msg="containerd successfully booted in 0.052502s"
time="2024-06-06T00:20:57.214218659Z" level=info msg="Loading containers: start."
time="2024-06-06T00:20:57.341385872Z" level=info msg="Loading containers: done."
time="2024-06-06T00:20:57.346554612Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: running in a user namespace" storage-driver=overlay2
time="2024-06-06T00:20:57.346644540Z" level=warning msg="WARNING: No cpu cfs quota support"
time="2024-06-06T00:20:57.346660457Z" level=warning msg="WARNING: No cpu cfs period support"
time="2024-06-06T00:20:57.346665860Z" level=warning msg="WARNING: No cpu shares support"
time="2024-06-06T00:20:57.346670146Z" level=warning msg="WARNING: No cpuset support"
time="2024-06-06T00:20:57.346674277Z" level=warning msg="WARNING: No io.weight support"
time="2024-06-06T00:20:57.346681789Z" level=warning msg="WARNING: No io.weight (per device) support"
time="2024-06-06T00:20:57.346686235Z" level=warning msg="WARNING: No io.max (rbps) support"
time="2024-06-06T00:20:57.346690644Z" level=warning msg="WARNING: No io.max (wbps) support"
time="2024-06-06T00:20:57.346694733Z" level=warning msg="WARNING: No io.max (riops) support"
time="2024-06-06T00:20:57.346700435Z" level=warning msg="WARNING: No io.max (wiops) support"
time="2024-06-06T00:20:57.346716120Z" level=info msg="Docker daemon" commit=8e96db1 containerd-snapshotter=false storage-driver=overlay2 version=26.1.3
time="2024-06-06T00:20:57.346866794Z" level=info msg="Daemon has completed initialization"
time="2024-06-06T00:20:57.411068565Z" level=info msg="API listen on [::]:2376"
time="2024-06-06T00:20:57.411070597Z" level=info msg="API listen on /var/run/docker.sock"
alexboden@derek3-ubuntu2:~$ 

This may have something to do with the cgroup driver: if I run docker info on a login node, the output contains Cgroup Driver: systemd, but if I run docker info in a Slurm job it reports Cgroup Driver: none.
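
For reference, a quick way to compare the two on the command line (a sketch; the grep pattern is my own, the driver values are the ones observed above):

docker info 2>/dev/null | grep -i 'cgroup driver'
# Login node: Cgroup Driver: systemd
# Slurm job:  Cgroup Driver: none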

Perhaps we need to add "exec-opts": ["native.cgroupdriver=cgroupfs"] to /etc/docker/daemon.json (https://stackoverflow.com/questions/43794169/docker-change-cgroup-driver-to-systemd). EDIT: this was already enabled as a command-line arg in the slurm_start_docker script.
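
For context, a hedged sketch of the two ways this option can be set; the daemon.json contents mirror the Stack Overflow suggestion, and the dockerd flag form is an assumption about what slurm_start_docker already passes (not copied from the script):

# Option 1: daemon config file (/etc/docker/daemon.json)
#   {
#     "exec-opts": ["native.cgroupdriver=cgroupfs"]
#   }
# Option 2: the equivalent command-line flag when launching the daemon
dockerd --exec-opt native.cgroupdriver=cgroupfs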

alexboden commented 3 months ago

We could try using Sysbox (https://github.com/nestybox/sysbox?tab=readme-ov-file), as suggested in https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/.
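
If we went that route, usage would look roughly like this (a sketch based on the Sysbox README; it assumes the sysbox-runc runtime has been installed and registered with Docker on the host, which requires root and hasn't been tried here):

# Run the dind image under the Sysbox runtime instead of --privileged
docker run --runtime=sysbox-runc -d --name dind-sysbox docker:dind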

ben-z commented 3 months ago

Sysbox looks fancy! Seems very involved though. We could try it if we can't find easier solutions.

> I'm able to run dind in the login node but not during a slurm job. I suspect this may be due to the lack of cgroups in a slurm job. I get this warning after I run docker info in slurm. WARNING: Running in rootless-mode without cgroups. Systemd is required to enable cgroups in rootless-mode.

In SLURM, we have cgroups v2 but we don't have systemd. Is there a way to run dind without cgroups and systemd? My understanding is that cgroups constrain resources, and the SLURM scheduler already uses cgroups to constrain resources for the entire job, so we don't really need Docker to add another layer of constraints.
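
To illustrate the layering point (a sketch; the srun flags are standard, and the cgroup path shown is the typical cgroup v2 shape rather than output captured from our cluster):

# Everything started inside the job step, including a nested dockerd,
# lands in the Slurm-managed cgroup and inherits the job's limits.
srun --mem=4G --cpus-per-task=2 --pty bash -c 'cat /proc/self/cgroup'
# Typical cgroup v2 output shape: 0::/system.slice/slurmstepd.scope/job_<id>/step_0/...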

alexboden commented 2 months ago

Mounting the socket directly works: docker run -v /tmp/run/docker.sock:/var/run/docker.sock -ti docker sh
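
A quick sanity check of this approach (assuming, as in the command above, that the rootless daemon's socket lives at /tmp/run/docker.sock inside the job):

# The inner CLI talks to the outer (host-side) daemon through the bind-mounted
# socket, so it lists that daemon's containers:
docker run --rm -v /tmp/run/docker.sock:/var/run/docker.sock docker sh -c 'docker ps'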

alexboden commented 2 months ago

Change /tmp to /dev/shm so the socket lives on a RAM-backed filesystem: https://superuser.com/a/45509
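
The adjusted command would look something like this (a sketch; the exact socket path under /dev/shm depends on where slurm_start_docker is configured to put it):

# Same bind mount, with the socket kept on the RAM-backed /dev/shm instead of /tmp
docker run -v /dev/shm/run/docker.sock:/var/run/docker.sock -ti docker sh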

alexboden commented 2 months ago

https://github.com/WATonomous/infra-config/blob/d3aaab744f1b14a490e1998a0c1918848954285c/provision.bash#L227C1-L227C178

alexboden commented 2 months ago

Resolved by mounting the socket directly: docker run -v /tmp/run/docker.sock:/var/run/docker.sock -ti docker sh