concourse / concourse-docker

Official concourse/concourse Docker image.
Apache License 2.0

Docker swarm incompatibility #50

Open Zagitta opened 4 years ago

Zagitta commented 4 years ago

It might be worth noting in the documentation that this won't work on Docker Swarm because the worker requires privileged mode. The database and web containers work just fine, but the worker node fails with some very cryptic error messages like:

{"timestamp":"2019-09-30T14:31:24.520408669Z","level":"error","source":"guardian","message":"guardian.starting-guardian-backend","data":{"error":"bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': operation not permitted"}}

and

{"timestamp":"2019-09-30T14:31:24.528488853Z","level":"error","source":"worker","message":"worker.garden-runner.logging-runner-exited","data":{"error":"Exit trace for group:\ngdn exited with error: exit status 1\ndns-proxy exited with nil\n","session":"8"}}

These scroll out of view rather quickly because the following error gets spammed repeatedly:

{"timestamp":"2019-09-30T14:31:28.144058311Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.5"}}

The web node also registers the worker node leading to further confusion. Hopefully this saves someone else a couple of painful hours.
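
For anyone triaging similar worker logs, a small filter can stop the repeated failed-to-dial lines from burying the first real error. The following is a minimal Python sketch, assuming the worker emits one JSON object per line with the `level`, `source`, `message`, and `data.error` fields seen in the logs above. The function name `first_errors` and the `NOISY` set are made up for illustration and are not part of Concourse.

```python
import json

# Messages assumed to be a symptom rather than a cause: they repeat
# constantly once the garden backend has failed to start.
NOISY = {"worker.beacon-runner.beacon.forward-conn.failed-to-dial"}


def first_errors(lines):
    """Return (source, message, error) for each distinct non-noisy error, in order."""
    seen = set()
    result = []
    for line in lines:
        # docker-compose prefixes lines with e.g. "worker_1  | ";
        # keep only the part starting at the first "{".
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line
        message = entry.get("message")
        if entry.get("level") != "error" or message in NOISY:
            continue
        if message in seen:
            continue  # report each distinct message only once
        seen.add(message)
        data = entry.get("data")
        error = data.get("error", "") if isinstance(data, dict) else ""
        result.append((entry.get("source"), message, error))
    return result
```

Feeding it the saved output of the worker container (for example `first_errors(open("worker.log"))`, where `worker.log` is a hypothetical file name) should surface the guardian "operation not permitted" line before the connection-refused noise.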

4n70w4 commented 3 years ago

The same issue:

worker_1  | {"timestamp":"2021-01-18T13:19:06.540640000Z","level":"error","source":"baggageclaim","message":"baggageclaim.fs.run-command.failed","data":{"args":["bash","-e","-x","-c","\n\t\tif [ ! -e $IMAGE_PATH ] || [ \"$(stat --printf=\"%s\" $IMAGE_PATH)\" != \"$SIZE_IN_BYTES\" ]; then\n\t\t\ttouch $IMAGE_PATH\n\t\t\ttruncate -s ${SIZE_IN_BYTES} $IMAGE_PATH\n\t\tfi\n\n\t\tlo=\"$(losetup -j $IMAGE_PATH | cut -d':' -f1)\"\n\t\tif [ -z \"$lo\" ]; then\n\t\t\tlo=\"$(losetup -f --show $IMAGE_PATH)\"\n\t\tfi\n\n\t\tif ! file $IMAGE_PATH | grep BTRFS; then\n\t\t\tmkfs.btrfs --nodiscard $IMAGE_PATH\n\t\tfi\n\n\t\tmkdir -p $MOUNT_PATH\n\n\t\tif ! mountpoint -q $MOUNT_PATH; then\n\t\t\tmount -t btrfs -o discard $lo $MOUNT_PATH\n\t\tfi\n\t"],"command":"/bin/bash","env":["PATH=/usr/local/concourse/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","MOUNT_PATH=/worker-state/volumes","IMAGE_PATH=/worker-state/volumes.img","SIZE_IN_BYTES=258752974848"],"error":"exit status 1","session":"3.1","stderr":"+ '[' '!' -e /worker-state/volumes.img ']'\n+ touch /worker-state/volumes.img\n+ truncate -s 258752974848 /worker-state/volumes.img\n++ losetup -j /worker-state/volumes.img\n++ cut -d: -f1\n+ lo=\n+ '[' -z '' ']'\n++ losetup -f --show /worker-state/volumes.img\nlosetup: cannot find an unused loop device\n+ lo=\n","stdout":""}}
worker_1  | {"timestamp":"2021-01-18T13:19:06.540741000Z","level":"error","source":"baggageclaim","message":"baggageclaim.failed-to-set-up-driver","data":{"error":"failed to create btrfs filesystem: exit status 1"}}
worker_1  | error: failed to create btrfs filesystem: exit status 1
concourse-docker_worker_1 exited with code 1

FallingSnow commented 3 years ago

I'm running into this issue on #70 but I'm not using docker swarm, just docker.

Eeems commented 2 years ago

https://github.com/moby/moby/issues/24862 Looks like this won't be solved anytime soon.

Eeems commented 2 years ago

I've managed to get a little further by replacing privileged: true with cap_add: [NET_ADMIN] and setting CONCOURSE_RUNTIME to containerd.

I'm now stuck on the following error:

{"timestamp":"2022-01-28T19:11:04.686241153Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"operation not permitted","handle":"cbd0b4dd-84f8-4a9d-4b01-8ac8c27a968e","privileged":true,"session":"4.1.10","strategy":{"type":"import","path":"/usr/local/concourse/resource-types/docker-image/rootfs.tgz","follow_symlinks":false}}}

In the web UI this shows up as: run check: find or create container on worker 3272415d73a2: failed to create volume

It might be because I'm trying to create a privileged docker-image container.

balthild commented 2 years ago

I've just brought up the worker service in Docker Swarm successfully with sysbox-runc. However, it requires setting the default runtime on the nodes that will run worker containers, because docker stack does not support the runtime property in docker-compose.yml:

# cat /etc/docker/daemon.json

{
    "runtimes": {
        "sysbox-runc": {
            "path": "/usr/bin/sysbox-runc"
        }
    },
    "default-runtime": "sysbox-runc"
}
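
Since a typo in this file only surfaces when the daemon starts, a quick sanity check before deploying may help. This is a minimal Python sketch, assuming the file has the shape shown above (`runtimes` mapping plus `default-runtime`); the helper name `check_default_runtime` is hypothetical, and `runc` is treated as built in because it is Docker's stock runtime.

```python
import json


def check_default_runtime(config_text):
    """Return the default runtime name if it is usable, else raise ValueError."""
    config = json.loads(config_text)
    default = config.get("default-runtime")
    if default is None:
        raise ValueError('"default-runtime" is not set')
    # "runc" ships with Docker and does not need an entry under "runtimes".
    if default != "runc" and default not in config.get("runtimes", {}):
        raise ValueError(f'default runtime "{default}" is not listed under "runtimes"')
    return default
```

On a worker node it could be called as `check_default_runtime(open("/etc/docker/daemon.json").read())`, which for the configuration above would return `sysbox-runc`.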

When I try running the hello-world example pipeline from the doc, I get a similar error:

run check: find or create container on worker dc72cdcf8d3d: failed to create volume

However, the reason shown in the logs is strange:

concourse_worker.0.j6y38t6ei8o7@swarm-2    | {"timestamp":"2022-06-24T15:24:51.529209021Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"7c77c360-c3ed-46aa-62bc-dae9695f43b6","privileged":false,"session":"4.1.87","strategy":{"type":"cow","volume":"df398c77-1fcc-42d2-5987-77b450071893"}}}

It's not something about permissions, but "invalid argument".

Here's my docker-compose.yml:

```yaml
version: '3.9'

services:
  web:
    image: concourse/concourse
    command: web
    ports:
      - published: 8084
        target: 8080
        mode: host
    networks:
      - concourse
    deploy:
      mode: global
      placement:
        constraints:
          - "node.role == manager"
    secrets:
      - authorized_worker_keys
      - session_signing_key
      - tsa_host_key
      - tsa_host_key.pub
    environment:
      CONCOURSE_EXTERNAL_URL: https://concourse.xxxxxxxxxxxx.com
      CONCOURSE_POSTGRES_HOST: xxxxxxxxxxxx
      CONCOURSE_POSTGRES_USER: concourse
      CONCOURSE_POSTGRES_PASSWORD: xxxxxxxxxxxx
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_ADD_LOCAL_USER: balthild:xxxxxxxxxxxx
      CONCOURSE_MAIN_TEAM_LOCAL_USER: balthild
      CONCOURSE_SESSION_SIGNING_KEY: /run/secrets/session_signing_key
      CONCOURSE_TSA_AUTHORIZED_KEYS: /run/secrets/authorized_worker_keys
      CONCOURSE_TSA_HOST_KEY: /run/secrets/tsa_host_key
      CONCOURSE_TSA_PUBLIC_KEY: /run/secrets/tsa_host_key.pub
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"

  worker:
    image: concourse/concourse
    command: worker
    networks:
      - concourse
    #privileged: true
    #runtime: sysbox-runc
    depends_on: [web]
    stop_signal: SIGUSR2
    deploy:
      mode: global
      placement:
        constraints:
          - "node.role != manager"
    secrets:
      - tsa_host_key.pub
      - worker_key
      - worker_key.pub
    environment:
      CONCOURSE_TSA_PUBLIC_KEY: /run/secrets/tsa_host_key.pub
      CONCOURSE_TSA_WORKER_PRIVATE_KEY: /run/secrets/worker_key
      CONCOURSE_TSA_HOST: web:2222
      CONCOURSE_RUNTIME: containerd
      CONCOURSE_BIND_IP: 0.0.0.0
      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
      # avoid using loopbacks
      CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
      # work with docker-compose's dns
      CONCOURSE_CONTAINERD_DNS_PROXY_ENABLE: "true"
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"

secrets:
  session_signing_key:
    file: ./keys/web/session_signing_key
  authorized_worker_keys:
    file: ./keys/web/authorized_worker_keys
  tsa_host_key:
    file: ./keys/web/tsa_host_key
  tsa_host_key.pub:
    file: ./keys/web/tsa_host_key.pub
  worker_key:
    file: ./keys/worker/worker_key
  worker_key.pub:
    file: ./keys/worker/worker_key.pub

networks:
  concourse:
    driver: overlay
```

balthild commented 2 years ago

~It seems that the invalid argument error is related to #42. But the workaround mentioned there (mount a volume to /worker-state) does not work for me.~

Update: The message describing the actual error is produced by the kernel and can be viewed with journalctl -f.

Jun 24 16:28:48 swarm-2 kernel: overlayfs: idmapped layers are currently not supported

Support for idmapped layers in overlayfs is reportedly coming in Linux 5.19 (the current mainline kernel is 5.18).
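
To check whether a given node is new enough, the kernel release string can be compared against that version. This is a minimal Python sketch that takes the 5.19 figure mentioned above as an assumption rather than a confirmed fact; the helper name `kernel_at_least` is made up for illustration.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string like "5.18.0-1-amd64" is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Tuples compare element by element, so (6, 1) >= (5, 19) is True.
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)
```

On a worker node, `kernel_at_least(platform.release(), 5, 19)` returns whether the running kernel meets that threshold. It only compares version numbers and does not confirm that the overlayfs feature is actually enabled.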