concourse / concourse-docker

Official concourse/concourse Docker image.
Apache License 2.0

"failed to create volume", Concourse running in docker-compose on Linux #42

Open barrucadu opened 5 years ago

barrucadu commented 5 years ago

I've got Concourse running on a NixOS 18.03 VPS inside docker-compose, and this is working fine. I'm now trying to deploy exactly the same Concourse configuration to another NixOS 18.03 machine, but am not having any luck. I'm using the same docker-compose file and the same pipelines.

The new machine gives errors about being unable to create volumes:

Apr 12 21:55:49 nyarlathotep docker-compose[26088]: concourse_1  | {"timestamp":"2019-04-12T20:55:49.753780802Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"af97f489-2d27-4007-57b4-e5cb9c43e659","error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1.3","team":"main","volume":"e843e1a7-4122-494b-5397-d0a94294e418"}}
Apr 12 21:55:49 nyarlathotep docker-compose[26088]: concourse_1  | {"timestamp":"2019-04-12T20:55:49.793734883Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.failed-to-fetch-image-for-container","data":{"container":"af97f489-2d27-4007-57b4-e5cb9c43e659","error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1","team":"main"}}
Apr 12 21:55:49 nyarlathotep docker-compose[26088]: concourse_1  | {"timestamp":"2019-04-12T20:55:49.794088237Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.failed-to-initialize-new-container","data":{"error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1","team":"main"}}

The concoursefiles-git resource that it's failing to create a volume for is a normal git resource. The other resources in the pipeline are failing with the same error.

The pipeline is here: https://github.com/barrucadu/concoursefiles/blob/master/pipelines/ci.yml

This is the docker-compose file:

version: '3'

services:
  concourse:
    image: concourse/concourse
    command: quickstart
    privileged: true
    depends_on: [postgres, registry]
    ports: ["3003:8080"]
    environment:
      CONCOURSE_POSTGRES_HOST: postgres
      CONCOURSE_POSTGRES_USER: concourse
      CONCOURSE_POSTGRES_PASSWORD: concourse
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_EXTERNAL_URL: "https://ci.nyarlathotep.barrucadu.co.uk"
      CONCOURSE_MAIN_TEAM_GITHUB_USER: "barrucadu"
      CONCOURSE_GITHUB_CLIENT_ID: "<omitted>"
      CONCOURSE_GITHUB_CLIENT_SECRET: "<omitted>"
      CONCOURSE_LOG_LEVEL: error
      CONCOURSE_GARDEN_LOG_LEVEL: error
    networks:
      - ci

  postgres:
    image: postgres
    environment:
      POSTGRES_DB: concourse
      POSTGRES_PASSWORD: concourse
      POSTGRES_USER: concourse
      PGDATA: /database
    networks:
      - ci
    volumes:
      - pgdata:/database

  registry:
    image: registry
    networks:
      ci:
        ipv4_address: "172.21.0.254"
        aliases: [ci-registry]
    volumes:
      - regdata:/var/lib/registry

networks:
  ci:
    ipam:
      driver: default
      config:
        - subnet: 172.21.0.0/16

volumes:
  pgdata:
  regdata:

I'm using the latest concourse/concourse image, as I set this up today. The version of docker is 18.09.2 (build 62479626f213818ba5b4565105a05277308587d5). What can I look at to help debug this?

vito commented 5 years ago

Are there any baggageclaim logs with more information?

barrucadu commented 5 years ago

Here's the log from the systemd unit running docker-compose: https://misc.barrucadu.co.uk/forever/e4355f6a-9b9e-449b-8263-196cc1222161/concourseci.log

There are a few baggageclaim errors:

{"timestamp":"2019-06-08T16:52:06.743477477Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1","session":"3.1.29.1"}}
{"timestamp":"2019-06-08T16:52:06.743542809Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1","privileged":true,"session":"3.1.29","strategy":{"type":"cow","volume":"50edbcad-c379-4e9f-4c2b-362041bcec32"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:06.743579749Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"ad59841f-1ce1-4d90-6b70-8700467701dd","session":"3.1.34.1"}}
{"timestamp":"2019-06-08T16:52:06.743608924Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"ad59841f-1ce1-4d90-6b70-8700467701dd","privileged":true,"session":"3.1.34","strategy":{"type":"cow","volume":"d1ad2edf-38b9-40f9-4048-da5300b5d0ab"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:07.151643149Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"281d33e4-c50a-408e-5895-b70dcddcfade","error":"failed to create volume","pipeline":"ci","resource":"ci-base-image","session":"18.1.2.1.1.3","team":"main","volume":"ad59841f-1ce1-4d90-6b70-8700467701dd"}}
{"timestamp":"2019-06-08T16:52:07.162617819Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"ec58bb4f-214b-4377-5d96-4c37462eab68","error":"failed to create volume","pipeline":"ci","resource":"ci-agent-image","session":"18.1.1.1.1.3","team":"main","volume":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1"}}
{"timestamp":"2019-06-08T16:52:07.280776831Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"86e8680d-a0cb-48fc-4906-53894aa351c6","session":"3.1.42.1"}}
{"timestamp":"2019-06-08T16:52:07.280816616Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"86e8680d-a0cb-48fc-4906-53894aa351c6","privileged":true,"session":"3.1.42","strategy":{"type":"cow","volume":"50edbcad-c379-4e9f-4c2b-362041bcec32"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:07.619068627Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"81b2354f-6914-4cff-613a-89616cded84a","error":"failed to create volume","pipeline":"ci","resource":"ci-resource-rsync-image","session":"18.1.3.1.1.3","team":"main","volume":"86e8680d-a0cb-48fc-4906-53894aa351c6"}}
{"timestamp":"2019-06-08T16:52:10.371909263Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"f9c2581b-8f75-4713-7346-4fa7ec29b455","session":"3.1.50.1"}}
{"timestamp":"2019-06-08T16:52:10.371946002Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"f9c2581b-8f75-4713-7346-4fa7ec29b455","privileged":false,"session":"3.1.50","strategy":{"type":"cow","volume":"05e99eb0-95a6-439b-6e08-9215476f7cc7"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:10.373695769Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"40452054-8f69-4694-7b9a-483c97c6ded6","error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1.3","team":"main","volume":"f9c2581b-8f75-4713-7346-4fa7ec29b455"}}
{"timestamp":"2019-06-08T16:52:18.362785417Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"ddfabe8d-4b23-44f7-598b-a9c30853eef3","session":"3.1.56.1"}}
{"timestamp":"2019-06-08T16:52:18.362824190Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"ddfabe8d-4b23-44f7-598b-a9c30853eef3","privileged":false,"session":"3.1.56","strategy":{"type":"cow","volume":"05e99eb0-95a6-439b-6e08-9215476f7cc7"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:18.987394396Z","level":"error","source":"atc","message":"atc.check-resource.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"713ee08b-487a-4158-6deb-69f48de6e58d","error":"failed to create volume","session":"367.3","volume":"ddfabe8d-4b23-44f7-598b-a9c30853eef3"}}

vito commented 5 years ago

Looks like a pretty low-level failure, possibly from an incompatibility with your kernel/OS stack - we haven't tested NixOS. :thinking: To get to the bottom of the 'invalid argument' error you'll probably need to run strace against the concourse worker process. Sorry the logs aren't super useful.

caiges commented 5 years ago

I'm getting this on Arch running in compose as well. Tried running the worker with strace but I didn't see anything that stood out.

caiges commented 5 years ago

Switching my Docker storage driver to vfs allows me to work around this issue but I don't think that's a solution. I haven't fully taken a dive into what's actually happening here.

EDIT: For some background, I'm building docker images as part of my pipeline.

caiges commented 5 years ago

@barrucadu setting:

CONCOURSE_WORK_DIR=/worker-state
CONCOURSE_WORKER_WORK_DIR=/worker-state

and adding a volume for the /worker-state directory in my worker's service configuration was necessary for baggageclaim to create volumes.

barrucadu commented 5 years ago

I tried setting CONCOURSE_WORKER_WORK_DIR after adding a separate worker container (rather than using the quickstart command), giving this docker-compose file, but still had the original problem.

I then tried switching to the overlay2 storage driver, but docker doesn't seem to support overlay2 on zfs (do you also use zfs, @caiges?):

Error starting daemon: error initializing graphdriver: backing file system is unsupported for this graph driver

Then I tried switching to the vfs storage driver, but still had the original problem.

caiges commented 5 years ago

I did a cursory search and couldn't find CONCOURSE_WORKER_WORK_DIR referenced anywhere. CONCOURSE_WORK_DIR does appear to be used.

I don't use ZFS but you could configure docker to use a different partition for its storage that supports "overlay2".

vito commented 5 years ago

> I'm getting this on Arch running in compose as well. Tried running the worker with strace but I didn't see anything that stood out.

FWIW I think you'd want to grep the output for EINVAL.

Here's a snippet that'll strip out a lot of noise:

strace -f -p (worker pid) -e '!futex,restart_syscall,epoll_wait,select,getdents64,close,sched_yield,epoll_ctl,accept4,setsockopt,getsockname'
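
Following the suggestion to look for EINVAL, the sketch below picks out only the system calls that returned EINVAL from saved strace output. It is a minimal, hypothetical helper (`einval_calls` is an illustrative name) and assumes strace's usual line layout, `[pid N] syscall(args) = -1 EINVAL (Invalid argument)`. Calls that strace splits into "unfinished" and "resumed" fragments are not matched, so some failures may be missed.

```python
import re

# Matches e.g.: [pid 123] mount("overlay", ...) = -1 EINVAL (Invalid argument)
# The pid prefix is optional; with `strace -f -o file` it is a bare number.
_EINVAL = re.compile(
    r"^(?:\[pid\s+(?P<pid1>\d+)\]\s+|(?P<pid2>\d+)\s+)?"
    r"(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+-1\s+EINVAL\b"
)


def einval_calls(lines):
    """Yield (pid, syscall, args) for each call that failed with EINVAL.

    pid is None when the line carries no pid prefix.
    """
    for line in lines:
        m = _EINVAL.match(line.strip())
        if m:
            pid = m.group("pid1") or m.group("pid2")
            yield pid, m.group("syscall"), m.group("args")
```

For a volume-creation failure, the interesting entries are likely to be filesystem calls such as `mount`, where the arguments show which directories were involved.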

mikroskeem commented 5 years ago

I'm running into the same issue - NixOS 19.09 and ZFS. I'll try debugging this...

mikroskeem commented 5 years ago

[ 395.180725] overlayfs: filesystem on '/workdir/overlays/14745864-d72c-4d46-4dd3-e03ffb3a8585' not supported as upperdir

So I assume that the worker strictly attempts to use overlayfs. I'm not entirely sure how Concourse works internally yet, but I'll try feeding the worker an ext4-based workdir hosted on a ZFS zvol instead.
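
Since the kernel message above rejects the work directory's filesystem as an overlayfs upperdir, it can help to confirm which filesystem actually backs that directory. The sketch below is a hypothetical helper (`fs_type_for` is an illustrative name) that looks a path up in the text of /proc/mounts. It assumes the standard "device mountpoint fstype options ..." layout, does not resolve symlinks, and does not decode escaped characters (such as `\040` for a space) in mount points.

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fstype) of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts, one mount per line.
    The longest mount point that is a prefix of `path` wins; later
    entries win ties, since they are mounted on top of earlier ones.
    Returns None if no mount matches.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype)
    return best
```

Run inside the worker container, `fs_type_for("/workdir", open("/proc/mounts").read())` reports the backing filesystem of the work directory; `/workdir` here is taken from the kernel message and may differ in other setups.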

mikroskeem commented 5 years ago

Yeah that seems to work.

1) Create zvol with ext4

zfs create -V 10g rpool/concourse-workdir0-ext4
mkfs.ext4 /dev/zvol/rpool/concourse-workdir0-ext4

2) Configure NixOS to mount it at /mnt/concourse-workdir0

Into configuration.nix, add:

fileSystems."/mnt/concourse-workdir0" = {
  device = "/dev/zvol/rpool/concourse-workdir0-ext4";
  fsType = "ext4";
};

3) Configure worker to use given workdir

trevormarshall commented 4 years ago

We are seeing this error very frequently in the Spring Boot builds. We are running v5.7.2 on bosh-vsphere-esxi-ubuntu-xenial-go_agent 621.29 stemcell, using the overlay driver.

In web.stdout.log we have:

{"timestamp":"2020-01-23T15:35:59.923944142Z","level":"error","source":"atc","message":"atc.tracker.track.task-step.find-or-create-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"build":104720,"error":"failed to create volume","job":"build-pull-requests","job-id":2744,"pipeline":"spring-boot-2.3.x","session":"19.62686.7.31","step-name":"build-project","volume":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1"}}

In worker.stdout.log we see the baggageclaim error:

{"timestamp":"2020-01-23T15:35:55.212121511Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"exit status 1","handle":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1","session":"3.1.999394.1"}}
{"timestamp":"2020-01-23T15:35:55.299415431Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"exit status 1","handle":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1","privileged":false,"session":"3.1.999394","strategy":{"type":"import","path":"/var/vcap/data/worker/work/volumes/live/24ae1aac-852c-4c5c-414d-29088119c8a3/volume","follow_symlinks":false}}}

The error arrives at the end of builds. The pipelines use task caches (https://concourse-ci.org/tasks.html#task-caches) to cache dependencies between runs; see https://github.com/spring-projects/spring-boot/blob/89237634c7931f275ddbddba176c7a826b1667cb/ci/tasks/build-project.yml#L7 for the task definition. When we query the volumes table by handle, we can confirm no record was created for 999ba5a8-f8a1-4e5d-5087-c5e3974e15e1.

We considered underlying server load, so we enabled container-placement-strategy-limit-active-tasks, which distributed things nicely (thank you!). Now that load seems fine, it is mainly the Spring Boot pipelines that have this issue in our multi-tenant https://ci.spring.io.

We can re-create all of the workers to make the issue go away for a few days, but it eventually comes back. We see a clear pattern of the error resurfacing after a number of green builds. This has also been reported in #concourse-operations.

inkblot commented 4 years ago

I started seeing this error after upgrading from concourse 6.1.0 to 6.7.1. I have downgraded back to 6.1.0.

Only resources using custom resource types are affected. I am running my workers on Flatcar Linux (successor of the defunct CoreOS) as a docker container started by a systemd unit. I have tried setting the baggageclaim driver to overlay and naive with the same results as the default value. I have tried mounting a volume in the container and using it as the work directory with the same results.

The kernel is 5.4.77-flatcar. The filesystem is ext4 and there is plenty of space. Docker is version 19.03.12, running with defaults plus a registry mirror. Here is the systemd unit that I use to start the worker container:

[Unit]
Description=concourse-worker
After=network-online.target
After=docker.service
After=coreos-metadata.service
Requires=docker.service
Requires=coreos-metadata.service

[Service]
TimeoutStartSec=0
Restart=always
EnvironmentFile=/run/metadata/flatcar
ExecStartPre=-/usr/bin/docker stop -t 100 concourse-worker
ExecStartPre=-/usr/bin/docker rm concourse-worker
ExecStartPre=/usr/bin/docker pull concourse/concourse:6.7.1
ExecStartPre=/usr/bin/docker volume create worker-scratch
ExecStart=/usr/bin/docker run \
  --privileged \
  --name concourse-worker \
  --volume /stuff/concourse/worker:/concourse-keys:ro \
  --volume worker-scratch:/work \
  concourse/concourse:6.7.1 \
  worker \
  --tsa-host concourse-tsa.movealong.internal:2222 \
  --tsa-public-key /concourse-keys/tsa_host_key.pub \
  --tsa-worker-private-key /concourse-keys/worker_key \
  --work-dir /work \
  --ephemeral
ExecStop=/usr/bin/docker stop -t 100 concourse-worker
ExecStop=/usr/bin/docker volume rm worker-scratch

[Install]
WantedBy=multi-user.target

inkblot commented 4 years ago

I've isolated the problem to the upgrade from 6.6.0 to 6.7.1. All Concourse minor versions from 6.1.0 through 6.6.0 are able to process resources with a custom resource type correctly.