barrucadu opened this issue 5 years ago
Are there any `baggageclaim` logs with more information?
Here's the log from the systemd unit running `docker-compose`: https://misc.barrucadu.co.uk/forever/e4355f6a-9b9e-449b-8263-196cc1222161/concourseci.log
There are a few `baggageclaim` errors:
```
{"timestamp":"2019-06-08T16:52:06.743477477Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1","session":"3.1.29.1"}}
{"timestamp":"2019-06-08T16:52:06.743542809Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1","privileged":true,"session":"3.1.29","strategy":{"type":"cow","volume":"50edbcad-c379-4e9f-4c2b-362041bcec32"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:06.743579749Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"ad59841f-1ce1-4d90-6b70-8700467701dd","session":"3.1.34.1"}}
{"timestamp":"2019-06-08T16:52:06.743608924Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"ad59841f-1ce1-4d90-6b70-8700467701dd","privileged":true,"session":"3.1.34","strategy":{"type":"cow","volume":"d1ad2edf-38b9-40f9-4048-da5300b5d0ab"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:07.151643149Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"281d33e4-c50a-408e-5895-b70dcddcfade","error":"failed to create volume","pipeline":"ci","resource":"ci-base-image","session":"18.1.2.1.1.3","team":"main","volume":"ad59841f-1ce1-4d90-6b70-8700467701dd"}}
{"timestamp":"2019-06-08T16:52:07.162617819Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"ec58bb4f-214b-4377-5d96-4c37462eab68","error":"failed to create volume","pipeline":"ci","resource":"ci-agent-image","session":"18.1.1.1.1.3","team":"main","volume":"f98b290f-3b3c-4ff7-5e1c-0069f418e0d1"}}
{"timestamp":"2019-06-08T16:52:07.280776831Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"86e8680d-a0cb-48fc-4906-53894aa351c6","session":"3.1.42.1"}}
{"timestamp":"2019-06-08T16:52:07.280816616Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"86e8680d-a0cb-48fc-4906-53894aa351c6","privileged":true,"session":"3.1.42","strategy":{"type":"cow","volume":"50edbcad-c379-4e9f-4c2b-362041bcec32"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:07.619068627Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"81b2354f-6914-4cff-613a-89616cded84a","error":"failed to create volume","pipeline":"ci","resource":"ci-resource-rsync-image","session":"18.1.3.1.1.3","team":"main","volume":"86e8680d-a0cb-48fc-4906-53894aa351c6"}}
{"timestamp":"2019-06-08T16:52:10.371909263Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"f9c2581b-8f75-4713-7346-4fa7ec29b455","session":"3.1.50.1"}}
{"timestamp":"2019-06-08T16:52:10.371946002Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"f9c2581b-8f75-4713-7346-4fa7ec29b455","privileged":false,"session":"3.1.50","strategy":{"type":"cow","volume":"05e99eb0-95a6-439b-6e08-9215476f7cc7"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:10.373695769Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"40452054-8f69-4694-7b9a-483c97c6ded6","error":"failed to create volume","pipeline":"ci","resource":"concoursefiles-git","session":"18.1.4.1.1.3","team":"main","volume":"f9c2581b-8f75-4713-7346-4fa7ec29b455"}}
{"timestamp":"2019-06-08T16:52:18.362785417Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"invalid argument","handle":"ddfabe8d-4b23-44f7-598b-a9c30853eef3","session":"3.1.56.1"}}
{"timestamp":"2019-06-08T16:52:18.362824190Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"ddfabe8d-4b23-44f7-598b-a9c30853eef3","privileged":false,"session":"3.1.56","strategy":{"type":"cow","volume":"05e99eb0-95a6-439b-6e08-9215476f7cc7"},"ttl":0}}
{"timestamp":"2019-06-08T16:52:18.987394396Z","level":"error","source":"atc","message":"atc.check-resource.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"container":"713ee08b-487a-4158-6deb-69f48de6e58d","error":"failed to create volume","session":"367.3","volume":"ddfabe8d-4b23-44f7-598b-a9c30853eef3"}}
```
Looks like a pretty low-level failure, possibly from an incompatibility with your kernel/OS stack - we haven't tested NixOS. :thinking: To get to the bottom of the 'invalid argument' error you'll probably need to run `strace` against the `concourse worker` process. Sorry the logs aren't super useful.
I'm getting this on Arch running in compose as well. Tried running the worker with strace but I didn't see anything that stood out.
Switching my Docker storage driver to `vfs` allows me to work around this issue, but I don't think that's a real solution. I haven't yet dug into what's actually happening here.
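For reference, the storage-driver switch is just the standard Docker daemon setting (a sketch: add this to `/etc/docker/daemon.json` and restart the daemon; note `vfs` is slow and disables layer sharing, which is why it's only a workaround):

```json
{
  "storage-driver": "vfs"
}
```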
EDIT: For some background, I'm building docker images as part of my pipeline.
@barrucadu setting:

```
CONCOURSE_WORK_DIR=/worker-state
CONCOURSE_WORKER_WORK_DIR=/worker-state
```

and adding a volume for the `/worker-state` directory in my worker's service configuration was necessary for baggageclaim to create volumes.
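In docker-compose terms that looks roughly like this (a sketch; the service name `worker` and volume name `worker-state` are made up, and as noted further down only `CONCOURSE_WORK_DIR` appears to actually be read):

```yaml
services:
  worker:
    environment:
      CONCOURSE_WORK_DIR: /worker-state
    volumes:
      - worker-state:/worker-state

volumes:
  worker-state:
```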
I tried setting `CONCOURSE_WORKER_WORK_DIR` after adding a worker container (rather than using the `quickstart` command), giving this docker-compose file, but had the original problem.
I then tried switching to the `overlay2` storage driver, but docker doesn't seem to support `overlay2` on zfs (do you also use zfs, @caiges?):

```
Error starting daemon: error initializing graphdriver: backing file system is unsupported for this graph driver
```
Then I tried switching to the `vfs` storage driver, but still had the original problem.
I did a cursory search and couldn't find `CONCOURSE_WORKER_WORK_DIR` referenced anywhere; `CONCOURSE_WORK_DIR` does appear to be used.
I don't use ZFS, but you could configure Docker to keep its storage on a different partition whose filesystem supports `overlay2`.
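E.g. something like this in `/etc/docker/daemon.json` (a sketch, assuming an ext4 partition mounted at `/mnt/docker-ext4`; `data-root` relocates all of Docker's storage, so existing images would need to be re-pulled or migrated):

```json
{
  "storage-driver": "overlay2",
  "data-root": "/mnt/docker-ext4"
}
```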
> I'm getting this on Arch running in compose as well. Tried running the worker with strace but I didn't see anything that stood out.

FWIW I think you'd want to `grep` the output for `EINVAL`.
Here's a snippet that'll strip out a lot of noise:

```
strace -f -p (worker pid) -e '!futex,restart_syscall,epoll_wait,select,getdents64,close,sched_yield,epoll_ctl,accept4,setsockopt,getsockname'
```
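Once you have that output, the `EINVAL` filter is trivial. The snippet below is a sketch using a hypothetical captured line (a real run would pipe strace itself, and strace prints to stderr, hence the `2>&1`):

```shell
# Hypothetical strace output line of the kind we're hunting for:
sample='mount("overlay", "/workdir/overlays/x", "overlay", 0, "lowerdir=...") = -1 EINVAL (Invalid argument)'

# In practice: strace -f -p <worker-pid> ... 2>&1 | grep EINVAL
echo "$sample" | grep -o 'EINVAL'
# → EINVAL
```

Seeing which syscall returns `EINVAL` (and with what arguments) is usually enough to identify the incompatible filesystem operation.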
I'm running into the same issue - NixOS 19.09 and ZFS. I'll try debugging this...
```
[  395.180725] overlayfs: filesystem on '/workdir/overlays/14745864-d72c-4d46-4dd3-e03ffb3a8585' not supported as upperdir
```
So I assume the worker strictly attempts to use overlayfs, and ZFS can't serve as its upperdir. I'm not entirely sure how Concourse works internally yet, but I'll try feeding the worker an ext4-based workdir hosted on a ZFS zvol instead.
Yeah that seems to work.
1) Create a zvol formatted as ext4:

```
zfs create -V 10g rpool/concourse-workdir0-ext4
mkfs.ext4 /dev/zvol/rpool/concourse-workdir0-ext4
```
2) Configure NixOS to mount it at `/mnt/concourse-workdir0`. In `configuration.nix`, add:

```nix
fileSystems."/mnt/concourse-workdir0" = {
  device = "/dev/zvol/rpool/concourse-workdir0-ext4";
  fsType = "ext4";
};
```
3) Configure the worker to use the given workdir: set `CONCOURSE_WORK_DIR` to `/workdir`.
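With the worker running in a container, that works out to something like the following compose fragment (a sketch; the service name is made up, and the ext4 mount from step 2 is bind-mounted over the path the worker is told to use):

```yaml
services:
  worker:
    environment:
      CONCOURSE_WORK_DIR: /workdir
    volumes:
      - /mnt/concourse-workdir0:/workdir
```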
We are seeing this error very frequently in the Spring Boot builds. We are running v5.7.2 on the `bosh-vsphere-esxi-ubuntu-xenial-go_agent 621.29` stemcell, using the `overlay` driver.
In `web.stdout.log` we have:

```
{"timestamp":"2020-01-23T15:35:59.923944142Z","level":"error","source":"atc","message":"atc.tracker.track.task-step.find-or-create-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"build":104720,"error":"failed to create volume","job":"build-pull-requests","job-id":2744,"pipeline":"spring-boot-2.3.x","session":"19.62686.7.31","step-name":"build-project","volume":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1"}}
```
In `worker.stdout.log` we see the baggageclaim error:

```
{"timestamp":"2020-01-23T15:35:55.212121511Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"exit status 1","handle":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1","session":"3.1.999394.1"}}
{"timestamp":"2020-01-23T15:35:55.299415431Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"exit status 1","handle":"999ba5a8-f8a1-4e5d-5087-c5e3974e15e1","privileged":false,"session":"3.1.999394","strategy":{"type":"import","path":"/var/vcap/data/worker/work/volumes/live/24ae1aac-852c-4c5c-414d-29088119c8a3/volume","follow_symlinks":false}}}
```
The error appears at the end of builds. The pipelines use task caches (https://concourse-ci.org/tasks.html#task-caches) to cache dependencies between runs:
https://github.com/spring-projects/spring-boot/blob/89237634c7931f275ddbddba176c7a826b1667cb/ci/tasks/build-project.yml#L7
When we query the `volumes` table by `handle`, we can confirm no record was created for `999ba5a8-f8a1-4e5d-5087-c5e3974e15e1`.
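For anyone wanting to repeat that check, the query is along these lines (a sketch; it assumes direct access to Concourse's Postgres database, here named `atc` - adjust connection details for your deployment):

```shell
# An empty result set confirms no volume row exists for the handle.
psql -d atc -c "SELECT * FROM volumes WHERE handle = '999ba5a8-f8a1-4e5d-5087-c5e3974e15e1';"
```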
We considered underlying server load, so we enabled `container-placement-strategy-limit-active-tasks`, which distributed things nicely (thank you!). Now that load seems fine, it is mainly the Spring Boot pipelines that have this issue on our multi-tenant https://ci.spring.io.
We can re-create all of the workers to make the issue go away for a few days, but it eventually comes back. We see a clear pattern of the error re-surfacing after a number of green builds, as reported in #concourse-operations.
I started seeing this error after upgrading from concourse 6.1.0 to 6.7.1. I have downgraded back to 6.1.0.
Only resources using custom resource types are affected. I am running my workers on Flatcar Linux (successor of the defunct CoreOS) as a Docker container started by a systemd unit. I have tried setting the baggageclaim driver to `overlay` and `naive`, with the same results as the default value. I have also tried mounting a volume in the container and using it as the work directory, with the same results.
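For reference, I was overriding the driver like this (a sketch; `CONCOURSE_BAGGAGECLAIM_DRIVER` is the environment-variable form of the worker's `--baggageclaim-driver` flag, and the default is `detect`):

```shell
# added to the `docker run` invocation alongside the other options
--env CONCOURSE_BAGGAGECLAIM_DRIVER=overlay
```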
The kernel is 5.4.77-flatcar. The filesystem is ext4 and there is plenty of space. Docker is version 19.03.12, running with defaults plus a registry mirror. Here is the systemd unit I use to start the worker container:
```
[Unit]
Description=concourse-worker
After=network-online.target
After=docker.service
After=coreos-metadata.service
Requires=docker.service
Requires=coreos-metadata.service

[Service]
TimeoutStartSec=0
Restart=always
EnvironmentFile=/run/metadata/flatcar
ExecStartPre=-/usr/bin/docker stop -t 100 concourse-worker
ExecStartPre=-/usr/bin/docker rm concourse-worker
ExecStartPre=/usr/bin/docker pull concourse/concourse:6.7.1
ExecStartPre=/usr/bin/docker volume create worker-scratch
ExecStart=/usr/bin/docker run \
  --privileged \
  --name concourse-worker \
  --volume /stuff/concourse/worker:/concourse-keys:ro \
  --volume worker-scratch:/work \
  concourse/concourse:6.7.1 \
  worker \
  --tsa-host concourse-tsa.movealong.internal:2222 \
  --tsa-public-key /concourse-keys/tsa_host_key.pub \
  --tsa-worker-private-key /concourse-keys/worker_key \
  --work-dir /work \
  --ephemeral
ExecStop=/usr/bin/docker stop -t 100 concourse-worker
ExecStop=/usr/bin/docker volume rm worker-scratch

[Install]
WantedBy=multi-user.target
```
I've isolated the problem to the upgrade from 6.6.0 to 6.7.1. All Concourse minor versions from 6.1.0 through 6.6.0 are able to process resources with a custom resource type correctly.
I've got Concourse running on a NixOS 18.03 VPS inside docker-compose, and it's working fine. I'm now trying to deploy exactly the same Concourse configuration to another NixOS 18.03 machine, but I'm not having any luck. I'm using the same docker-compose file and the same pipelines.
The new machine gives errors about being unable to create volumes:
The `concoursefiles-git` resource it's failing to create a volume for is a normal git resource. The other resources in the pipeline are failing with the same error. The pipeline is here: https://github.com/barrucadu/concoursefiles/blob/master/pipelines/ci.yml
This is the docker-compose file:
I'm using the latest `concourse/concourse` image, as I set this up today. The version of Docker is 18.09.2 (build 62479626f213818ba5b4565105a05277308587d5). What can I look at to help debug this?