concourse / concourse-docker

Official concourse/concourse Docker image.
Apache License 2.0

containerized concourse 7.4.1 with cgroup v2 + containerd results in "max containers reached" errors #76

Closed. mlilien closed this issue 2 years ago

mlilien commented 2 years ago

We use concourse 7.4.1 in a Docker container on a system with cgroup v2 enabled. The worker is configured with CONCOURSE_RUNTIME: containerd. We get "max containers reached" errors shortly after startup. When we list the worker's containers we get around 120 entries, not 250; the expected value is around 60. The error seems to be related to our host's cgroup v2 configuration:

time="2021-11-02T14:50:46.996827586Z" level=info msg="starting signal loop" namespace=concourse path=/run/containerd/io.containerd.runtime.v2.task/concourse/1f728166-46a8-4aea-57f3-1e6c4c6cb67d pid=80
time="2021-11-02T14:50:47.148009483Z" level=error msg="failed to enable controllers ([cpuset cpu io memory pids rdma])" error="failed to write subtree controllers [cpuset cpu io memory pids rdma] to \"/sys/fs/cgroup/cgroup.subtree_control\": write /sys/fs/cgroup/cgroup.subtree_control: operation not supported"
...
#lots of:
{"timestamp":"2021-11-02T15:26:08.599015769Z","level":"error","source":"worker","message":"worker.garden.garden-server.destroy.failed","data":{"error":"gracefully killing task: graceful kill: kill task execed processes: task execed processes: pid listing: runc did not terminate successfully: exit status 1: container_linux.go:187: getting all container pids from cgroups caused: read /sys/fs/cgroup/61ed35fc-d484-4a41-7bab-c2888aac853a/cgroup.procs: operation not supported\n: unknown","handle":"61ed35fc-d484-4a41-7bab-c2888aac853a","session":"1.4.10525"}}
{"timestamp":"2021-11-02T15:26:08.599018255Z","level":"error","source":"worker","message":"worker.garden.garden-server.destroy.failed","data":{"error":"gracefully killing task: graceful kill: kill task execed processes: task execed processes: pid listing: runc did not terminate successfully: exit status 1: container_linux.go:187: getting all container pids from cgroups caused: read /sys/fs/cgroup/garden/24029699-6902-40c3-7d04-603388b95014/cgroup.procs: operation not supported\n: unknown","handle":"24029699-6902-40c3-7d04-603388b95014","session":"1.4.10526"}}
{"timestamp":"2021-11-02T15:26:08.599253941Z","level":"error","source":"worker","message":"worker.container-sweeper.tick.failed-to-destroy-container","data":{"error":"gracefully killing task: graceful kill: kill task execed processes: task execed processes: pid listing: runc did not terminate successfully: exit status 1: container_linux.go:187: getting all container pids from cgroups caused: read /sys/fs/cgroup/61ed35fc-d484-4a41-7bab-c2888aac853a/cgroup.procs: operation not supported\n: unknown","handle":"61ed35fc-d484-4a41-7bab-c2888aac853a","session":"6.72"}}
{"timestamp":"2021-11-02T15:26:08.599354579Z","level":"error","source":"worker","message":"worker.container-sweeper.tick.failed-to-destroy-container","data":{"error":"gracefully killing task: graceful kill: kill task execed processes: task execed processes: pid listing: runc did not terminate successfully: exit status 1: container_linux.go:187: getting all container pids from cgroups caused: read /sys/fs/cgroup/garden/24029699-6902-40c3-7d04-603388b95014/cgroup.procs: operation not supported\n: unknown","handle":"24029699-6902-40c3-7d04-603388b95014","session":"6.72"}}
{"timestamp":"2021-11-02T15:26:08.600572572Z","level":"error","source":"worker","message":"worker.garden.garden-server.destroy.failed","data":{"error":"gracefully killing task: graceful kill: kill task execed processes: task execed processes: pid listing: runc did not terminate successfully: exit status 1: container_linux.go:187: getting all container pids from cgroups caused: read /sys/fs/cgroup/bb048491-1275-4383-421b-0fe76fc3ed16/cgroup.procs: operation not supported\n: unknown","handle":"bb048491-1275-4383-421b-0fe76fc3ed16","session":"1.4.10527"}}
{"timestamp":"2021-11-02T15:26:08.600745099Z","level":"error","source":"worker","message":"worker.container-sweeper.tick.failed-to-destroy-container","data":{"error":"gracefully killing task: graceful kill: kill task execed processes: task execed processes: pid listing: runc did not terminate successfully: exit status 1: container_linux.go:187: getting all container pids from cgroups caused: read /sys/fs/cgroup/bb048491-1275-4383-421b-0fe76fc3ed16/cgroup.procs: operation not supported\n: unknown","handle":"bb048491-1275-4383-421b-0fe76fc3ed16","session":"6.72"}}
{"timestamp":"2021-11-02T15:26:08.601499295Z","level":"error","source":"worker","message":"worker.garden.garden-server.destroy.failed","data":{"error":"gracefully killing task: graceful kill: kill task execed processes: task execed processes: pid listing: runc did not terminate successfully: exit status 1: container_linux.go:187: getting all container pids from cgroups caused: read /sys/fs/cgroup/garden/3eae43f4-1fa6-4361-52f0-f21b4492793e/cgroup.procs: operation not supported\n: unknown","handle":"3eae43f4-1fa6-4361-52f0-f21b4492793e","session":"1.4.10529"}}

...
{"timestamp":"2021-11-02T15:28:30.143224228Z","level":"error","source":"worker","message":"worker.garden.garden-server.create.failed","data":{"error":"new container: checking container capacity: max containers reached","request":{"Handle":"3272600c-218d-4bd6-619f-8248b197d285","GraceTime":0,"RootFSPath":"raw:///worker-state/volumes/live/721c8836-2e07-4d2c-546f-56fac591d58c/volume","BindMounts":[{"src_path":"/worker-state/volumes/live/3bc7d120-c34a-498f-46b4-c44d88d345db/volume","dst_path":"/scratch","mode":1}],"Network":"","Privileged":true,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{},"pid_limits":{}}},"session":"1.4.11671"}}

A fix is to move process 1 into its own cgroup in the entrypoint; setting the subtree controllers then succeeds.
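For reference, that kind of entrypoint change usually looks like the sketch below. It is modelled on the well-known cgroup v2 workaround (as used, for example, in the docker:dind entrypoint) and is not necessarily identical to what the concourse-docker entrypoint ended up doing; the /init cgroup name is arbitrary.

# cgroup v2: the root cgroup may not contain processes once controllers are
# delegated to child cgroups ("no internal processes" rule), so move PID 1
# (and any other root-cgroup processes) into a child cgroup first.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  mkdir -p /sys/fs/cgroup/init
  # write each PID from the root cgroup into the new child cgroup
  xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || true
  # with the root cgroup empty, enabling the subtree controllers now works
  sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
fi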

combor commented 2 years ago

Hi, we are facing the same issue and the solution from https://github.com/concourse/concourse-docker/pull/77 seems to fix the problem, but it looks like it has not been released in the container image yet. Do you know when it can be released?

taylorsilva commented 2 years ago

No ETA on a release, but it will definitely be in the next release, so 7.7.0.

meezaan commented 2 years ago

> No ETA on a release, but it will definitely be in the next release, so 7.7.0.

@taylorsilva Do you have any recommendation on how to deal with this in the meantime? Building the docker image ourselves requires linux-rc to be available locally. Is that available anywhere? Thank you.

norbertkeri commented 2 years ago

I'm running into the same problem on 7.7.1, and @taylorsilva confirmed on Discord that the fix in the linked PR does not solve the issue if you are running the "quickstart" compose setup. If your compose file defines only one concourse container, with command: quickstart, the fix in the PR will not work.

I was pointed towards using https://github.com/concourse/concourse/blob/master/docker-compose.yml instead, but I haven't tried it yet.
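Roughly, that file runs web and worker as separate services instead of a single quickstart container, something like the sketch below (a minimal, untested outline, not the real file: credentials are placeholders, CONCOURSE_RUNTIME: containerd is carried over from this thread, and the signing/worker key wiring the real compose file sets up is omitted).

# sketch only: the key generation and signing-key environment variables
# that the real docker-compose.yml configures are left out here
services:
  db:
    image: postgres
    environment:
      POSTGRES_DB: concourse
      POSTGRES_USER: concourse_user
      POSTGRES_PASSWORD: concourse_pass

  web:
    image: concourse/concourse
    command: web
    depends_on: [db]
    ports: ["8080:8080"]
    environment:
      CONCOURSE_POSTGRES_HOST: db
      CONCOURSE_POSTGRES_USER: concourse_user
      CONCOURSE_POSTGRES_PASSWORD: concourse_pass
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_EXTERNAL_URL: http://localhost:8080

  worker:
    image: concourse/concourse
    command: worker
    privileged: true
    depends_on: [web]
    environment:
      CONCOURSE_RUNTIME: containerd
      CONCOURSE_TSA_HOST: web:2222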