cloudfoundry / guardian

containers4life
Apache License 2.0

gdn fails with runc error on Ubuntu 22.04 LTS #301

Closed xtremerui closed 5 months ago

xtremerui commented 2 years ago

Description

When running the Concourse binary (using gdn for containerization) in a Google Cloud VM with the ubuntu-2204-lts image family as the OS image, we see errors like the one below:

Aug 25 21:56:12 smoke-splendid-earwig concourse[4460]: {"timestamp":"2022-08-25T21:56:12.809930620Z","level":"error","source":"guardian","message":"guardian.create.containerizer-create.runtime-create-failed","data":{"error":"runc run: exit status 1: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting \"cgroup\" to rootfs at \"/sys/fs/cgroup\" caused: invalid argument","handle":"a17876d5-647e-492d-6ae2-311b1a56d718","session":"40.3"}}

For comparison, when running Concourse locally with docker compose, we don't see the error. The OS image is the same as on the VM in GCP:

root@c29ddbf435bd:/src# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

but its kernel is 5.10.47-linuxkit.

Also, when running Concourse with the containerd runtime, which uses runc v1.1.4 directly, we don't see the error in either local Docker or the GCP VM.

Maybe it is related to the older runc currently used in guardian, which might not work well with the newer kernel in Ubuntu Jammy Jellyfish?

cf-gitbot commented 2 years ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

MarcPaquette commented 2 years ago

This issue is being worked on under the Garden-runc-release/#233 issue

dtimm commented 1 year ago

It looks like this is the same issue that other container runtimes have had with Jammy: https://github.com/containers/podman/issues/12559

Jammy uses cgroup v2 in the kernel, and it delegates cgroup authority to sub-processes (like the container runtime) as cgroup v2. runc has supported cgroup v2 since its v1.0.0 release, but gdn also alters cgroups directly using the old v1 schema: https://github.com/cloudfoundry/guardian/blob/8deac7e439aca41e515a74d7c8489081b8961b97/guardiancmd/command_linux.go#L307

This will require some substantial changes in how cgroups are managed in guardian in order to support new distributions that have switched to cgroupv2.

xtremerui commented 1 year ago

Some updates:

Concourse with the latest gdn runs successfully on an image based on the gcloud image family ubuntu-2204-lts with cgroups v1 enabled.

MarcPaquette commented 1 year ago

Hi @xtremerui, is this issue still outstanding for you, or did the newer image resolve it?

xtremerui commented 1 year ago

@MarcPaquette the image with cgroups v1 enabled works for us. We are still hoping gdn will work on an image where only cgroups v2 is available.

dsabeti commented 9 months ago

@xtremerui Our team is starting to scope the work to use cgroups v2 only. We'll keep you updated as that work starts to get done.

xtremerui commented 9 months ago

@dsabeti this is great news! Thank you and the team.

MarcPaquette commented 8 months ago

Looking into this, we'd need to get a new stemcell built to allow the usage of cgroup v2. Currently the bosh stemcell builder is forcing us to use v1: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/blob/57cd1eb14ddebd9666f15e83ecfa18f31350d45f/stemcell_builder/stages/image_install_grub/apply.sh#L89

I'm working on discussing this with Product Management.
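One way to see whether a booted image has been pinned to cgroup v1 is to look at the kernel command line. On systemd-based distributions the parameter systemd.unified_cgroup_hierarchy=0 is commonly used for this. Whether the linked stemcell builder script sets exactly this parameter is an assumption here, not something confirmed in this thread. The sketch below uses a hypothetical helper, kernelParam, that does a simple whitespace split and ignores quoting.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kernelParam is a hypothetical helper that looks up key in a kernel
// command line string. It returns the value and whether the key was
// present. A bare flag without "=" yields an empty value. If the key
// appears more than once, this function returns the last occurrence.
// Quoted values containing spaces are not handled.
func kernelParam(cmdline, key string) (string, bool) {
	value, found := "", false
	for _, field := range strings.Fields(cmdline) {
		name, val, _ := strings.Cut(field, "=")
		if name == key {
			value, found = val, true
		}
	}
	return value, found
}

func main() {
	raw, err := os.ReadFile("/proc/cmdline")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/cmdline:", err)
		os.Exit(1)
	}
	const key = "systemd.unified_cgroup_hierarchy"
	if val, ok := kernelParam(string(raw), key); ok {
		fmt.Printf("%s=%s\n", key, val)
	} else {
		fmt.Printf("%s is not set on the kernel command line\n", key)
	}
}
```

A value of 0 would suggest the image was built to boot with the legacy v1 hierarchy. If the parameter is absent, the distribution's default applies.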

MarcPaquette commented 5 months ago

I'm going to close out this issue, as it's a known issue and we have future plans to resolve it. We're waiting on the Stemcell builds that enable this feature by default.