containerd / stargz-snapshotter

Fast container image distribution plugin with lazy pulling
https://github.com/containerd/containerd/issues/3731
Apache License 2.0

Stargz containers fail to run in Kubernetes #1414

Closed rsmitty closed 10 months ago

rsmitty commented 11 months ago

Hi there!

I'm working on a system extension for stargz-snapshotter to run inside Talos Linux (an OS built specifically for Kubernetes). I've managed to get the snapshotter compiled and running when the system boots, but I'm hitting a problem when it comes to actually launching pods.

Using the pre-built image, a pod launch looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: alpine-esgz
spec:
  containers:
  - name: alpine-esgz
    image: ghcr.io/stargz-containers/alpine:3.15.3-esgz
    imagePullPolicy: Always

The image seems to pull successfully, but the pod then fails with a RunContainerError like:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown

Running the pod with command: defined to point directly at /bin/sh fails in the same way. I haven't been able to find anything notable in the snapshotter or containerd logs, only errors like this from the snapshotter (which I think are just caused by the pod crashlooping):

192.168.1.111: {"dir":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/183/fs","error":"specified path \"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/183/fs\" isn't a mountpoint","level":"debug","msg":"failed to unmount","time":"2023-10-10T17:29:36.564820765Z"}

It should be noted that with Talos, these system extensions run as a container and I'm mounting /dev, /var, and /run into this container.

My config.toml for the snapshotter is currently empty. The containerd config looks like the following, which is a merge of the configs required from the snapshotter docs, as well as the Talos defaults:

version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      disable_snapshot_annotations = false
      snapshotter = "stargz"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          discard_unpacked_layers = true
          runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/cri/conf.d/hosts"
      [plugins."io.containerd.grpc.v1.cri".registry.configs]

[proxy_plugins]
  [proxy_plugins.stargz]
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
    type = "snapshot"
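For reference, a quick way to exercise the snapshotter outside of Kubernetes (assuming the ctr-remote binary from this repo is available on the node; commands per my reading of its README) would be something like:

# lazily pull an estargz image through the stargz snapshotter
ctr-remote images rpull ghcr.io/stargz-containers/alpine:3.15.3-esgz

# run it; if /bin/sh resolves here, the FUSE mounts are reaching containerd
ctr-remote run --snapshotter=stargz --rm -t ghcr.io/stargz-containers/alpine:3.15.3-esgz test /bin/sh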

It feels like I'm missing a simple mount or a configuration somewhere and I'm wondering if anyone may have seen this before and can help push me in the right direction. Thanks!

ktock commented 11 months ago

It should be noted that with Talos, these system extensions run as a container and I'm mounting /dev, /var, and /run into this container.

@rsmitty How is the mount propagation configured? All mount events under /var/lib/containerd-stargz-grpc need to be shared with containerd's namespace, so something like rshared is needed for the bind mount (docker example).
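For example, with docker the bind mount would look roughly like this (a sketch; the image name is a placeholder, and /dev/fuse plus SYS_ADMIN are what FUSE inside a container typically needs):

docker run -d --name stargz-snapshotter \
  --device /dev/fuse \
  --cap-add SYS_ADMIN \
  --mount type=bind,src=/var/lib/containerd-stargz-grpc,dst=/var/lib/containerd-stargz-grpc,bind-propagation=rshared \
  stargz-snapshotter-image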

rsmitty commented 11 months ago

Hey @ktock, thanks for chiming in. I believe the mount for /var is indeed correct. It's mounted with the following options:

    - source: /var
      destination: /var
      type: bind
      options:
        - rshared
        - rbind
        - rw

Are there any paths other than /dev, /var, and /run that need to be mounted in from the host?

rsmitty commented 11 months ago

Here are the containerd logs surrounding the launch of one of the containers. They don't really help much beyond the same path-related error, and nothing suggests that containerd is failing to find the image layers or anything of that nature:

192.168.1.111: {"level":"info","msg":"PullImage \"ghcr.io/stargz-containers/alpine:3.15.3-esgz\" returns image reference \"sha256:d087dacb46e24b2791f34f832582114a7309b0c2613c56d83e4e96d6d04b88a7\"","time":"2023-10-11T14:47:59.252936653Z"}
192.168.1.111: {"level":"info","msg":"CreateContainer within sandbox \"c9374e087cf0b85f7560acf58a0f7863032e72404f990551f00073753b0afbab\" for container \u0026ContainerMetadata{Name:ubu-esgz,Attempt:67,}","time":"2023-10-11T14:47:59.254881392Z"}
192.168.1.111: {"level":"info","msg":"CreateContainer within sandbox \"c9374e087cf0b85f7560acf58a0f7863032e72404f990551f00073753b0afbab\" for \u0026ContainerMetadata{Name:ubu-esgz,Attempt:67,} returns container id \"dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134\"","time":"2023-10-11T14:47:59.575791738Z"}
192.168.1.111: {"level":"info","msg":"StartContainer for \"dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134\"","time":"2023-10-11T14:47:59.576566385Z"}
192.168.1.111: {"id":"dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134","level":"info","msg":"shim disconnected","time":"2023-10-11T14:47:59.659983960Z"}
192.168.1.111: {"id":"dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134","level":"warning","msg":"cleaning up after shim disconnected","namespace":"k8s.io","time":"2023-10-11T14:47:59.660261183Z"}
192.168.1.111: {"level":"info","msg":"cleaning up dead shim","time":"2023-10-11T14:47:59.660339024Z"}
192.168.1.111: {"level":"warning","msg":"cleanup warnings time=\"2023-10-11T14:47:59Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=24244 runtime=io.containerd.runc.v2\ntime=\"2023-10-11T14:47:59Z\" level=warning msg=\"failed to read init pid file\" error=\"open /run/containerd/io.containerd.runtime.v2.task/k8s.io/dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134/init.pid: no such file or directory\" runtime=io.containerd.runc.v2\n","time":"2023-10-11T14:47:59.671494840Z"}
192.168.1.111: {"error":"read /proc/self/fd/131: file already closed","level":"error","msg":"copy shim log","time":"2023-10-11T14:47:59.671844583Z"}
192.168.1.111: {"error":"reading from a closed fifo","level":"error","msg":"Failed to pipe stdout of container \"dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134\"","time":"2023-10-11T14:47:59.672125966Z"}
192.168.1.111: {"error":"reading from a closed fifo","level":"error","msg":"Failed to pipe stderr of container \"dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134\"","time":"2023-10-11T14:47:59.672412169Z"}
192.168.1.111: {"error":"failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"/bin/sh\": stat /bin/sh: no such file or directory: unknown","level":"error","msg":"StartContainer for \"dcd65a7139709ee9a35e7b0fdc6f07e6b4e6491bb916e02b9f8de5e4f0f16134\" failed","time":"2023-10-11T14:47:59.672990064Z"}
192.168.1.111: {"level":"info","msg":"RemoveContainer for \"f24af168123a416305cb037bc04c1d56d142ba7c861827057fef931e035c40b9\"","time":"2023-10-11T14:48:00.567123401Z"}
192.168.1.111: {"level":"info","msg":"RemoveContainer for \"f24af168123a416305cb037bc04c1d56d142ba7c861827057fef931e035c40b9\" returns successfully","time":"2023-10-11T14:48:00.569010259Z"}
ktock commented 11 months ago

@rsmitty Thanks for the information.

It's mounted with the following options:

What runtime is used with these options? containerd?

Are there any paths other than /dev, /var, and /run that need to be mounted in from the host?

They should be enough.

I have some questions about the mounts: after pulling an estargz image from the registry to the node, does mount | grep stargz print some estargz FUSE mounts both in the snapshotter container and on the host? And are the contents under /var/lib/containerd-stargz-grpc/snapshotter/snapshots/*/fs/ visible from the host?
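Concretely, something like this, run both in the snapshotter container and on the host:

# FUSE mounts created by the snapshotter (fuse.rawBridge) should appear in both places
mount | grep stargz

# the lazily-pulled layer contents should be visible through those mountpoints
ls /var/lib/containerd-stargz-grpc/snapshotter/snapshots/*/fs/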

rsmitty commented 11 months ago

What runtime is used with these options? containerd?

Yes, containerd.

I have some questions about the mounts: after pulling an estargz image from the registry to the node, does mount | grep stargz print some estargz FUSE mounts both in the snapshotter container and on the host? And are the contents under /var/lib/containerd-stargz-grpc/snapshotter/snapshots/*/fs/ visible from the host?

That is a good question. On the host, I can see that there are contents under the fs directory for the snapshots (there appear to be lots of them). That said, I can't see any mounts when grepping for stargz on the host. Maybe I'm missing something there? Anything special I should have configured for FUSE? I'm building that from source as well, since Talos doesn't have a package manager.
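For completeness, the basic FUSE prerequisites I know to check for (my own assumptions, not something from the snapshotter docs):

# fuse should be known to the kernel (built-in or loaded as a module)
grep fuse /proc/filesystems

# the device node the FUSE daemon needs must exist in the snapshotter's namespace
ls -l /dev/fuse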

ktock commented 11 months ago

@rsmitty Thanks for the information.

I can't see any mounts when grepping for stargz on the host.

If FUSE mountpoints (e.g. mount | grep stargz) are visible from the snapshotter container but are invisible from the host, then they don't seem to be propagated.

What does cat /proc/self/mountinfo show, in the snapshotter container and on the host, about the mount propagation relationship between the container's /var/ mountpoint and the host filesystem? It includes propagation information such as shared or master.
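A minimal way to compare the two sides (the optional shared:N / master:N field after the mount options is the part that matters):

# run on the host and in the snapshotter container, then compare
grep ' /var ' /proc/self/mountinfo

# example output (illustrative):
# 136 97 8:3 / /var rw,relatime shared:56 - ext4 /dev/sda3 rw
#                               ^^^^^^^^^
# both sides must be "shared" and in the same peer group for
# mount events to propagate in both directions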

maxpain commented 11 months ago

Any updates?

rsmitty commented 11 months ago

Okay, got some more time to hack on this today. I was able to see that, indeed, it seems to be a problem with mount propagation.

From the host, I can't see any mounts that are related to fuse.

From the stargz container, however, I can see:

192.168.1.111: 158 136 0:97 / /var/lib/containerd-stargz-grpc/snapshotter/snapshots/52/fs rw,nodev,relatime shared:56 - fuse.rawBridge stargz rw,user_id=0,group_id=0,allow_other

That said, I'm not quite sure what I'm missing here. /var is mounted into the container with rshared, rbind, and rw as mentioned above. My understanding of the mount docs is that this should propagate as expected. Any ideas on where I might be falling over at this point?

ktock commented 11 months ago

@rsmitty Thanks for the information. Both your host and container mountpoints need to be marked as shared to propagate mount events to each other. What's the actual propagation flag on /var/ in the stargz-snapshotter container (cat /proc/self/mountinfo | grep /var/ shows this)? And what's the actual propagation flag on your host's / (or /var/)? It can be inspected from /proc/self/mountinfo on the host. If the host is marked with a non-shared flag, mount events won't be propagated there.
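If the host side turns out to be private, remounting it as recursively shared is the usual fix, e.g. (whether and where this can be done on Talos is a separate question):

# on the host: mark /var and everything below it as shared
mount --make-rshared /var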

rsmitty commented 10 months ago

Going to close this for now, as I've been able to prove that switching into the PID 1 mount namespace makes this work as expected. Something on the Talos Linux side with mount propagation seems to be the most likely culprit. Thanks for the help @ktock
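For anyone who hits the same thing, the workaround that proved it for me was running the snapshotter in the host's mount namespace instead of the extension container's own, roughly like this (the binary and config paths are just illustrative for my setup):

# enter PID 1's mount namespace, then start the snapshotter there
nsenter --target 1 --mount -- \
  /usr/local/bin/containerd-stargz-grpc \
  --config /etc/containerd-stargz-grpc/config.toml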