cncf-tags / container-device-interface

Sort mounts with stable sort #240

Open Epsilon314 opened 1 week ago

Epsilon314 commented 1 week ago

We got an unexpected change in the order of mounts after introducing CDI (indirectly, through updating the NVIDIA Container Toolkit), which in specific cases led to failed DinD container creation. The change in mount order comes from the way sortMounts works, shown below: https://github.com/cncf-tags/container-device-interface/blob/main/pkg/cdi/container-edits.go#L358

// sortMounts sorts the mounts in the given OCI Spec.
func sortMounts(specgen *ocigen.Generator) {
    mounts := specgen.Mounts()
    specgen.ClearMounts()
    sort.Sort(orderedMounts(mounts))
    specgen.Config.Mounts = mounts
}

// orderedMounts defines how to sort an OCI Spec Mount slice.
// This is almost the same implementation as used by CRI-O and Docker,
// with a minor tweak for stable sorting order (easier to test):
//
//  https://github.com/moby/moby/blob/17.05.x/daemon/volumes.go#L26
type orderedMounts []oci.Mount

// Len returns the number of mounts. Used in sorting.
func (m orderedMounts) Len() int {
    return len(m)
}

// Less returns true if the number of parts (a/b/c would be 3 parts) in the
// mount indexed by parameter 1 is less than that of the mount indexed by
// parameter 2. Used in sorting.
func (m orderedMounts) Less(i, j int) bool {
    ip, jp := m.parts(i), m.parts(j)
    if ip < jp {
        return true
    }
    if jp < ip {
        return false
    }
    return m[i].Destination < m[j].Destination
}
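
For completeness: the snippet above elides the Swap and parts methods. In the moby code it links to, they are defined as follows (the CDI copy should be equivalent):

// Swap swaps two mounts. Used in sorting.
func (m orderedMounts) Swap(i, j int) {
    m[i], m[j] = m[j], m[i]
}

// parts returns the number of parts in the destination of a mount.
// Used in sorting.
func (m orderedMounts) parts(i int) int {
    return strings.Count(filepath.Clean(m[i].Destination), string(os.PathSeparator))
}
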
Two possible fixes:

  1. Go 1.19 changed the sort.Sort implementation, and sort.Sort is not guaranteed to be stable; we can use sort.Stable instead (cf. https://github.com/containerd/containerd/blob/main/internal/cri/opts/spec_linux_opts.go#L68).
  2. We can change the orderedMounts.Less method to compare only the number of path parts, so that the mount order submitted in the container spec is preserved for mounts of equal depth, in case it matters (see the sketch below).
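
A minimal sketch of the two suggestions combined, using the names from the snippet above (this illustrates the proposal, not necessarily the final patch):

// sortMounts sorts the mounts in the given OCI Spec, preserving the
// submitted order of mounts whose destinations have equal depth.
func sortMounts(specgen *ocigen.Generator) {
    mounts := specgen.Mounts()
    specgen.ClearMounts()
    sort.Stable(orderedMounts(mounts)) // stable: equal elements keep their order
    specgen.Config.Mounts = mounts
}

// Less compares only the number of path parts, so the stable sort keeps
// equal-depth mounts in the order the spec submitted them.
func (m orderedMounts) Less(i, j int) bool {
    return m.parts(i) < m.parts(j)
}
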
elezar commented 1 week ago

@Epsilon314 thanks for calling this out. Do you have example mounts so that we can include those cases in our tests while addressing this?

Epsilon314 commented 1 week ago

@elezar We encountered this issue when creating a Docker-in-Docker (DinD) container:

  1. Mount host paths into the container at /run/containerd and /proc/cpuinfo.
  2. Create another DinD container inside the first container. If /proc/cpuinfo is mounted before /run/containerd, runc sees that /run/containerd has a submount at /run/containerd/xxx/rootfs/proc/cpuinfo, and runc rbind-mounts all submounts. In this case creating the DinD container fails, since that extra submount causes mounting proc for the DinD container to fail, blocked by the mnt_already_visible check in the kernel. (A sketch of the ordering behavior follows this list.)
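
To make the ordering concrete, here is a standalone sketch (with a local stand-in mount type, not code from the repo). Both destinations have the same number of path parts, so the current lexical tie-break always puts /proc/cpuinfo before /run/containerd, regardless of the order submitted in the spec:

package main

import (
    "fmt"
    "path/filepath"
    "sort"
    "strings"
)

// mount is a local stand-in for the OCI Mount type; only Destination
// matters for sorting.
type mount struct{ Destination string }

type orderedMounts []mount

func (m orderedMounts) Len() int      { return len(m) }
func (m orderedMounts) Swap(i, j int) { m[i], m[j] = m[j], m[i] }

// parts counts the path components of a destination ("/" as separator,
// since these are in-container Linux paths).
func (m orderedMounts) parts(i int) int {
    return strings.Count(filepath.Clean(m[i].Destination), "/")
}

// Less mirrors the current behavior: compare depth first, then tie-break
// lexically on Destination.
func (m orderedMounts) Less(i, j int) bool {
    ip, jp := m.parts(i), m.parts(j)
    if ip != jp {
        return ip < jp
    }
    return m[i].Destination < m[j].Destination
}

func main() {
    // Submitted order: /run/containerd before /proc/cpuinfo.
    mounts := orderedMounts{{"/run/containerd"}, {"/proc/cpuinfo"}}
    sort.Sort(mounts)
    // Both destinations have two parts, so the tie-break wins and this
    // prints "/proc/cpuinfo /run/containerd": the order that breaks the
    // nested setup described above.
    fmt.Println(mounts[0].Destination, mounts[1].Destination)
}
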

I've added a unit test case in the PR to demonstrate the desired result.
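
(A hypothetical shape for such a test, reusing the repo's orderedMounts and assuming the stable, depth-only sort from the suggestions above; this is not the actual PR test:)

func TestEqualDepthMountsKeepSubmittedOrder(t *testing.T) {
    mounts := orderedMounts{
        {Destination: "/run/containerd"},
        {Destination: "/proc/cpuinfo"},
    }
    // With a depth-only Less, both mounts compare equal, so the stable
    // sort must keep the submitted order.
    sort.Stable(mounts)
    if mounts[0].Destination != "/run/containerd" {
        t.Fatalf("expected /run/containerd to stay first, got %q", mounts[0].Destination)
    }
}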

klihub commented 1 week ago

> @elezar We encountered this issue when creating a Docker-in-Docker (DinD) container: […]

Okay, so was this problem only reproducible in nested containers? Because otherwise, in a normal non-nested case, it should be guaranteed that equally long container paths (in terms of directory components in the path) cannot have mount-ordering dependencies on each other; only longer paths can depend on shorter ones (e.g., /a/b can never contain /a/c), right?

Epsilon314 commented 1 week ago

@klihub I think so.

elezar commented 1 week ago

@Epsilon314 the PR looks good to me. Thanks.

One question is how to distribute this change. For clients such as the NVIDIA Container Toolkit mentioned above, pulling in a new version of the CDI packages should not be a problem. Do we need to do anything specific for other clients that may be consuming an older version of the CDI package? @klihub do you have thoughts on that?

klihub commented 1 week ago

> One question is how to distribute this change. […] @klihub do you have thoughts on that?

I think that if this only bites folks who run a runtime-in-a-runtime, usually DinD, then it probably directly affects only a marginally small set of people. So I wouldn't worry about it unless more bug reports about this start flowing in. IOW, I'd make this part of our 1.0 release, then update containerd 2.0 and 1.7, and CRI-O main and the latest two minor release series, for a start.