containerd / nerdctl

contaiNERD CTL - Docker-compatible CLI for containerd, with support for Compose, Rootless, eStargz, OCIcrypt, IPFS, ...
Apache License 2.0

nerdctl fails when running concurrently due to CNI errors: `CHAIN_USER_ADD failed (File exists): chain CNI-ISOLATION-STAGE-2` #2908

Open aojea opened 4 months ago

aojea commented 4 months ago

Description

See https://github.com/kubernetes-sigs/kind/issues/3533

Command Output: time="2024-04-01T08:34:37Z" level=fatal msg="failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: time=\"2024-04-01T08:34:34Z\" level=fatal msg=\"failed to call cni.Setup: plugin type=\\\"firewall\\\" failed (add): running [/usr/sbin/iptables -t filter -N CNI-ISOLATION-STAGE-2 --wait]: exit status 4: iptables v1.8.7 (nf_tables):  CHAIN_USER_ADD failed (File exists): chain CNI-ISOLATION-STAGE-2\\n\"\nFailed to write to log, write /var/lib/nerdctl/1935db59/containers/default/18e88fcb538d49417539810b2567922886e120771be00f165b5d64cf41a381f5/oci-hook.createRuntime.log: file already closed: unknown"

Stack Trace: 
sigs.k8s.io/kind/pkg/errors.WithStack
    sigs.k8s.io/kind/pkg/errors/errors.go:59
sigs.k8s.io/kind/pkg/exec.(*LocalCmd).Run
    sigs.k8s.io/kind/pkg/exec/local.go:124
sigs.k8s.io/kind/pkg/cluster/internal/providers/nerdctl.createContainerWithWaitUntilSystemdReachesMultiUserSystem
    sigs.k8s.io/kind/pkg/cluster/internal/providers/nerdctl/provision.go:383
sigs.k8s.io/kind/pkg/cluster/internal/providers/nerdctl.planCreation.func3
    sigs.k8s.io/kind/pkg/cluster/internal/providers/nerdctl/provision.go:123
sigs.k8s.io/kind/pkg/errors.UntilErrorConcurrent.func1
    sigs.k8s.io/kind/pkg/errors/concurrent.go:30
runtime.goexit
    runtime/asm_amd64.s:1598
Error: Process completed with exit code 1.

Steps to reproduce the issue

This seems to be reproducible by running multiple containers in parallel: at some point the CNI plugins race and one of them fails.

Describe the results you received and expected

CNI is a nice and simple implementation for container networking, but for more complex operations it always falls short because of this simplicity. When implementing more advanced features, the chaining model executes different binaries that perform different operations, and these may need to be synchronized across containers. Docker or podman moved to a different model from CNI, libnetwork and netavark, because of this. I don't think that is strictly necessary, though: CNI can still handle these problems if nerdctl creates its own CNI plugin implementation instead of relying on the composition of multiple reference implementation plugins.

I'm happy to collaborate on this if needed. I'll just need a bit of bootstrapping on the requirements, but it does not seem like a complicated problem.

What version of nerdctl are you using?

NERDCTL_VERSION: 1.7.4

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

No response

AkihiroSuda commented 4 months ago

Maybe we should just have a flock on calling CNI?
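A minimal sketch of what such a flock could look like, assuming a Unix host and a dedicated lock file. The names `withCNILock` and `peakConcurrency` and the lock path are hypothetical, chosen for illustration, and are not taken from nerdctl. The callback stands in for the actual CNI call; `peakConcurrency` only exists to show that callers are serialized.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"syscall"
	"time"
)

// withCNILock runs fn while holding an exclusive advisory lock on lockPath.
// Hypothetical helper: fn stands in for the CNI setup/teardown call.
// syscall.Flock is only available on Unix-like systems.
func withCNILock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return fmt.Errorf("open lock file: %w", err)
	}
	defer f.Close() // closing the descriptor also drops the lock

	// Block until the lock is acquired, retrying if a signal interrupts us.
	for {
		err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX)
		if err == nil {
			break
		}
		if err != syscall.EINTR {
			return fmt.Errorf("flock %s: %w", lockPath, err)
		}
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	return fn()
}

// peakConcurrency starts n goroutines that all go through withCNILock and
// reports the highest number of callers seen inside the locked section.
func peakConcurrency(lockPath string, n int) (int, error) {
	var (
		wg           sync.WaitGroup
		mu           sync.Mutex
		inside, peak int
		firstErr     error
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := withCNILock(lockPath, func() error {
				mu.Lock()
				inside++
				if inside > peak {
					peak = inside
				}
				mu.Unlock()
				time.Sleep(2 * time.Millisecond) // simulated CNI work
				mu.Lock()
				inside--
				mu.Unlock()
				return nil
			})
			if err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return peak, firstErr
}

func main() {
	lockPath := filepath.Join(os.TempDir(), "nerdctl-cni-example.lock")
	peak, err := peakConcurrency(lockPath, 8)
	fmt.Println("peak callers inside the lock:", peak, "error:", err)
}
```

Because the lock lives on a file, it would also serialize separate nerdctl processes, which an in-process mutex cannot do. The cost is that all CNI calls on the host run one at a time.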

> Docker or podman moved to a different model from CNI, libnetwork and netavark, because of this

nit: libnetwork predates CNI, and Docker had never implemented CNI

aojea commented 4 months ago

> Maybe we should just have a flock on calling CNI?

Yeah, that sounds simple enough. However, it seems to be a bug in the CNI plugin itself: it should handle iptables concurrency.
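One way a plugin could tolerate this race is to treat "chain already exists" as success when creating the chain. The sketch below is an illustration under assumptions, not the firewall plugin's actual code: `ensureChain`, `runner`, `fakeRunner` and `iptablesRunner` are hypothetical names, the `File exists` match comes from the nf_tables error in the log above, and the `Chain already exists` wording for the legacy backend is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// runner executes iptables with the given arguments and returns its stderr.
// It is injectable so the logic can be exercised without root or iptables.
type runner func(args ...string) (stderr string, err error)

// chainAlreadyExists reports whether stderr looks like a lost creation race.
// "File exists" matches the nf_tables message quoted in this issue; the
// legacy-backend wording is an assumption.
func chainAlreadyExists(stderr string) bool {
	return strings.Contains(stderr, "File exists") ||
		strings.Contains(stderr, "Chain already exists")
}

// ensureChain creates the chain and treats "already exists" as success,
// so two containers creating it at the same time both proceed.
func ensureChain(run runner, table, chain string) error {
	stderr, err := run("-t", table, "-N", chain, "--wait")
	if err == nil || chainAlreadyExists(stderr) {
		return nil
	}
	return fmt.Errorf("create chain %s: %w: %s", chain, err, strings.TrimSpace(stderr))
}

// iptablesRunner shells out to the real binary (requires root privileges).
func iptablesRunner(args ...string) (string, error) {
	var stderr strings.Builder
	cmd := exec.Command("iptables", args...)
	cmd.Stderr = &stderr
	err := cmd.Run()
	return stderr.String(), err
}

// fakeRunner simulates an iptables invocation for demonstration purposes.
func fakeRunner(stderr string, failed bool) runner {
	return func(args ...string) (string, error) {
		if failed {
			return stderr, errors.New("exit status 4")
		}
		return stderr, nil
	}
}

func main() {
	// Simulate the losing side of the race reported in this issue.
	lost := fakeRunner("iptables v1.8.7 (nf_tables):  CHAIN_USER_ADD failed (File exists): chain CNI-ISOLATION-STAGE-2\n", true)
	fmt.Println("race lost:", ensureChain(lost, "filter", "CNI-ISOLATION-STAGE-2"))

	// Any other failure is still reported.
	other := fakeRunner("iptables: Permission denied\n", true)
	fmt.Println("other failure:", ensureChain(other, "filter", "CNI-ISOLATION-STAGE-2"))
}
```

Matching on stderr text is fragile across iptables versions and backends, so this is only meant to show the shape of an idempotent create. Passing `iptablesRunner` instead of the fake would run against the real binary.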

> nit: libnetwork predates CNI, and Docker had never implemented CNI

yeah, just tried to highlight the diversity of opinions :)