containerd / nerdctl

network appears to not be immediately available after a call to `network create` #3092

Open apostasie opened 3 months ago

apostasie commented 3 months ago

Description

There seems to be a delay between the return of `network create` and the network actually being available, or between the return of container creation and the operation that attaches the container to the network.

Specifically: creating a network and immediately running a container attached to it will fail occasionally.

Steps to reproduce the issue

Repeatedly create a network and immediately run a container attached to it, then destroy both in short succession.
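
The steps above could be scripted roughly as follows. This is only a sketch, not the actual test from the suite: the network names, the `alpine` image, the iteration count, and the `NERDCTL` / `IMAGE` override variables are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical reproduction helper. Names, image and counts are assumptions,
# not taken from the nerdctl test suite.
NERDCTL="${NERDCTL:-nerdctl}"

# Create a network, immediately run a throwaway container on it,
# then remove the network again. Returns the exit status of the run.
repro_once() {
  net="$1"
  "$NERDCTL" network create "$net" >/dev/null || return 1
  # --rm removes the container on exit; --network attaches it to the new network.
  "$NERDCTL" run --rm --network "$net" "${IMAGE:-alpine}" true
  status=$?
  "$NERDCTL" network rm "$net" >/dev/null
  return "$status"
}

# Repeat until a failure shows up or the iteration budget is exhausted.
repro_loop() {
  count="${1:-100}"
  i=1
  while [ "$i" -le "$count" ]; do
    if ! repro_once "repro-net-$i"; then
      echo "failed on iteration $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "no failure in $count iterations"
}
```

After sourcing the file, `repro_loop 200` runs the cycle 200 times. Since the failure is intermittent, a clean run does not prove the race is absent.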

Describe the results you received and expected

The command errors out, complaining that the network does not exist.

This is being run as part of our test suites (see network remove tests).

    network_remove_linux_test.go:80: assertion failed: res.ExitCode is not exitCode: time="2024-06-16T00:53:11Z"
    level=fatal msg="failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process:
    error during container init: error running hook #0: error running hook: exit status 1, stdout: ,
    stderr: time=\"2024-06-16T00:53:11Z\" level=fatal msg=\"no such network: \\\"nerdctl-testnetworkremovewithstoppedcontainer\\\"\"\n
    Failed to write to log, write /var/lib/nerdctl/1935db59/containers/nerdctl-test/ce14cea63aab995e164f80901fbcb4bf7eedfa0a2b2da66a1c3923a90f52c474/oci-hook.startContainer.log: file already closed: unknown"

What version of nerdctl are you using?

1.7.6

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

No response

apostasie commented 3 months ago

Still not completely sure what is going on. The error sometimes comes as:

network_remove_linux_test.go:110: assertion failed: res.ExitCode is not exitCode: time="2024-06-16T00:39:09Z" level=fatal msg="container \"3a17eace71ae75ab7711011dc98e90e16d2bdf1af7ffc2b198399831236ef8b9\" in namespace \"nerdctl-test\": not found"

This message comes from the containerd metadata service, very likely called from `getContainer` in local.go.

It is unclear to me why this is happening. My hypothesis is that the container was not created properly and that, because it was run with --rm, containerd then tries to delete a container that does not exist. So this message may indicate a bug in containerd's logic.

apostasie commented 2 months ago

This second variant is fixed by #3192

The first error still needs to be investigated.