fix: Cleaning up orphaned directories and files when containers creat…

haytok commented 1 month ago

…ion fails

When trying to create new containers, the following directories are created based on the new container ID:

/var/lib/nerdctl/1935db59/containers/<namespace>/<container id>
/var/lib/nerdctl/1935db59/etchosts/<namespace>/<container id>

If containers with existing names are attempted to be created, the processes fail. However, in the events of failures, these directories are not cleaned up.

This issue is reported in the following:

https://github.com/containerd/nerdctl/issues/2993

This commit resolves the issue by cleaning up the mentioned directories when containers creation fails.

haytok commented 1 month ago

Hi, @djdongjin

Thanks for comments !!!

Is this testable? E.g., can we add a test in https://github.com/containerd/nerdctl/blob/772feb40ee9d0d51db6da8ac3a8f737bbb48475f/cmd/nerdctl/container/container_create_linux_test.go?

I think it is difficult to implement tests based on the number of files in the directory under /var/lib/nerdctl/1935db59/ in unit tests.

Is there any E2E test that can perform tests based on the number of local files ... ????

apostasie commented 1 month ago

Is there any E2E test that can perform tests based on the number of local files ... ????

Here is an e2e test for you that does it (the tricky part is to access the right location):

func TestIssue2993(t *testing.T) {
    nerdtest.Setup()

    testCase := &test.Case{
        Description: "Issue #2993 - nerdctl is no longer leaking etchosts dirs & files on failure",
        // Make sure we are in a private env so that other tests do not pollute this
        Require: nerdtest.Private,
        // Run a first container successfully
        Setup: func(data test.Data, helpers test.Helpers) {
            helpers.Ensure("run", "--name", data.Identifier(), testutil.CommonImage)
        },
        Cleanup: func(data test.Data, helpers test.Helpers) {
            helpers.Anyhow("rm", "-f", data.Identifier())
        },
        // Try to run a second container with the same name
        Command: func(data test.Data, helpers test.Helpers) test.Command {
            return helpers.Command("run", "--name", data.Identifier(), testutil.CommonImage)
        },
        Expected: func(data test.Data, helpers test.Helpers) *test.Expected {
            return &test.Expected{
                // We expect the command to fail
                ExitCode: 1,
                // As the name is already used
                Errors: []error{errors.New("is already used by")},
                Output: func(stdout string, info string, t *testing.T) {
                    // Get the test DataRoot, the containerd address, and get the dataStore location
                    base := testutil.NewBase(t)
                    dataStore, err := clientutil.DataStore(string(data.ReadConfig(nerdtest.DataRoot)), base.ContainerdAddress())
                    assert.NilError(t, err, "data store failed")
                    // Now, we are interested in "etchosts", and in the Namespace (data.Identifier() with `nerdtest.Private`)
                    etchostsdir := filepath.Join(dataStore, "etchosts", data.Identifier())
                    // Read the dir
                    files, err := os.ReadDir(etchostsdir)
                    assert.NilError(t, err, fmt.Sprintf("failed reading %s", etchostsdir))
                    // We are expecting only one directory there, as the other container should have deleted its directory on failure
                    assert.Equal(t, len(files), 1, "there should be only 1 subdirectory in the etchosts store")
                },
            }
        },
    }

    testCase.Run(t)
}

Obviously, the important part is the assert.Equal(t, len(files), 1 which verifies that the failed container did not leave a second directory behind.

haytok commented 1 month ago

Thanks !!!!!!! I'll add test for this bug.

apostasie commented 1 month ago

Pending addition of a test validating the fix, LGTM.

CI failures are unrelated (windows regression #3437).

haytok commented 1 month ago

Added tests work well in my local env, but they fail in GitHub Actions.

Details

``` === RUN TestIssue2993 [444](https://github.com/containerd/nerdctl/actions/runs/11031877260/job/30639560230?pr=3457#step:7:445)=== RUN TestIssue2993/Issue_#2993_-_nerdctl_no_longer_leaks_containers_and_etchosts_directories_and_files_when_container_creation_fails_or_a_container_is_removed. [445](https://github.com/containerd/nerdctl/actions/runs/11031877260/job/30639560230?pr=3457#step:7:446) container_create_linux_test.go:216: assertion failed: error is not nil: open /tmp/1935db59/containers/nerdctl-test: no such file or directory [446](https://github.com/containerd/nerdctl/actions/runs/11031877260/job/30639560230?pr=3457#step:7:447)=== RUN TestIssue2993/Issue_#2993_-_nerdctl_no_longer_leaks_containers_and_etchosts_directories_and_files_when_container_creation_fails_or_a_container_is_removed.#01 [447](https://github.com/containerd/nerdctl/actions/runs/11031877260/job/30639560230?pr=3457#step:7:448) container_create_linux_test.go:235: assertion failed: error is not nil: open /tmp/1935db59/containers/nerdctl-test: no such file or directory [448](https://github.com/containerd/nerdctl/actions/runs/11031877260/job/30639560230?pr=3457#step:7:449)--- FAIL: TestIssue2993 (0.10s) [449](https://github.com/containerd/nerdctl/actions/runs/11031877260/job/30639560230?pr=3457#step:7:450) --- FAIL: TestIssue2993/Issue_#2993_-_nerdctl_no_longer_leaks_containers_and_etchosts_directories_and_files_when_container_creation_fails_or_a_container_is_removed. (0.06s) [450](https://github.com/containerd/nerdctl/actions/runs/11031877260/job/30639560230?pr=3457#step:7:451) --- FAIL: TestIssue2993/Issue_#2993_-_nerdctl_no_longer_leaks_containers_and_etchosts_directories_and_files_when_container_creation_fails_or_a_container_is_removed.#01 (0.00s) ```

There seems to be a problem with the creation of the data-root (/tmp/1935db59/) in the test, so I'll investigate.

If there are any changes that need to be made about the test, I would like your advise

Updated

I forgot to run go test -p 1 ../../../cmd/nerdctl/... -run "TestIssue2993" -args -test.target=docker in my local env.

nerdctl/docs/testing/README.md at main · containerd/nerdctl

When running the command, I can check the failure of tests occured in GitHub Actions.

Details

```bash /Users/haytok/workspace/nerdctl/cmd/nerdctl/container [fix-to-clean-up-orphaned-files] > go test -p 1 ../../../cmd/nerdctl/... -run "TestIssue2993" -args -test.target=docker ok github.com/containerd/nerdctl/v2/cmd/nerdctl 0.005s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/apparmor 0.004s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/builder 0.004s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/completion 0.004s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/compose 0.005s [no tests to run] test target: "docker" [rm --data-root /tmp -f nerdctl-testissue2993] [run --name nerdctl-testissue2993 --data-root /tmp -d ghcr.io/stargz-containers/alpine:3.13-org sleep infinity] --- FAIL: TestIssue2993 (0.04s) --- FAIL: TestIssue2993/Issue_#2993_-_nerdctl_no_longer_leaks_containers_and_etchosts_directories_and_files_when_container_creation_fails_or_a_container_is_removed. (0.02s) container_create_linux_test.go:216: assertion failed: error is not nil: open /tmp/1935db59/containers/nerdctl-test: no such file or directory --- FAIL: TestIssue2993/Issue_#2993_-_nerdctl_no_longer_leaks_containers_and_etchosts_directories_and_files_when_container_creation_fails_or_a_container_is_removed.#01 (0.00s) container_create_linux_test.go:235: assertion failed: error is not nil: open /tmp/1935db59/containers/nerdctl-test: no such file or directory FAIL FAIL github.com/containerd/nerdctl/v2/cmd/nerdctl/container 0.042s ? github.com/containerd/nerdctl/v2/cmd/nerdctl/helpers [no test files] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/image 0.005s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/inspect 0.004s [no tests to run] ? github.com/containerd/nerdctl/v2/cmd/nerdctl/internal [no test files] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/ipfs 0.004s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/login 0.004s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/namespace 0.004s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/network 0.005s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/system 0.007s [no tests to run] ok github.com/containerd/nerdctl/v2/cmd/nerdctl/volume 0.005s [no tests to run] FAIL ```

Therefore, I'll investigate how to retrieve dataStore in test code.

haytok commented 1 month ago

Hi, @apostasie

Thanks for the review.

I have added tests for this issue, so could you re-review when you have time ?

CI failure is related to this comment (https://github.com/containerd/nerdctl/pull/3457#issuecomment-2373055605)

haytok commented 1 month ago

The local branch was out of date and I was unaware that there was documentation for the test. Thanks for sharing! I will check the documentation and fix it.

apostasie commented 1 month ago

@haytok can you squash?

Reviewers, about the CI failures:

https://github.com/containerd/nerdctl/actions/runs/11074391386/job/30773073228?pr=3457#step:6:537 - container_run_mount_linux_test.go:367: assertion failed: error is not nil: invalid /proc/mounts format "" - no idea :/ - fluke?
https://github.com/containerd/nerdctl/actions/runs/11074391386/job/30773072030?pr=3457#step:6:2968
- this is #3396
https://github.com/containerd/nerdctl/actions/runs/11074391387/job/30773073216?pr=3457#step:9:568
- this is #3437

haytok commented 1 month ago

Hi, @apostasie

I have squashed all commits.

apostasie commented 1 month ago

@haytok appreciate all your efforts here! Looks good. Now to maintainers. @djdongjin @AkihiroSuda

containerd / nerdctl

fix: Cleaning up orphaned directories and files when containers creat… #3457