firecracker-microvm / firecracker-containerd

firecracker-containerd enables containerd to manage containers as Firecracker microVMs
Apache License 2.0

TestMultipleVMs: "bad file descriptor" cleaning up task bundle in agent #363

Closed · sipsma closed this issue 4 years ago

sipsma commented 4 years ago

Saw an intermittent failure when deleting a task in TestMultipleVMs (see the failed "runtime isolated tests" step): https://buildkite.com/firecracker-microvm/firecracker-containerd/builds/1333#34210a80-f298-44d8-9ccf-1da67f783e5f

failed to cleanup task "container-12-0" exec "": 1 error occurred:
    * failed to remove bundle path "/container/container-12-0": unlinkat /container/container-12-0: bad file descriptor

It seems to be pretty rare (only happens to one exec every once in a while when running TestMultipleVMs). No strong idea what the issue is right away, but will look into whether the cleanup callbacks are referencing a file descriptor that could somehow be closed (maybe the wrapping os.File is getting garbage collected?)

sipsma commented 4 years ago

I've been looking into this more. While I've managed to reproduce it locally twice and have seen it in Buildkite again, it seems to be very rare. Have not managed to make it more reproducible.

As for what the cause could be, the error itself can be happening on one of these lines (from within the Go stdlib):

  1. https://github.com/golang/go/blob/e1df20edbb27c6fb273f530f1b4c6911c3b2e919/src/os/removeall_at.go#L74
  2. https://github.com/golang/go/blob/e1df20edbb27c6fb273f530f1b4c6911c3b2e919/src/os/removeall_at.go#L164

unlinkat(2) says that EBADF will be returned when dirfd is not a valid file descriptor. I didn't search exhaustively and am not sure whether the virtio drivers for the block device could be involved too, but this is the only place in the kernel where I can see EBADF returned on the unlinkat codepath. AFAICT it just means that the fd you provided to the syscall is not in your file descriptor table (i.e. you provided a wrong value, or you provided an fd that has already been closed).
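
For reference, a minimal standalone demonstration (just an illustration I put together, not project code) of what that EBADF means in practice, i.e. handing unlinkat a dirfd that has already been closed:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Open a directory to use as a dirfd, then close it immediately so the
	// number is no longer in our fd table.
	dirfd, err := unix.Open("/tmp", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		panic(err)
	}
	unix.Close(dirfd)

	// unlinkat rejects the stale number with EBADF, the same error that
	// os.RemoveAll surfaced in the failure above.
	err = unix.Unlinkat(dirfd, "does-not-matter", 0)
	fmt.Println(err == unix.EBADF) // true
}
```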

Our code does not manage the FD though; we just provide a path as a string to os.RemoveAll, which internally opens the FD. I was thus wondering whether the Go removeAll func could have a bug where the os.File wrapper for an FD value becomes unreachable (so a later GC invokes its finalizer, which closes the wrapped fd), but didn't see any possibility for that when reading the code.
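
For context, this is the failure mode I was worried about, sketched as a tiny standalone program (an illustration under my own assumptions, not anything found in the stdlib code): if the *os.File wrapping a raw fd becomes unreachable, a later GC runs its finalizer and closes the fd out from under anyone still holding the raw number.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	rawFD, err := unix.Open("/tmp", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		panic(err)
	}

	// Wrap the raw fd in an *os.File and immediately drop the wrapper, making
	// it unreachable. os.NewFile registers a finalizer that closes the fd.
	_ = os.NewFile(uintptr(rawFD), "/tmp")

	// Finalizers run asynchronously after a collection, so this is not
	// deterministic; a couple of GC cycles plus a short sleep is usually enough.
	runtime.GC()
	runtime.GC()
	time.Sleep(100 * time.Millisecond)

	// If the finalizer has run, the raw number is now invalid.
	_, err = unix.FcntlInt(uintptr(rawFD), unix.F_GETFD, 0)
	fmt.Println(err) // typically "bad file descriptor"
}
```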

My only other hunch is that our code or one of our dependencies could be managing FDs outside of an os.File. That would open two possibilities:

  1. An FD is handled both with the raw int value and with an os.File wrapper. If close calls aren't managed carefully, then you could end up closing the raw file descriptor and then, once the os.File is garbage collected, that same FD value getting closed again.
  2. An FD is managed entirely as a raw int (without any os.File) and close gets called on it multiple times. After the first close call, it's possible something else opens up a new file that gets assigned the same FD num, which means subsequent calls to close will be closing the wrong file (see the sketch after this list).
    • This occurs in containerd shim code for both console pollers and oom event pollers. But reading the code I haven't found any bugs that could result in close getting called twice on the same fd.
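
To make possibility 2 concrete, here's a small standalone sketch (my own illustration, not the actual shim code) of how a second close on a raw fd number ends up tearing down an unrelated file once the kernel reuses the number:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	oldFD, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		panic(err)
	}
	unix.Close(oldFD) // first close: the number is now free for reuse

	// An unrelated open typically receives the lowest free number, i.e. oldFD.
	dirfd, err := unix.Open("/tmp", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		panic(err)
	}
	fmt.Println("number reused:", dirfd == oldFD)

	// Buggy second close of the old number actually closes the new directory fd.
	unix.Close(oldFD)

	// The directory fd is now dead, so using it fails with EBADF, mirroring
	// the unlinkat failure in this issue.
	err = unix.Unlinkat(dirfd, "whatever", 0)
	fmt.Println(err) // bad file descriptor
}
```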

I've been finding the above occurrences by building agent without the -s flag (which strips DWARF symbols), running objdump -d agent and then searching for unix.Close or other calls that could be tied into this. So far vsock and those containerd epollers are the only possibilities I've found.

It's also possible I'm on the entirely wrong track here, so the next step is going to be trying to reproduce the error with strace attached to agent (hopefully strace won't prevent it from happening...). I've set up a script to continuously run TestMultipleVMs with strace attached to agent over the weekend; hopefully we'll be able to catch the error happening, which should give a much better idea of what's causing it.

sipsma commented 4 years ago

Did not manage to reproduce the exact error with strace yet (the run I started last week failed over the weekend due to my host accumulating kdmflush kernel processes and running out of pids...).

However, looking at the code again I've found something highly suspicious related to the containerd console code I mentioned in the previous comment. The code that runs on every task delete checks whether there are no tasks left and, if so, closes the platform, which in turn closes the console epoller, crucially using unix.Close directly on an FD. Even though the tests we run don't use any consoles, the initial epoller file is created anyway (verified with strace logs). A simplified sketch of this close path follows the steps below.

The problem here occurs when:

  1. You create a task
  2. You delete the task, resulting in the epoll file getting closed
  3. You create another task
  4. You delete the second task, resulting in the epoll file getting closed again
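
For illustration, the shape of the code involved is roughly this (a simplified sketch based on my reading, not the actual containerd/console implementation): the epoll fd is held as a raw int and Close is not idempotent, so every time the shim's task count drops to zero the same number gets handed to unix.Close again.

```go
package main

import "golang.org/x/sys/unix"

// epoller is a simplified stand-in for the console epoller described above.
type epoller struct {
	efd int // raw epoll fd, not wrapped in an *os.File
}

func newEpoller() (*epoller, error) {
	efd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		return nil, err
	}
	return &epoller{efd: efd}, nil
}

// Close is not guarded against being called twice. The second call closes
// whatever file currently owns this number, or returns EBADF if none does.
func (e *epoller) Close() error {
	return unix.Close(e.efd)
}

func main() {
	e, err := newEpoller()
	if err != nil {
		panic(err)
	}
	// Two "last task deleted" events in a row both reach Close.
	e.Close()
	e.Close() // second close hits a stale (or reused) fd number
}
```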

I wrote a different test case similar to TestMultipleVMs that creates containers in the VM in serial rather than parallel, and can now see a ton of logs from strace like:

close(5)                    = -1 EBADF (Bad file descriptor)

where FD 5 was previously created by a call to epoll_create. EBADF being returned by close is ignored by the code though, so the test cases so far have still passed even when this occurs.

So, to summarize, the agent process is capable of calling unix.Close on the runc v2 shim's console epoll FD value multiple times if it creates+deletes containers in serial. In theory, if this were to occur at the same time a container is being deleted (resulting in a call to os.RemoveAll), the close of the console epoll FD value could actually refer to an FD opened within the implementation of os.RemoveAll and result in unlinkat returning EBADF.
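
As a sanity check of the mechanism (a toy program, not the real agent or test setup, and it assumes fd numbers are reused lowest-first), something like this can occasionally reproduce the same EBADF from os.RemoveAll by racing a stray close of a stale fd number against the removal:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Stand-in for the console epoller: opened once, closed once, but the
	// stale number is kept around and "closed" again later.
	staleFD, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		panic(err)
	}
	unix.Close(staleFD)

	dir := filepath.Join(os.TempDir(), "bundle-race-demo")
	for i := 0; i < 1000000; i++ {
		if err := os.MkdirAll(filepath.Join(dir, "container"), 0755); err != nil {
			panic(err)
		}

		// Buggy second close fired concurrently, like the shim's epoller close.
		go unix.Close(staleFD)

		// With unlucky timing, RemoveAll's internal dirfd reuses staleFD's
		// number and gets closed out from under it, producing EBADF.
		if err := os.RemoveAll(dir); err != nil {
			fmt.Println("reproduced after", i, "iterations:", err)
			return
		}
	}
	fmt.Println("did not trigger this run; the race window is narrow")
}
```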

Going to try to make this more reproducible to confirm the idea.

sipsma commented 4 years ago

Okay, managed to reproduce this relatively quickly now with a test case that spins up 100 VMs in parallel and, in each VM, first creates a single container, deletes it, and then creates a bunch of containers in parallel. Hacky test code here: https://github.com/sipsma/firecracker-containerd/blob/d8f690056a1a05dc60d8d7a095b72ce7f6fb71c1/runtime/service_integ_test.go#L104

The strace output confirms what I described in the previous post (added comments explaining what's happening):

# The runc v2 shim's OOM epoller (fd 3) and console epoller (fd 5) are opened during agent start
time="2019-12-17T00:54:43.038364226Z" level=debug msg="[    4.940249] strace[852]: [pid   888] epoll_create1(EPOLL_CLOEXEC) = 3" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:54:43.041098250Z" level=debug msg="[    4.943036] strace[852]: [pid   888] epoll_create1(EPOLL_CLOEXEC) = 5" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
...
# The console epoller gets closed for the first time after the first container is deleted
time="2019-12-17T00:55:06.733991516Z" level=debug msg="[   28.627844] strace[852]: [pid   909] close(5)                    = 0" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
...
# A dirfd used internally by `os.RemoveAll` happens to end up as FD 5. A call to close the console epoller happens at the same time, resulting in the `unlinkat` call returning EBADF
time="2019-12-17T00:55:46.883608956Z" level=debug msg="[   68.786881] strace[852]: [pid  1013] openat(AT_FDCWD, \"/container\", O_RDONLY|O_CLOEXEC <unfinished ...>" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.889037293Z" level=debug msg="[   68.792390] strace[852]: [pid  1013] <... openat resumed> )      = 5" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.917471815Z" level=debug msg="[   68.821590] strace[852]: [pid   902] close(5 <unfinished ...>" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.919968315Z" level=debug msg="[   68.823746] strace[852]: [pid  1013] unlinkat(5, \"container-112-4\", 0 <unfinished ...>" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.922998860Z" level=debug msg="[   68.826228] strace[852]: [pid   902] <... close resumed> )       = 0" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.937295925Z" level=debug msg="[   68.829208] strace[852]: [pid  1013] <... unlinkat resumed> )    = -1 EBADF (Bad file descriptor)" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout

So I'm pretty confident this is the issue (though it's always possible there are more bugs that could cause the same behavior of closing the wrong FD).

As for possible fixes, this seems like it should probably be considered a bug in containerd's runc v2 shim. However, we need to check with them, as it's possible they explicitly did not want to handle the case where a shim stays alive after it has no containers left to manage and then starts managing more. If it turns out they explicitly don't want to handle this case, then we may need to look into other options, such as updating the agent to maintain a separate shim service object per task.

sipsma commented 4 years ago

Just an update, I've sent out PRs to fix this in containerd code on two levels:

  1. https://github.com/containerd/containerd/pull/3896 moves platform.Close to the Shutdown method of the runc v2 shim, ensuring platform functionality stays available until the shim is actually shutting down
  2. https://github.com/containerd/console/pull/34 wraps the unix.Close of the epoller FD in a sync.Once, ensuring that even if callers do invoke platform.Close multiple times, the FD re-use bug at least won't occur again (the pattern is sketched below)
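
For reference, the idempotent-close pattern from the second PR looks roughly like this (my paraphrase of the approach, not the actual patch):

```go
package console

import (
	"sync"

	"golang.org/x/sys/unix"
)

// Epoller sketch: the raw epoll fd is only ever handed to unix.Close once,
// no matter how many times callers invoke Close.
type Epoller struct {
	efd       int
	closeOnce sync.Once
	closeErr  error
}

func (e *Epoller) Close() error {
	e.closeOnce.Do(func() {
		e.closeErr = unix.Close(e.efd)
	})
	return e.closeErr
}
```

Repeated Close calls then return the first result instead of re-closing a number that may since have been reused.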

kzys commented 4 years ago

The two PRs above have been merged and cherry-picked for containerd 1.3.x. So, we can close this issue once we upgrade containerd to either master, release/1.3, or the not-yet-released 1.3.3.