Closed: sipsma closed this issue 4 years ago
I've been looking into this more. While I've managed to reproduce it locally twice and have seen it in Buildkite again, it seems to be very rare. Have not managed to make it more reproducible.
As for what the cause could be, the error itself can be raised on one of these lines (from within the Go stdlib):
`unlink(2)` says that `EBADF` will be returned when `dirfd` is not a valid file descriptor.
I didn't search exhaustively and am not sure if the virtio drivers for the block device could be involved too, but this is the only place in the kernel where I can see `EBADF` returned for the `unlinkat` codepath. AFAICT it just means that the fd you provided to the syscall is not in your file descriptor table (i.e. you provided a wrong value, or you provided an fd that has already been closed).
Our code does not manage the FD though; we just provide a path as a string to `os.RemoveAll`, which internally opens the FD. I was thus wondering if it's possible the Go `removeAll` func could have a bug where the `os.File` wrapper for an FD value goes out of scope (which would invoke its finalizer, which calls `close` on the wrapped fd), but didn't see any possibility for that when reading the code.
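For illustration, the hazard of an fd being closed behind an `os.File`'s back can be demonstrated deterministically. This sketch (mine, not from the codebase) forces the second close explicitly via `Close` rather than waiting on the finalizer, but the resulting extra `close(2)` on a dead fd number is the same failure mode:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// closedBehindBack opens a temp file, closes its fd directly via the raw
// syscall (simulating other code managing the FD outside the os.File),
// then lets the os.File wrapper close it a second time.
func closedBehindBack() error {
	f, err := os.CreateTemp("", "fd-demo-*")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())

	// Close the underlying fd without telling the os.File about it.
	syscall.Close(int(f.Fd()))

	// The wrapper still thinks it owns the fd; its Close (or, equally, its
	// GC finalizer) issues a second close(2), which fails with EBADF.
	return f.Close()
}

func main() {
	fmt.Println(closedBehindBack())
}
```

If nothing had reopened that fd number in between, the second close is merely an error; if something had, it would silently close an unrelated file.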
My only other hunch is that our code or one of our dependencies could be managing FDs outside of an `os.File`. That would open two possibilities:

1. A raw FD value is also wrapped in an `os.File`. If `close` calls aren't managed carefully, you could end up closing the raw file descriptor and then, once the `os.File` is garbage collected, that same FD value getting closed again. The vsock code appears to manage its `close` calls very carefully and in a way that wouldn't result in this bug.
2. An FD is managed entirely raw (never wrapped in an `os.File`) and `close` gets called on it multiple times. After the first close call, it's possible something else opens a new file that gets assigned the same FD number, which means subsequent calls to close will be closing the wrong file. The containerd epollers look like a candidate for `close` getting called twice on the same fd.

I've been finding the above occurrences by building `agent` without the `-s` flag (which strips DWARF symbols), running `objdump -d agent` and then searching for `unix.Close` or other calls that could be tied into this. So far vsock and those containerd epollers are the only possibilities I've found.
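The reason a stray extra `close` is dangerous rather than merely noisy is the kernel's lowest-free-fd allocation rule, which makes fd-number reuse the common case, not a fluke. A small single-threaded demonstration (so the reuse is deterministic on Linux; illustrative sketch only):

```go
package main

import (
	"fmt"
	"syscall"
)

// fdReuse shows that open(2) hands out the lowest free fd number: after an
// fd is closed, the very next open receives the same number, so a stale
// second close() on the old number would now hit an unrelated file.
func fdReuse() (first, second int, err error) {
	first, err = syscall.Open("/dev/null", syscall.O_RDONLY, 0)
	if err != nil {
		return
	}
	syscall.Close(first) // first (legitimate) close

	second, err = syscall.Open("/dev/null", syscall.O_RDONLY, 0)
	if err != nil {
		return
	}
	defer syscall.Close(second)
	return first, second, nil
}

func main() {
	a, b, _ := fdReuse()
	fmt.Printf("old fd %d, new fd %d, reused: %v\n", a, b, a == b)
}
```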
It's also possible I'm on entirely the wrong track here, so the next step is going to be trying to reproduce the error with strace attached to `agent` (hopefully strace won't prevent it from happening...). I've set up a script to continuously run `TestMultipleVMs` with strace attached to `agent` over the weekend; hopefully we'll be able to catch the error happening, which should give a much better idea of what's causing it.
Did not manage to reproduce the exact error with strace yet (the run I started last week failed over the weekend due to my host accumulating `kdmflush` kernel processes and running out of pids...).
However, looking at the code again I've found something highly suspicious related to the containerd console code I mentioned in the previous post here. The code that runs on every task delete checks whether there are no tasks left and, if so, closes the platform, which in turn closes the console epoller, crucially using `unix.Close` directly on an FD. Even though the tests we run don't use any consoles, the initial epoller file is created anyway (verified with strace logs).
The problem here occurs when a container is created and deleted, leaving zero tasks (which closes the epoller for the first time), and then further containers are created and deleted, triggering the same close path again on the now-stale FD value.
I wrote a different test case similar to `TestMultipleVMs` that creates containers in the VM in serial rather than parallel, and can now see a ton of logs from strace like:
```
close(5) = -1 EBADF (Bad file descriptor)
```
where FD `5` was previously created by a call to `epoll_create`. `EBADF` being returned by `close` is ignored by the code though, so the test cases so far have still passed even when this occurs.
So, to summarize, the `agent` process is capable of calling `unix.Close` on the runc v2 shim's console epoll FD value multiple times if it creates+deletes containers in serial. In theory, if this were to occur at the same time a container is being deleted (resulting in a call to `os.RemoveAll`), the close of the console epoll FD value could actually refer to an FD opened within the implementation of `os.RemoveAll` and result in `unlinkat` returning `EBADF`.
Going to try to make this more reproducible to confirm the idea.
Okay, managed to reproduce this relatively quickly now by having a test case that spins up 100 VMs in parallel, first creates a single container in each VM, deletes it and then creates a bunch of containers in parallel. Hacky test code here: https://github.com/sipsma/firecracker-containerd/blob/d8f690056a1a05dc60d8d7a095b72ce7f6fb71c1/runtime/service_integ_test.go#L104
The strace output confirms what I described in the previous post (added comments explaining what's happening):
```
# the runc v2 shim's OOM (fd 3) and Console epoller (fd 5) are opened, respectively, during agent start
time="2019-12-17T00:54:43.038364226Z" level=debug msg="[ 4.940249] strace[852]: [pid 888] epoll_create1(EPOLL_CLOEXEC) = 3" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:54:43.041098250Z" level=debug msg="[ 4.943036] strace[852]: [pid 888] epoll_create1(EPOLL_CLOEXEC) = 5" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
...
# The console epoller gets closed for the first time after the first container is deleted
time="2019-12-17T00:55:06.733991516Z" level=debug msg="[ 28.627844] strace[852]: [pid 909] close(5) = 0" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
...
# A dirfd used internally by os.RemoveAll happens to end up as FD 5. A call to close the console epoller happens at the same time, resulting in the unlinkat call returning EBADF
time="2019-12-17T00:55:46.883608956Z" level=debug msg="[ 68.786881] strace[852]: [pid 1013] openat(AT_FDCWD, \"/container\", O_RDONLY|O_CLOEXEC <unfinished ...>" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.889037293Z" level=debug msg="[ 68.792390] strace[852]: [pid 1013] <... openat resumed> ) = 5" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.917471815Z" level=debug msg="[ 68.821590] strace[852]: [pid 902] close(5 <unfinished ...>" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.919968315Z" level=debug msg="[ 68.823746] strace[852]: [pid 1013] unlinkat(5, \"container-112-4\", 0 <unfinished ...>" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.922998860Z" level=debug msg="[ 68.826228] strace[852]: [pid 902] <... close resumed> ) = 0" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
time="2019-12-17T00:55:46.937295925Z" level=debug msg="[ 68.829208] strace[852]: [pid 1013] <... unlinkat resumed> ) = -1 EBADF (Bad file descriptor)" jailer=noop runtime=aws.firecracker vmID=112 vmm_stream=stdout
```
So pretty confident this is the issue (though it's always possible there are more bugs that could cause the same behavior of closing the wrong FD).
As for possible fixes, this seems like it should probably be considered a bug in containerd's runc v2 shim. However, we need to check with them as it's possible they explicitly did not want to handle this case where a shim stays alive after it has no containers left to manage and then starts managing more. If it turns out they explicitly don't want to handle this case, then we may need to look into other options such as updating agent to just maintain a separate shim service object per task.
Just an update, I've sent out PRs to fix this in containerd code on two levels:

1. Move the call to `platform.Close` to the Shutdown method of the runc v2 shim, ensuring platform functionality stays available until the shim is actually shutting down.
2. Wrap the `unix.Close` of the epoller FD in a `sync.Once`, ensuring that even if callers do call `platform.Close` multiple times, at least the FD re-use bug won't occur again.

The two PRs above have been merged and cherry-picked for containerd 1.3.x. So, we can close this issue once we upgrade containerd to either master, release/1.3 or the not-yet-released 1.3.3.
Saw an intermittent failure when deleting a task in `TestMultipleVMs` here (see the failed "runtime isolated tests"): https://buildkite.com/firecracker-microvm/firecracker-containerd/builds/1333#34210a80-f298-44d8-9ccf-1da67f783e5f

It seems to be pretty rare (it only happens to one exec every once in a while when running `TestMultipleVMs`). No strong idea what the issue is right away, but I will look into whether the cleanup callbacks are referencing a file descriptor that could somehow be closed (maybe the wrapping `os.File` is getting garbage collected?).