Closed fwaggle closed 4 years ago
https://github.com/lxc/lxd/pull/6560/commits/7199afba981ece28b40d5230e832307f3b3e0823 in https://github.com/lxc/lxd/pull/6560 handles this type of race. So we've literally written a fix for this accidentally earlier today :)
3.19 will have a completely rewritten storage layer so any existing storage bug will most likely be gone, possibly replaced by new, different bugs (as tends to happen when replacing such a large piece of code).
ACK, so should I leave this open, or close it and see if the behaviour shows up again in 3.19?
I'll close it when I merge 6560
Required information
Distribution: Ubuntu
Distribution version: Bionic 18.04
The output of "lxc info":
Issue description
I think this is probably two bugs, but I don't have any idea how to reproduce the first, so I'll just include it as it's important to the setup:
Occasionally, it seems an lxc delete <container> can fail. The ZFS dataset is destroyed, the only thing left is an empty dataset under snapshots, but the container remains present in LXD's database in the "STOPPED" state. In most cases a subsequent lxc delete <container> cleans things up without issues.
However lately we've had a further issue (the one this issue is about) where the further lxc delete <container> fails as well. I think this is because the dataset is destroyed, and unmounted, but LXD is dropping a backup.yml file in the directory for the container. I think (I have not checked the code) that LXD doesn't check if this directory is empty, it only checks if the dataset is unmounted, then tries to unlink the directory, which fails because it's not empty.
It'd be great if, until the former issue is tracked down (working on it), LXD gracefully handled this situation... because at the moment with this issue there's no way LXD can recover on its own and someone has to shell in, check everything is correct (the container really doesn't exist any more), then remove the file and re-issue the delete command.
Any ideas on how to track down the first issue would be appreciated too, but I'll keep trying to figure it out.
Steps to reproduce
I don't really have good steps to reproduce (can't work out how to get into the first situation or I'd file a bug for that too), but here's the flow on an affected server:
Information to attach
I don't think any of this information is relevant, there's no container logs or anything because the container is deleted. Let me know if that assumption is incorrect.