It's most likely some other process that's forked the mount table and is now preventing LXD from unmounting the container...
You can run grep containers/test1 /proc/*/mountinfo to find out what process that is.
You can then run nsenter -t <PID> -m -- umount /var/lib/lxd/storage-pools/lxd/containers/test1 to get rid of that mount, at which point lxc delete should work again...
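For reference, a minimal sketch of that sequence, using the paths above and a placeholder PID (whatever the grep actually turns up):
# find processes whose mount namespace still references the container
grep containers/test1 /proc/*/mountinfo
# enter the mount namespace of one such process (e.g. PID 1234) and drop the mount
nsenter -t 1234 -m -- umount /var/lib/lxd/storage-pools/lxd/containers/test1
# the delete should now go through
lxc delete test1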
You mean cat /proc/*/mountinfo | grep containers/test1 ?
No hits ...
lxc delete still reports dataset is busy :(
Edit: grepping for the other, running container results in lots of hits, so I think I have formatted that grep correctly. It seems there is nothing referencing test1 in /proc/*/mountinfo. Any further ideas? :)
Hmm, then it's not mounted anywhere visible which would likely make it a kernel bug... You can wait for a while hoping for the kernel to untangle whatever's going on or you can reboot the system which will fix it for sure...
Sorry I don't have a better answer for this.
@stgraber Oh dear Cthulhu, that's bad. That also explains why I'm seeing it across several systems, as I keep my servers in sync with regard to kernel/OS/package versions.
I just checked on one of my other systems, and there I have a dataset which I still cannot destroy even after 48+ hours. So it does not seem this will go away on its own. There it is also "invisible".
If you want access to the server to poke around a bit, Stéphane, let me know. Otherwise I guess I'll just have to mitigate this manually (sigh), update my kernels when I get the chance, and hope that resolves the issue.
PS: I am not used to seeing grep invoked that way; your command was of course correctly formatted, I just assumed it wasn't since I didn't get any hits #n00b
Should I report this as a bug somewhere, you think?
If others stumble on this issue: There is a workaround in that it is possible to rename the dataset. So if your container is stopped, you can do:
zfs rename lxd/containers/test1 lxd/containers/test1failed
After which you can issue
lxc delete test1
However you then still have this dataset hanging around, which you will need to clean up at a later date, i.e. after a reboot I suppose. This pretty much sucks! :D
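For completeness, a sketch of that later cleanup, assuming the rename above and that whatever was holding the dataset has since let go (e.g. after a reboot):
# destroy the parked dataset once it is no longer busy
zfs destroy lxd/containers/test1failed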
Yeah, that's pretty odd, I wonder what's keeping that active... You don't have any running zfs command for that dataset (zfs send, zfs snapshot, ...)?
Just run ps aux | grep test1 to be sure.
If not, then I'd be happy to take a look, see if anything stands out. There's a pretty good chance that it's a kernel bug, but we haven't seen reports of this before so it's intriguing.
(Note that I'm on vacation in Europe this week so not quite as around as usual :))
Nope no zfs commands running. I have sent you access by mail :) - enjoy your vacation..!!
Very weird. I poked around for a bit, eventually restarting the lxd process, which was apparently enough to get zfs to unstick, and I could then delete the dataset just fine.
Now that we know that kicking lxd apparently unsticks zfs, can you let me know if you have another machine with the same issue (or can cause the one I already have access to to run into it again)?
I'd like to see what LXD shows as open prior to being killed, then if just killing it is enough to make zfs happy and if not, then why would lxd starting again somehow unstick zfs.
FWIW, what I tried before restarting lxd was:
None of which showed anything relevant...
It could be an fd leak from a file that was read or written from the container by LXD and wasn't closed, but what's odd is that if that were the case, we should have seen an fd with the container path, and there were none... Hopefully I can look at another instance of this problem and figure that part out.
Marking incomplete for now, @Kramerican let me know when you have another affected system.
I sure will @stgraber I'm also on vacation and haven't had a chance to check if I can provoke this behaviour. I'll let you know.
@Kramerican still on vacation? :)
@stgraber Yes until next week - but I will set some time aside then to try and force this behavior.
@stgraber I have had little luck in forcing this behavior, but this has been cropping up all over the shop these last few days.
I had written some fallback code in our tools which simply renames the dataset, so that lxc delete could be run. These datasets are still "stuck" and zfs refuses to delete them. I have not restarted lxd in order to delete them - is it enough for you to get access to one of these systems to diagnose further? In which case let me know and I'll give you access. Thanks..!
@Kramerican yep, having access to one of the systems with such a stuck dataset should be enough to try to track down what LXD's got open that would explain the busy error.
@stgraber Excellent. Mail with details sent.
@Kramerican so the only potential issue I'm seeing is a very large number of mapped /config files, which is a leak that I believe has already been fixed with a combination of lxd and liblxc fixes. Any chance you can upgrade your systems to 3.0.1 of both liblxc1 and lxd? Both have been available for about a week now.
If it's an option at all, at least on your Ubuntu 18.04 systems, I'd recommend considering moving to the lxd snap (--channel=3.0/stable in your case) as that would get you a much faster turnaround for fixes than we can do with the deb package.
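For anyone taking that route, a rough sketch of the deb-to-snap move on 18.04 (the channel is the one suggested above; lxd.migrate walks you through moving the existing containers and data across and then retiring the deb):
snap install lxd --channel=3.0/stable
lxd.migrate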
@stgraber Excellent. However on the system where you have access, I have apt upgrade hanging at 95% at Setting up lxd (3.0.1-0ubuntu1~18.04.1) ...
/var/log/lxd/lxd.log shows an entry which I think is responsible:
lvl=eror msg="Failed to cleanly shutdown daemon: raft did not shutdown within 10s" t=2018-07-04T21:31:27+0200
Is raft a process I can kill? Suggestions on how to unstick the upgrade?
@stgraber Nevermind - it got unstuck after a while. Everything seems fine.
I will upgrade all systems and report back if the issue persists. Thanks..!
@Kramerican pretty sure it got unstuck because I logged in and ran systemctl stop lxd lxd.socket to unblock things. Looks like the RAFT database is hitting a timeout at startup.
It's actually a bug that 3.0.1 fixes but if your database has too many transactions prior to the upgrade, it still fails to start. The trick to unstick it is to temporarily move it to a very fast tmpfs which I'm doing on that system now.
@stgraber ah yes I saw that lxc ls and other commands were not working. I won't mess around on that system anymore until you report back.
A series of commands that would help unstick lxd would be nice to have here, in case I see this happen on one of the other ~10 hosts I need to upgrade.
@Kramerican all done, that system is good to go.
If you hit the same problem, you'll need to:
This will unstick the update. Once the update is all done, run again (for good measure):
That should ensure that LXD is fully stopped (containers are still running fine though). Once that's done, do:
You'll see the daemon start, let it run until it hits "Done updating instance types" which is when it'll be ready for normal operation, then hit ctrl+c to stop it. Once done, do:
And you'll be back online with the newly compacted and much faster database.
This is only needed on systems where LXD isn't able to load its database within the 10s timeout so hopefully a majority of your systems will not need this trick. Once LXD successfully starts once on 3.0.1, the database gets compacted automatically in the background as well as on exit to prevent this problem from ever occurring again.
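Purely as an illustration (not the exact commands run on that system), the tmpfs dance could look something like this on a deb-based install, assuming the database lives under /var/lib/lxd/database:
# stop LXD; containers keep running
systemctl stop lxd lxd.socket
# park the slow on-disk database and stand up a tmpfs in its place
mv /var/lib/lxd/database /var/lib/lxd/database.slow
mkdir /var/lib/lxd/database
mount -t tmpfs tmpfs /var/lib/lxd/database
cp -a /var/lib/lxd/database.slow/. /var/lib/lxd/database/
# run the daemon in the foreground; wait for "Done updating instance types", then ctrl+c
/usr/lib/lxd/lxd --group lxd
# put the now-compacted database back on disk and restart normally
cp -a /var/lib/lxd/database /var/lib/lxd/database.compacted
umount /var/lib/lxd/database
rmdir /var/lib/lxd/database
mv /var/lib/lxd/database.compacted /var/lib/lxd/database
systemctl start lxd
# keep /var/lib/lxd/database.slow around as a backup until everything checks out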
@stgraber This is pure epic. Thanks so much, I'll get started cracks knuckles :)
@Kramerican I've also deleted the two failed datasets on sisko, so restarting LXD did the trick to unstick zfs, now the question is whether we'll be seeing the issue re-appear with 3.0.1.
@stgraber ok so I completed the upgrade on all my hosts, only had to follow your steps here on one other system :+1:
In the process, however, it turned out that one of my systems was already up to date with 3.0.1, and there the failure with a stuck dataset happened today.
I have just sent a mail with details and access to the system.
@stgraber Did you see my last message yesterday, where I'd found that this had actually already happened on a 3.0.1 system? You should have access details in your inbox :)
@Kramerican Hi, I saw the comment and e-mail but haven't yet had time to look at that system.
@stgraber Can I assume that you are on track with this bug and it is in the process of being fixed? Let me know if you need the user account I set up for you, otherwise I'd like to nuke it from that system. Due diligence and all :)
@Kramerican Hi, unfortunately no, I haven't had time to look into this yet as I've been having management meetings this week with limited time to look into LXD bugs.
@stgraber That is all good big buddy - I will leave that user account for now, so you can dig into that system whenever it suits you best.
@stgraber This is an on-going issue across most hosts in our infrastructure. I now have quite a few hanging datasets lying around, as this keeps cropping up. All systems are running LXD 3.0.1 on Bionic at the moment.
Hope you can find time to give this another look one of these days :) Please let me know if you need me to re-send access credentials.
@stgraber Hi, I'm sorry for this, I have run into an issue and hope you can help me. When I type lxc list, I get this:
+----------------+---------+------+------+------------+-----------+
|      NAME      |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------+---------+------+------+------------+-----------+
| flowing-beagle | ERROR   |      |      | PERSISTENT |           |
+----------------+---------+------+------+------------+-----------+
| lxdmcj         | ERROR   |      |      | PERSISTENT |           |
+----------------+---------+------+------+------------+-----------+
And then, when I try to delete lxdmcj, it gives me this message: error: cannot open 'mcj-lxd-zfs': dataset does not exist
@ma3252788 that's unrelated to the issue we're investigating here. Your error suggests that your entire zpool is somehow offline. Check with zpool status to see if it's offline due to data corruption or if it's entirely missing from ZFS (not imported).
I have run into the same problem on two different systems recently:
Failed to destroy ZFS filesystem: cannot destroy 'z/lxd/containers/otp1': dataset is busy
I'm on snap lxd version 3.6, rev 9298.
snap list: lxd 3.6 9298 stable canonical✓ -
/etc/lsb-release: DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
uname -a: Linux star 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I did the mountinfo search and did a ps -f on the resulting pids, which can be done in one line like this:
ps -f -p `grep containers/otp1 /proc/*/mountinfo | sed 's-^/proc/--' | sed 's-/.*--' | tr '\n' ',' | sed 's/,$//'`
The result was 26 processes of the kind:
root 560 1 0 Oct15 ? 00:00:00 [lxc monitor] /var/snap/lxd/common/lxd/containers fil
(where fil is a running container) plus this process:
root 2356 1 0 Oct15 ? 00:00:05 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
The system has been running for 10 days, so Oct15 must be around boot time.
For the lxc monitor processes, I compared the 26 containers that I saw in ps with the list of the 31 running containers. The 5 missing containers were containers that I created yesterday or today. Two were new, 1 was renamed (lxc move), 1 was copied (lxc copy), and the 5th was the container that was copied from.
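For what it's worth, an equivalent form of the mountinfo-to-ps one-liner above, using grep -l to pull the PIDs directly instead of the sed chain (same result, just easier to read):
ps -f -p "$(grep -l containers/otp1 /proc/*/mountinfo | cut -d/ -f3 | paste -sd, -)"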
In both cases, the container that cannot be deleted is a container that I've been cloning frequently from another container, cycling through this pattern:
lxc copy --container-only otp otp1
lxc profile apply otp1 profile1,profile2
lxc start otp1
... use otp1 for a while
lxc stop otp1
lxc delete otp1
@melato the issue from @Kramerican is a bit different because in his case he has no actual references being held for the mount.
Your case is somewhat more common: effectively the container was already mounted by the time lxcfs started (possibly because lxcfs crashed and restarted?), which meant lxcfs would get a copy of the mount into its mount namespace, holding a reference to it.
To unblock things, what you can do is use nsenter -t PID -m against the processes that you found have mount references, then unmount the path through that shell; once all references are gone, delete should work fine.
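Spelled out, a minimal sketch of that unblock, assuming the snap paths and the lxcfs PID (2356) from the ps output above and a storage pool named "default" (adjust both to whatever your mountinfo grep actually shows):
# enter the mount namespace of the process holding the reference
nsenter -t 2356 -m
# inside that shell, unmount the stale container path listed in that process's mountinfo
umount /var/snap/lxd/common/lxd/storage-pools/default/containers/otp1
exit
# with all references gone, the delete should now succeed
lxc delete otp1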
@brauner for the lxcfs part of this, any chance we can make lxcfs clean up its mount namespace so it doesn't hold those refs?
The lxc monitor side is a bit harder because it's not itself maintaining a mount namespace; instead it's using whatever mount namespace LXD was running with at the time, effectively holding onto an old namespace as LXD gets refreshed. I've not yet found a good way to avoid this issue; some of the options are:
- have [lxc monitor] use an intermediate mount namespace that it cleans up after initialization to only reference what it absolutely needs

I was able to delete the container after doing the nsenter -m suggestion for the lxcfs process.
I can confirm the bug that @melato has discovered recently. In my case it's a cloned container too. The nsenter workaround fixed it.
I've hit this "Failed to destroy ZFS filesystem" bug on LXD 3.2 on Alpine 3.8.1 as well. I'll try @stgraber's workarounds & report back
It appears that my container may have gotten into this state because the underlying image expired & was removed from the image cache. ZFS was in a weird state where it said the dataset was busy, but it certainly wasn't mounted under anything accessible on the host.
I'll make a local copy of the image (debian stretch) before launch so hopefully the image won't expire (I need to check into that).
Luckily this system is not a critical production box, so a simple fix to delete the container was:
I'll chime in if it happens again...
BTW this is Linux 4.14, ZFS v0.7.8-1, lxd 3.2, lxcfs 3.0.1, lxc 2.1.1
(so legacy cgroups mode)
For the record, after much wrangling, I've gotten LXC 3.0.2 to compile on Alpine. Haven't had a chance to test it yet.
> If others stumble on this issue: There is a workaround in that it is possible to rename the dataset. So if your container is stopped, you can do:
> zfs rename lxd/containers/test1 lxd/containers/test1failed
> After which you can issue
> lxc delete test1
> However you then still have this dataset hanging around, which you will need to clean up at a later date, i.e. after a reboot I suppose. This pretty much sucks! :D
In my case your workaround works perfectly, and I am able to destroy the dataset without a reboot after renaming it and deleting the lxd container.
Just ran into this issue running LXD 3.10 (snap), which restarted on February 12, one day after the package was published on snapcraft.io. From lxd.log:
LXD 3.10 is starting in normal mode" path=/var/snap/lxd/common/lxd
[...]
Initializing global database"
Updating the LXD global schema. Backup made as \"global.bak\"
The problem is indeed bound to the 10s timeout, as it takes exactly 10s for lxc to print the "dataset is busy" error. Comment https://github.com/lxc/lxd/issues/4656#issuecomment-402550930 made me think that lxd >= 3.0.1 was able to compact the database in the background to avoid this issue, but apparently this is not the case, or the cause of the problem is different.
The machine where I'm experiencing the issue has 27 containers, but it is not heavily loaded. Containers could be removed reliably up to a few days ago. Is there any other useful information I can let you have? Thanks!
Cc: @blackboxsw
@paride your issue is likely a different one and just has to do with snapd generating new mount namespaces which hold up mount table entries for ZFS.
For one such un-deletable stopped container, please do:
This will hopefully show you some PIDs; those are the PIDs that are in a mount namespace where this container is still mounted.
You can then unblock this issue by doing:
Doing this for any of the returned PIDs will usually make all the mounts go away for all the other PIDs, as they're likely to share the same mount namespace.
We landed code in the LXD snap a while back to avoid such issues by doing some very complicated dance in the various mount tables, this usually worked well, except that the most recent snapd release had an upgrade bug which made it so it would hide the old mount table, making our normal mitigation useless for some users... As far as I can tell, this bug got resolved in snapd so this shouldn't happen again with the next snapd release (crosses fingers)...
Thanks @stgraber, I did try grepping /proc/*/mountinfo but nothing there matches the name of the container I'm trying to delete:
$ lxc delete paride-ubuntu-core-16
Error: Failed to destroy ZFS filesystem: cannot destroy 'zfs-lxd/containers/paride-ubuntu-core-16': dataset is busy
$ grep container/paride-ubuntu-core-16 /proc/*/mountinfo
$
@paride oh, oops, try containers/paride-ubuntu-core-18
@stgraber thanks, it worked.
Minty fresh Ubuntu 18.04 system, LXD v3.0.0 (latest from apt; how do I get v3.0.1?)
Started seeing this beginning last week crop up arbitrarily across my infrastructure. Out of ~10 delete operations, I have seen this happen to 3 containers on 2 different systems.
Tried googling around a bit and I have tried the most common tips for figuring out what might be keeping the dataset busy: there are no snapshots or dependencies, and the dataset is unmounted according to zfs list.
Could LXD still be holding the dataset? I see there are a number of zfs-related fixes in v3.0.1 but I cannot do an apt upgrade to that version..?
Edit: issuing systemctl restart lxd does not resolve the issue, so maybe it's not lxd after all. Strange...
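For anyone landing here with the same symptoms, a few diagnostics that can back up the "no snapshots or dependencies" check (dataset name taken from earlier in the thread; adjust to your own pool and container):
# list any snapshots under the dataset (clones and holds hang off snapshots)
zfs list -t snapshot -r lxd/containers/test1
# check whether the dataset itself is a clone and whether ZFS thinks it is mounted
zfs get origin,mounted lxd/containers/test1
# for each snapshot found above, check for user holds
zfs holds lxd/containers/test1@<snapshot>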