Closed: janmuennich closed this issue 2 months ago.
@simondeziel please can you see if this is still occurring on 5.0/edge?
@simondeziel did you get a chance to test this one yet?
If it's an issue I'd like to get the fix into LXD 5.0.3.
@simondeziel did you get a chance to test this one yet?
I'll get back to you by EOD.
I'm afraid this will require some more time as I first need to get a testing environment using Ceph as storage backend.
@tomponline using 5.0/stable (!= 5.0/edge) with the same snap revision (24322) I could not reproduce the issue.
@janmuennich here's what I did to try and reproduce your issue:
I set up a physical machine and created 3 VMs, each with 8 cores, 40G of RAM and an extra 75G disk. I set up microceph to use the extra disk in each VM, then clustered LXD. From there, I created 3 instances that were automatically spread across the cluster members. I then did a rolling reboot: first `lxc cluster evacuate`, then a reboot, then `lxc cluster restore` to bring the member back.
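Roughly, the rolling reboot of each member looked like this (a sketch; `node1` is a placeholder member name):

```sh
# run from any machine with the lxc client pointed at the cluster
lxc cluster evacuate node1   # migrates/stops the instances hosted on node1
# ... reboot node1 itself ...
lxc cluster restore node1    # moves the evacuated instances back onto node1
```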
LXD moved the instances off the node being evacuated and back when it was restored.
Failing to reproduce the issue, I then tried a more aggressive scenario where I killed one of the cluster members, causing its instance to go into the ERROR state. I then used `lxc move --target` to move the instance to one of the 2 remaining cluster nodes and started it. After that, I started the killed cluster member and even then it behaved fine and didn't attempt to start the instance that was migrated off of it while it was dead.
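That second scenario boiled down to something like this (again a sketch, with placeholder instance/member names):

```sh
# node1 was hard-killed, so its instance c1 shows up in ERROR state
lxc move c1 --target node2   # relocate the instance to a surviving member
lxc start c1
# power node1 back on; it should not try to start c1 a second time
```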
Thanks for trying to reproduce the issue!
Just to clarify, I didn't use `lxc cluster evacuate` but used our own script that moves all instances from the machine to a spare one with `lxc move`. So the affected machine was completely empty of any instances and had an up-to-date database.
This is the procedure we use regularly and the issue occurred only once that time. So I guess it's not easy to reproduce :(
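Conceptually the script does something like this (a simplified sketch; `node1` and `spare1` are placeholder names and the real script also handles stopping/starting as needed):

```sh
# move every instance currently located on node1 over to the spare member spare1
lxc list -c nL --format csv | awk -F, '$2 == "node1" {print $1}' | while read -r inst; do
    lxc move "$inst" --target spare1
done
```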
@janmuennich you replied just before I tore down my test env so I was able to test without `lxc cluster evacuate`, but sure enough, my few attempts didn't reproduce the issue :/
It probably makes sense then to close this issue for now until someone else encounters the same. We did set `lxc profile set default boot.autostart false` though, to prevent this from happening again in production.
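(For what it's worth, a quick way to confirm the setting reaches instances, with `<instance>` as a placeholder:)

```sh
lxc profile set default boot.autostart false
# the expanded config merges profile keys, so the value should show up here
lxc config show <instance> --expanded | grep boot.autostart
```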
I may have also hit this issue on LXD 5.11. What happened: because of a mistake on my part, one node had its snap and LXD removed. After I reinstalled them, that node tried to start all instances in the cluster. After that, many instances' rootfs were broken (including ones that were already running elsewhere). I checked my lxc config, but boot.autostart is not set. About 1/5 of my instances' rootfs (ext4) are broken. However, I don't know whether this is the root cause of the broken rootfs; I just suspect it.
We'll close this issue as we were unsuccessful in reproducing it. If someone has a reliable reproducer or anything that could help debug it, please re-open or create a new one. Thanks.
Required information
Issue description
For a routine reboot of a cluster member, all containers on that member were moved to another cluster member. After the reboot, the server tried to start all containers in the cluster alphabetically, even though they were already running on other cluster members. After starting about 10 containers, LXD had a panic.
After another reboot, everything was running fine again with no attempt to start any containers.
The issue is critical, since the Ceph volumes were double-mounted, which resulted in corrupted file systems (although I was able to restore recent snapshots).
How could this happen?
LXD didn't log anything relevant in `lxd.log`. Output from `syslog`: