cpaelzer opened 4 months ago
Does 5.21/stable work for you?
We test on Noble (https://github.com/canonical/lxd-ci/actions/runs/9986795035) before releasing and daily.
This is the problem line:
time="2024-07-18T09:40:31+02:00" level=error msg="Unable to run feature checks during QEMU initialization: open /tmp/1373261747: no such file or directory"
I would think it was this call that is failing:
Suggesting that LXD doesn't have access to /tmp inside the snap's mount namespace.
I actually found that launching a VM will only re-show the message but not re-probe and re-trigger the underlying issue. However, running sudo systemctl restart snap.lxd.daemon will make it re-probe and re-fail, according to the logs below.
This is expected behaviour: feature checks are only done at startup, not on every launch/start.
Just tried a reproducer here:
lxc launch ubuntu-daily:noble v1 --vm -c limits.memory=2GiB
lxc exec v1 -- snap install lxd --channel=latest/stable
lxc exec v1 -- uname -a
Linux v1 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 10:49:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
lxc exec v1 -- snap list
Name Version Rev Tracking Publisher Notes
core22 20240408 1380 latest/stable canonical✓ base
lxd 6.1-0d4d89b 29469 latest/stable canonical✓ -
snapd 2.63 21759 latest/stable canonical✓ snapd
lxc exec v1 -- lxd init --auto
lxc exec v1 -- lxc launch ubuntu-daily:j j-vm --ephemeral --vm
lxc exec v1 -- lxc list
+------+---------+------------------------+------------------------------------------------+-----------------------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+---------+------------------------+------------------------------------------------+-----------------------------+-----------+
| j-vm | RUNNING | 10.102.95.156 (enp5s0) | fd42:7bde:aa78:7f4:216:3eff:fe3a:f81e (enp5s0) | VIRTUAL-MACHINE (EPHEMERAL) | 0 |
+------+---------+------------------------+------------------------------------------------+-----------------------------+-----------+
So it looks to be working in general, but something is different on your host that prevents /tmp from being accessible at LXD start time.
Does this occur every time you reload the snap?
Did you try doing snap stop lxd, then snap start lxd?
Joint debugging session, thanks @tomponline
Checking /tmp in the namespace
$ sudo nsenter --mount=/run/snapd/ns/lxd.mnt --
$ touch /tmp/foo
touch: cannot touch '/tmp/foo': No such file or directory
Interesting, that is the same issue
$ findmnt /tmp/
TARGET SOURCE FSTYPE OPTIONS
/tmp /dev/nvme0n1p5[/tmp] ext4 rw,relatime,errors=remount-ro
/tmp /dev/nvme0n1p5[/tmp/snap-private-tmp/snap.lxd/tmp//deleted] ext4 rw,relatime,errors=remount-ro
So it is deleted, but why?
So it is deleted from the namespace POV
ll /tmp/snap-private-tmp
total 52
drwx------ 8 root root 4096 Jul 1 08:32 ./
drwxrwxrwt 40 root root 20480 Jul 18 11:02 ../
drwx------ 3 root root 4096 May 21 08:11 snap.canonical-livepatch/
drwx------ 3 root root 4096 May 23 20:51 snap.element-desktop/
drwx------ 3 root root 4096 May 23 20:51 snap.firefox/
drwx------ 3 root root 4096 May 21 08:11 snap.ncspot/
drwx------ 3 root root 4096 Jun 27 08:50 snap.ppa-dev-tools/
drwx------ 3 root root 4096 Jun 19 15:06 snap.ustriage/
And indeed, on the host it is not there.
Start/stopping to reset:
sudo snap stop lxd + sudo snap start lxd
but that did not set up the paths.
Question for now - shouldn't snapd set those up?
Things that do not re-establish the mount:
sudo snap stop lxd + sudo snap start lxd
sudo systemctl restart snapd.service
What works is manually restoring the mount
cd /tmp/snap-private-tmp/
cp -a snap.ustriage snap.lxd
sudo nsenter --mount=/run/snapd/ns/lxd.mnt --
umount /tmp/
mount -o bind /tmp/snap-private-tmp/snap.lxd/tmp/ /tmp/
That gets us back to
$ findmnt /tmp/
TARGET SOURCE FSTYPE OPTIONS
/tmp /dev/nvme0n1p5[/tmp] ext4 rw,relatime,errors=remount-ro
/tmp /dev/nvme0n1p5[/tmp/snap-private-tmp/snap.lxd/tmp] ext4 rw,relatime,errors=remount-ro
And from there we can run sudo systemctl reload snap.lxd.daemon.service to re-check capabilities, and now guest VMs can be started again.
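For reference, a consolidated sketch of that manual workaround, assuming only the snap.lxd private tmp directory went missing (created here with mkdir instead of copying another snap's directory; permissions are an assumption based on the other snap.* entries and normal /tmp):
# on the host: recreate the missing per-snap private tmp directory
sudo mkdir -p /tmp/snap-private-tmp/snap.lxd/tmp
sudo chmod 0700 /tmp/snap-private-tmp/snap.lxd
sudo chmod 1777 /tmp/snap-private-tmp/snap.lxd/tmp   # assumed to mirror normal /tmp permissions
# re-establish the /tmp bind mount inside LXD's mount namespace
sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- sh -c 'umount /tmp && mount -o bind /tmp/snap-private-tmp/snap.lxd/tmp /tmp'
# have LXD re-run its feature checks against the restored /tmp
sudo systemctl reload snap.lxd.daemon.service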
Update from the future (to have everything in one place): later discussions showed that the following restores the system:
snap disable lxd
snap enable lxd
Mystery: what lost it in the first place?
To record the things we checked due to later discussions in the snappy channel:
We've been advised that snap disable lxd might have allowed us to remove the old mount.
Maybe same issue? #13746
Yes, that looks very similar, and so we now also know it's not a 6.1 issue, thanks.
Using snap disable lxd followed by snap enable lxd after recreating the missing directory was confirmed to fix the snap mount in https://github.com/canonical/lxd/issues/13746#issuecomment-2237539249
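For anyone hitting the same state, a minimal sketch of that recovery, assuming the snap.lxd private tmp directory was removed on the host:
# recreate the missing per-snap private tmp directory on the host
sudo mkdir -p /tmp/snap-private-tmp/snap.lxd/tmp
# disable and re-enable the snap so its mount namespace is discarded and rebuilt
sudo snap disable lxd
sudo snap enable lxd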
Required information
Issue description
Starting a VM no longer works; what I get is this:
This has worked multiple times throughout the week and before, but it stopped working all of a sudden. The same kernel and environment had been working for almost two months.
Steps to reproduce
Start a VM like
lxc launch ubuntu-daily:j j-vm --ephemeral --vm
I actually found that launching a VM will only re-show the message but not re-probe and re-trigger the underlying issue. However, running sudo systemctl restart snap.lxd.daemon will make it re-probe and re-fail, according to the logs below.
Information to attach
I've done some debugging to try to help you to help me :-)
My first suspicion is an unattended snap upgrade, and indeed I see:
Apt upgrades should not influence this much, but for completeness, here is the list of what happened there in the last two days, since it was still working back then.
I've found in discussions that you'd usually check for kvm and vsock devices, but that looks good to me.
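For completeness, roughly the checks that were meant here (device names as commonly used for KVM and vsock; adjust if your setup differs):
# KVM and vsock device nodes should exist on the host
ls -l /dev/kvm /dev/vsock /dev/vhost-vsock
# and the corresponding kernel modules should be loaded
lsmod | grep -E 'kvm|vsock'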
Furthermore, I ran the kernel check.
So the issue seems to be more that some call to QEMU to check capabilities fails, and below we see that it has some trouble with tmp directories.
In the LXD log I see:
Dmesg at the time shows nothing, as it bails out before trying to start. But restarting LXD will trigger re-probing, causing the same failure again.
While doing that, I see dmesg output that seems related to LXD, but no new AppArmor denial, for example. Most of the output is the recycling of the containers that I still have up.
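For reference, the kind of check used to rule out AppArmor, as a generic sketch rather than anything LXD-specific:
# look for recent AppArmor denials in the kernel log
sudo dmesg -T | grep -i 'apparmor.*denied'
# or via journald
sudo journalctl -k | grep -i 'apparmor.*denied'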
lxc info NAME --show-log and lxc config show NAME --expanded do not apply, as the container never gets to exist.
Output of the client with --debug
Output of lxc monitor while reproducing the issue
As mentioned, I was suspicious of the recent auto-upgrade, so I looked at these:
So going one version back didn't work.