simondeziel closed this issue 1 year ago.
What storage pool is this? I've not been able to recreate it on ZFS loop-backed on NVMe.
If I run `lxc stop -f t` in a loop in one window and then launch a fresh container `t` in another, I see:
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: Not Found
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: Instance is busy running a "start" operation
Try `lxc info --show-log t` for more info
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Error: The instance is already stopped
Which is what I would expect to see.
It's on a zpool backed by an NVMe partition.
$ lxc storage list
+---------+--------+---------+-------------+---------+---------+
| NAME | DRIVER | SOURCE | DESCRIPTION | USED BY | STATE |
+---------+--------+---------+-------------+---------+---------+
| default | zfs | default | | 13 | CREATED |
+---------+--------+---------+-------------+---------+---------+
$ lxc storage show default
config:
source: default
volatile.initial_source: /dev/nvme0n1p6
zfs.pool_name: default
description: ""
name: default
driver: zfs
used_by:
- /1.0/images/39c55b257dc40e812cc1431e04eb1c883b03c349b67fbffb3300d76e42c36176
- /1.0/images/45e0432a28176ae29a29869cd667f87f633c00ef819804ecfb721f82c5acc995?project=charmcraft
- /1.0/images/6abc292fe3f02ad2218127926cda5dba31c6f2f0f21fc929157a2c7051c16f17
- /1.0/images/ae2a19ee7bc7f3b21cf7e529c51c5138ae1b99cbd0186265726c4265c1827303
...
- /1.0/profiles/default
- /1.0/profiles/default?project=charmcraft
status: Created
locations:
- none
It must be extremely fast, as I cannot reproduce it on my NVMe ZFS partition either :(
Extremely fast ought to be relative, as the machine is from 2016 ;)
That's new ;)
I can theorise that there is a very small race potential here, between the operation checks called from Start():
https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/driver_lxc.go#L2355-L2369
and Stop():
https://github.com/lxc/lxd/blob/master/lxd/instance/drivers/driver_lxc.go#L2614-L2624
such that Start and Stop both think there is no existing operation and both call Create. Although the lock in Create should still catch it, which is what I see in my tests:
Error: Instance is busy running a "start" operation
I'm happy to run any debug binary you feed me or give you SSH access (to my museum piece) if that's more convenient ;)
Can you get me the debug log output when you see this happening, please?
Not sure it is related, as all the attention here is on storage. It sounds strange, but in my case, stopping or deleting an instance immediately after creation failed while the instance was still struggling to obtain an IP from the DHCP server.
I cannot reproduce after `snap set lxd daemon.debug=true` and `systemctl restart snap.lxd.daemon` :/
After unsetting `daemon.debug` plus an LXD restart, I can no longer reproduce it.
Whatever it was, I can no longer reproduce it. I will reopen if it happens again. Thanks, Tom.
I just got into a state where this happens consistently. This happened after I was creating => starting => destroying containers in an automated test suite. I'm not sure what gets me into this state, but restarting the LXD daemon "fixes" this issue.
$> lxc launch testing testing && lxc delete -f testing
Creating testing
Starting testing
Error: Stopping the instance failed: Failed unmounting instance: In use
It will show this error, but the instance will still be deleted.
Sometimes I can see this error in the logs; maybe it's of use:
Every 1.0s: lxc info testing --show-log balance: Thu Sep 8 23:03:02 2022
Name: testing
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/09/08 23:02 CEST
Last Used: 2022/09/08 23:02 CEST
Log:
lxc 20220908210259.562 ERROR af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20220908210259.562 ERROR commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_state"
I'm running Arch with LXD community/lxd 5.5-1 with a `dir` storage pool.
I've also noticed that LXD is not cleaning up network interfaces anymore when I get into this state.
Annoyingly, I still haven't found a reliable way to get into this state. It seems to be related to calls to the /1.0/instances/{name}/state endpoint, but I'm not sure. I will investigate further tomorrow.
I can now reproduce this. It seems to be caused by starting an instance while at the same time executing some other operation on it.
I'm testing this with a default profile that only contains a root disk.
This executes successfully:
~ lxc init images:alpine/edge c && lxc start c; lxc exec c hostname; lxc delete -f c
Creating c
The instance you are starting doesn't have any network attached to it.
To create a new network, use: lxc network create
To attach a network to an instance, use: lxc network attach
c
But if I execute:
~ lxc init images:alpine/edge c && lxc start c & lxc exec c hostname; lxc delete -f c
Creating c
The instance you are starting doesn't have any network attached to it.
To create a new network, use: lxc network create
To attach a network to an instance, use: lxc network attach
[1] 2384
Error: Instance is not running
~ Error: Failed to run: /usr/bin/lxd forkstart c /var/lib/lxd/containers /var/log/lxd/c/lxc.conf: exit status 1
Try `lxc info --show-log c` for more info
[1] + 2384 exit 1 lxc start c
Then after this, when I execute the first command, I get this `Failed unmounting instance: In use` error:
~ lxc init images:alpine/edge c && lxc start c; lxc exec c hostname; lxc delete -f c
Creating c
The instance you are starting doesn't have any network attached to it.
To create a new network, use: lxc network create
To attach a network to an instance, use: lxc network attach
c
Error: Stopping the instance failed: Failed unmounting instance: In use
The `c` container's logs:
Name: c
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/09/09 16:09 CEST
Last Used: 2022/09/09 16:09 CEST
Log:
lxc c 20220909140914.685 WARN attach - ../src/lxc/attach.c:get_attach_context:477 - No security context received
That's helpful, thanks, and would explain why the mount is being left over, as it would be triggered by `exec`
when the instance wasn't running. I'll try to reproduce this and fix it.
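The leftover-mount idea can be sketched as reference-counted mount accounting where one user (the failed `exec`) never drops its reference. This is a hedged illustration under that assumption; `mountTracker` and its methods are hypothetical, not LXD's real storage API:

```go
package main

import (
	"errors"
	"fmt"
)

// mountTracker models reference-counted instance mounts: every code path
// that mounts the instance's root (start, exec, ...) bumps the count and
// is expected to drop it again when done.
type mountTracker struct {
	refs map[string]int
}

var errInUse = errors.New("Failed unmounting instance: In use")

// Mount records one more user of the instance's mount.
func (m *mountTracker) Mount(inst string) {
	m.refs[inst]++
}

// Unmount drops one reference; it refuses to fully unmount while another
// user still holds the mount.
func (m *mountTracker) Unmount(inst string) error {
	if m.refs[inst] > 1 {
		m.refs[inst]--
		return errInUse
	}
	delete(m.refs, inst)
	return nil
}

func main() {
	m := &mountTracker{refs: map[string]int{}}

	m.Mount("c") // instance start mounts the root
	m.Mount("c") // racing exec also mounts, then errors out without unmounting

	// delete -f later tries to unmount: the leaked reference makes it fail.
	fmt.Println(m.Unmount("c")) // Failed unmounting instance: In use
}
```

Notably, in this sketch the failed unmount still drops the leaked reference, so a second attempt succeeds, which would line up with the report that a subsequent `lxc stop` or `lxc delete` works.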
Working on this now.
Is this reproducible on LXD 5.6 or LXD 5.0.1?
> I'm running Arch with LXD community/lxd 5.5-1 with a `dir` storage pool.
Ah just seen.
OK I've reproduced this now.
Required information
On Ubuntu 20.04.4 with LXD from snap (4.24 rev 22674)
Issue description
Creating a container and immediately stopping or deleting it fails:
Similar issue when deleting:
After such a failure, a subsequent `lxc stop` or `lxc delete` works.