edlerd opened this issue 1 month ago
This seems expected: the container started OK, but then wasn't able to boot due to the process limit and stopped.
I think that is not true; the start operation fails:
edlerd@EDLERD-T14:~$ lxc start robust-haddock ; lxc operation list
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1
Try `lxc info --show-log robust-haddock` for more info
+--------------------------------------+------+-------------------+---------+------------+----------------------+
| ID | TYPE | DESCRIPTION | STATUS | CANCELABLE | CREATED |
+--------------------------------------+------+-------------------+---------+------------+----------------------+
| e70b39e2-19f8-4ba9-94b7-a5e295f6c854 | TASK | Starting instance | FAILURE | NO | 2024/09/04 13:27 UTC |
+--------------------------------------+------+-------------------+---------+------------+----------------------+
edlerd@EDLERD-T14:~$ lxc operation show e70b39e2-19f8-4ba9-94b7-a5e295f6c854
id: e70b39e2-19f8-4ba9-94b7-a5e295f6c854
class: task
description: Starting instance
created_at: 2024-09-04T15:27:56.555845781+02:00
updated_at: 2024-09-04T15:27:56.555845781+02:00
status: Failure
status_code: 400
resources:
instances:
- /1.0/instances/robust-haddock
metadata: {}
may_cancel: false
err: 'Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers
/var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1'
location: none
Yeah, it's a bit tricky, because while a start operation is running LXD reports the container as running, so it only reports it as stopped once the start operation fails.
This is similar to https://github.com/canonical/lxd/issues/13453 (although a different situation) in that the running status of an instance can change during the invocation of a single start/stop command.
I think it might be better to update the status of the instance at the end of an operation, contrary to the current implementation, which changes the status when the operation gets created. In my mind that should solve both issues.
Another approach is to introduce "starting", "stopping", etc. statuses. At the beginning of the operation the status is set to such an in-flux status. Once the operation finishes, the status is set to "stopped" or "running", depending on whether the operation failed or succeeded.
What do you think?
The "...ing" statuses isn't a new idea, see https://github.com/canonical/lxd/issues/10625 and has its own complexities.
> I think it might be better to update the status of the instance at the end of an operation, contrary to the current implementation, which changes the status when the operation gets created. In my mind that should solve both issues.
One thing to note here is that there isn't a DB entry for instance status; it's always calculated live, so there is no such thing as "updating the status".
We can see the current logic here:
https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_lxc.go#L8171-L8200
Interestingly this part:
https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_lxc.go#L8175-L8177
Seems to suggest that if the operationlock (not the same as the API operation) is in the state of starting, then the status is returned as stopped (and vice versa for stopping) - which is similar to what you suggest.
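Paraphrasing that check as a sketch (this is not the real code from driver_lxc.go; it only illustrates the mapping described above: a held start lock means the instance is reported as Stopped, a held stop lock means it is reported as Running):

```go
package main

import (
	"fmt"

	"github.com/canonical/lxd/shared/api"
)

// statusDuringOperationLock is a simplified illustration of the behaviour
// described above, not the real LXD implementation: while a "start"
// operation lock is held the instance is still reported as Stopped, and
// while a "stop" lock is held it is still reported as Running, regardless
// of what the container looks like at that instant.
func statusDuringOperationLock(lockAction string, liveStatus api.StatusCode) api.StatusCode {
	switch lockAction {
	case "start":
		return api.Stopped // the start has not completed yet
	case "stop":
		return api.Running // the stop has not completed yet
	default:
		return liveStatus
	}
}

func main() {
	fmt.Println(statusDuringOperationLock("start", api.Running)) // Stopped
	fmt.Println(statusDuringOperationLock("stop", api.Stopped))  // Running
}
```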
So perhaps the start operationlock is completing, the LXC container is effectively started, lxc list reports it as running, and then it stops with a failure before the API operation completes.
Will need to look into it in more detail...
Thanks for the code pointers, I reproduced this with debug logging. See below for details; I marked the lock lines with a >.
I think the start operation lock is in place while the instance is being started. On the start error, that lock is released, and another lock gets created shortly after for marking the instance as stopped. I think this 2nd lock is responsible for the instance being reported as "running" (keeping in mind the logic you pointed at above). We can see the 2nd lock takes 1 second until it gets released; I assume this is exactly the window in which the status is reported as running.
This 2nd lock should probably carry special semantics marking it as "instance start error handling", so the logic you pointed at can correctly calculate the status as "stopped" in this case.
Likewise, on an instance stop error, the logic could get special semantics from the lock that handles the stop error and thus report the instance as "running" if an error occurs on stop. (A rough sketch of this idea follows after the log below.)
DEBUG [2024-09-05T12:07:34+02:00] Start started instance=u4 instanceType=container project=default stateful=false
INFO [2024-09-05T12:07:34+02:00] Starting instance action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"
>DEBUG [2024-09-05T12:07:34+02:00] Instance operation lock created action=start instance=u4 project=default reusable=false
DEBUG [2024-09-05T12:07:34+02:00] Handling API request ip=@ method=GET protocol=unix url=/1.0/operations/734822c8-2785-4708-b3b6-15ec87061638 username=edlerd
DEBUG [2024-09-05T12:07:34+02:00] MountInstance started driver=dir instance=u4 pool=local-storage project=default
DEBUG [2024-09-05T12:07:34+02:00] MountInstance finished driver=dir instance=u4 pool=local-storage project=default
DEBUG [2024-09-05T12:07:34+02:00] Starting device device=root instance=u4 instanceType=container project=default type=disk
DEBUG [2024-09-05T12:07:34+02:00] UpdateInstanceBackupFile started driver=dir instance=u4 pool=local-storage project=default
DEBUG [2024-09-05T12:07:34+02:00] Skipping unmount as in use driver=dir pool=local-storage refCount=1 volName=u4
DEBUG [2024-09-05T12:07:34+02:00] UpdateInstanceBackupFile finished driver=dir instance=u4 pool=local-storage project=default
DEBUG [2024-09-05T12:07:34+02:00] Handling API request ip=@ method=GET protocol=unix url="/internal/containers/u4/onstart?project=default" username=root
DEBUG [2024-09-05T12:07:34+02:00] Start finished instance=u4 instanceType=container project=default stateful=false
>DEBUG [2024-09-05T12:07:34+02:00] Instance operation lock finished action=start err="Failed to run: /home/edlerd/go/bin/lxd forkstart u4 /var/lib/lxd/containers /var/log/lxd/u4/lxc.conf: exit status 1" instance=u4 project=default reusable=false
ERROR [2024-09-05T12:07:34+02:00] Failed starting instance action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"
DEBUG [2024-09-05T12:07:34+02:00] Failure for operation class=task description="Starting instance" err="Failed to run: /home/edlerd/go/bin/lxd forkstart u4 /var/lib/lxd/containers /var/log/lxd/u4/lxc.conf: exit status 1" operation=734822c8-2785-4708-b3b6-15ec87061638 project=default
DEBUG [2024-09-05T12:07:34+02:00] Event listener server handler stopped listener=dab8b617-4b72-4685-9925-bbc69b9420a3 local=/var/lib/lxd/unix.socket remote=@
DEBUG [2024-09-05T12:07:34+02:00] Handling API request ip=@ method=GET protocol=unix url="/internal/containers/u4/onstopns?netns=%2Fproc%2F82656%2Ffd%2F4&project=default&target=stop" username=root
>DEBUG [2024-09-05T12:07:34+02:00] Instance operation lock created action=stop instance=u4 project=default reusable=false
DEBUG [2024-09-05T12:07:34+02:00] Instance initiated stop action=stop instance=u4 instanceType=container project=default
DEBUG [2024-09-05T12:07:35+02:00] Handling API request ip=@ method=GET protocol=unix url="/internal/containers/u4/onstop?project=default&target=stop" username=root
>DEBUG [2024-09-05T12:07:35+02:00] Instance operation lock inherited for stop action=stop instance=u4 instanceType=container project=default
DEBUG [2024-09-05T12:07:35+02:00] Instance stopped, cleaning up instance=u4 instanceType=container project=default
DEBUG [2024-09-05T12:07:35+02:00] Stopping device device=root instance=u4 instanceType=container project=default type=disk
DEBUG [2024-09-05T12:07:35+02:00] UnmountInstance started driver=dir instance=u4 pool=local-storage project=default
DEBUG [2024-09-05T12:07:35+02:00] UnmountInstance finished driver=dir instance=u4 pool=local-storage project=default
INFO [2024-09-05T12:07:35+02:00] Shut down instance action=stop created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:07:34.499904415 +0000 UTC"
>DEBUG [2024-09-05T12:07:35+02:00] Instance operation lock finished action=stop err="<nil>" instance=u4 project=default reusable=false
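To make the "special semantics" idea above concrete, here is a hypothetical sketch (the instanceStartFailed flag does not exist in LXD today; it only illustrates how the status calculation could treat a stop lock created while cleaning up a failed start):

```go
package main

import (
	"fmt"

	"github.com/canonical/lxd/shared/api"
)

// opLock is a hypothetical, simplified stand-in for LXD's internal
// operationlock; the instanceStartFailed flag is invented for this sketch.
type opLock struct {
	action              string // "start", "stop", ...
	instanceStartFailed bool   // set by the start error handler before the cleanup stop lock is created
}

// reportedStatus extends the mapping from the earlier sketch: a stop lock
// created while handling a failed start keeps reporting Stopped instead of
// flipping the instance to Running for a second.
func reportedStatus(lock *opLock, liveStatus api.StatusCode) api.StatusCode {
	if lock == nil {
		return liveStatus
	}

	switch lock.action {
	case "start":
		return api.Stopped
	case "stop":
		if lock.instanceStartFailed {
			return api.Stopped // cleanup after a failed start: never report Running
		}
		return api.Running
	}

	return liveStatus
}

func main() {
	cleanup := &opLock{action: "stop", instanceStartFailed: true}
	fmt.Println(reportedStatus(cleanup, api.Running)) // Stopped
}
```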
DEBUG [2024-09-05T12:07:34+02:00] Start finished instance=u4 instanceType=container project=default stateful=false
DEBUG [2024-09-05T12:07:34+02:00] Instance operation lock finished action=start err="Failed to run: /home/edlerd/go/bin/lxd forkstart u4 /var/lib/lxd/containers /var/log/lxd/u4/lxc.conf: exit status 1" instance=u4 project=default reusable=false
ERROR [2024-09-05T12:07:34+02:00] Failed starting instance action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"
DEBUG [2024-09-05T12:07:34+02:00] Failure for operation
These lines are the key part in my view.
But also this line is important:
DEBUG [2024-09-05T12:07:34+02:00] Instance initiated stop action=stop instance=u4 instanceType=container project=default
As originally thought, it's the LXC instance itself which stops, not LXD asking it to stop, so it did "start" and then stopped later on.
And it's that instance-initiated stop that causes the stop operationlock to be created, which in turn causes lxc list to report it as running.
DEBUG [2024-09-05T12:07:34+02:00] Instance operation lock created action=stop instance=u4 project=default reusable=false
So somehow we need to resolve the conflict between the instance apparently failing to start:
ERROR [2024-09-05T12:07:34+02:00] Failed starting instance action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"
With the reality that it did start and got far enough to trigger the stop hook when it did finally stop:
DEBUG [2024-09-05T12:07:34+02:00] Handling API request ip=@ method=GET protocol=unix url="/internal/containers/u4/onstopns?netns=%2Fproc%2F82656%2Ffd%2F4&project=default&target=stop" username=root
What is triggering this API call?
I think we could add a flag to this call to say it is part of error handling, so the lock gets that flag and the status calculation can react to it.
> What is triggering this API call?
liblxc - but it can't tell why it's happening, or whether it's happening during startup or later.
I see, thanks for the explanation. Now I get why it is so tricky with the lower-level barrier in between.
Another idea: we introduce a global list of instances with recent failures on start, stop, or the like. The error handler on stop/start/etc. puts an entry into that list. The entries are short-lived (say they get removed automatically after 10 seconds). An entry in this list overrides the instance status calculation. The internal endpoint to stop or start would clear the entry from that list.
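A minimal sketch of that idea; nothing like this failureTracker exists in LXD today, it is only meant to show the shape of the proposal (short-lived entries that override the live status calculation):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/canonical/lxd/shared/api"
)

// failureTracker is a hypothetical sketch of the "global list of instances
// with recent start/stop failures" idea; entries expire automatically after
// a short TTL.
type failureTracker struct {
	mu      sync.Mutex
	entries map[string]api.StatusCode // "project/instance" -> status to force-report
}

func newFailureTracker() *failureTracker {
	return &failureTracker{entries: map[string]api.StatusCode{}}
}

// MarkFailure records the status an instance should report after a failed
// start (Stopped) or a failed stop (Running) and drops the entry after ttl.
func (t *failureTracker) MarkFailure(key string, status api.StatusCode, ttl time.Duration) {
	t.mu.Lock()
	t.entries[key] = status
	t.mu.Unlock()

	time.AfterFunc(ttl, func() {
		t.mu.Lock()
		delete(t.entries, key)
		t.mu.Unlock()
	})
}

// Override returns the forced status for an instance, if any, so the live
// status calculation can be short-circuited while an entry exists.
func (t *failureTracker) Override(key string) (api.StatusCode, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	status, ok := t.entries[key]
	return status, ok
}

func main() {
	tracker := newFailureTracker()

	// The start error handler would do this for instance u4 in project default.
	tracker.MarkFailure("default/u4", api.Stopped, 10*time.Second)

	if status, ok := tracker.Override("default/u4"); ok {
		fmt.Println("forced status:", status) // forced status: Stopped
	}
}
```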
After contemplating this for half a day:
As far as the LXD-UI is concerned, there is a much simpler solution. We can delay the cache invalidation that re-fetches the instance status after an error by a few seconds. This is an async operation anyway, so it is not a problem if we refresh a bit later instead of immediately. We'll surface the error message to the user right away and refresh shortly after. With the delay, we avoid hitting this "reported as running while the instance is just being stopped" bug.
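The LXD-UI itself is TypeScript, but the pattern is simple enough to sketch with the LXD Go client (github.com/canonical/lxd/client): surface the start error immediately, then wait a moment before re-fetching the instance state, by which time the cleanup stop has finished and the status is reported as Stopped again. The two-second delay and the instance name are just placeholders.

```go
package main

import (
	"fmt"
	"time"

	lxd "github.com/canonical/lxd/client"
	"github.com/canonical/lxd/shared/api"
)

func main() {
	// Connect to the local LXD daemon over the unix socket.
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		panic(err)
	}

	// Try to start the instance; "robust-haddock" is the example name from
	// the reproduction above.
	op, err := c.UpdateInstanceState("robust-haddock", api.InstanceStatePut{Action: "start", Timeout: -1}, "")
	if err == nil {
		err = op.Wait()
	}

	if err != nil {
		// Surface the error to the user immediately...
		fmt.Println("start failed:", err)

		// ...but wait a moment before re-fetching the state, so the short
		// window in which the instance is still reported as RUNNING has
		// passed. The two seconds are an arbitrary placeholder.
		time.Sleep(2 * time.Second)
	}

	state, _, err := c.GetInstanceState("robust-haddock")
	if err != nil {
		panic(err)
	}

	fmt.Println("status:", state.Status)
}
```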
Required information

Issue description

When an instance is being started and the start operation fails, the instance is reported as RUNNING for a short amount of time. This is a problem for the UI, where we re-fetch the instance after the failure and the user will see the stopped instance in the wrong status.

Steps to reproduce

Set config.limits.processes to 2, so the instance fails to start:

Error: Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1
Try `lxc info --show-log robust-haddock` for more info

| NAME           | STATE   | IPV4 | IPV6 | TYPE      | SNAPSHOTS |
| robust-haddock | RUNNING |      |      | CONTAINER | 0         |

lxc start robust-haddock ; lxc ls | grep rob ; sleep 1 ; lxc ls | grep rob

Error: Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1
Try `lxc info --show-log robust-haddock` for more info

| robust-haddock | RUNNING |      |      | CONTAINER | 0 |
| robust-haddock | STOPPED |      |      | CONTAINER | 0 |