canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Shortly after an instance start failure, the instance is reported as RUNNING instead of STOPPED #14042

Open edlerd opened 1 month ago

edlerd commented 1 month ago

Required information

Issue description

When an instance is being started and the start operation fails, the instance is reported as RUNNING for a short amount of time. This is a problem for the UI, where we re-fetch the instance after the failure and the user will see the stopped instance in the wrong status.

Steps to reproduce

  1. Create an instance, set config.limits.processes to 2, so the instance fails to start
  2. Start the instance and immediately fetch the instance list. The start command prints an error, while the listing from the second command reports the instance as RUNNING:
    
    lxc start robust-haddock ; lxc ls

Error: Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1
Try `lxc info --show-log robust-haddock` for more info
+----------------+---------+------+------+-----------+-----------+
|      NAME      |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+----------------+---------+------+------+-----------+-----------+
| robust-haddock | RUNNING |      |      | CONTAINER | 0         |
+----------------+---------+------+------+-----------+-----------+

3. The expected behaviour is that the instance is reported with state `STOPPED`. This is the case if we rerun `lxc ls` a second later, but the state should be consistent immediately after the failure as well:

lxc start robust-haddock ; lxc ls | grep rob ; sleep 1 ; lxc ls | grep rob

Error: Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1
Try `lxc info --show-log robust-haddock` for more info
| robust-haddock | RUNNING |      |      | CONTAINER | 0         |
| robust-haddock | STOPPED |      |      | CONTAINER | 0         |

tomponline commented 1 month ago

This seems expected: the container started OK, but then wasn't able to boot due to the process limit and stopped.

edlerd commented 1 month ago

I think that is not true; the start operation fails:

edlerd@EDLERD-T14:~$ lxc start robust-haddock ; lxc operation list
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1
Try `lxc info --show-log robust-haddock` for more info
+--------------------------------------+------+-------------------+---------+------------+----------------------+
|                  ID                  | TYPE |    DESCRIPTION    | STATUS  | CANCELABLE |       CREATED        |
+--------------------------------------+------+-------------------+---------+------------+----------------------+
| e70b39e2-19f8-4ba9-94b7-a5e295f6c854 | TASK | Starting instance | FAILURE | NO         | 2024/09/04 13:27 UTC |
+--------------------------------------+------+-------------------+---------+------------+----------------------+
edlerd@EDLERD-T14:~$ lxc operation show e70b39e2-19f8-4ba9-94b7-a5e295f6c854
id: e70b39e2-19f8-4ba9-94b7-a5e295f6c854
class: task
description: Starting instance
created_at: 2024-09-04T15:27:56.555845781+02:00
updated_at: 2024-09-04T15:27:56.555845781+02:00
status: Failure
status_code: 400
resources:
  instances:
  - /1.0/instances/robust-haddock
metadata: {}
may_cancel: false
err: 'Failed to run: /snap/lxd/current/bin/lxd forkstart robust-haddock /var/snap/lxd/common/lxd/containers
  /var/snap/lxd/common/lxd/logs/robust-haddock/lxc.conf: exit status 1'
location: none
tomponline commented 1 month ago

Yeah, it's a bit tricky: while a start operation is running, LXD reports the container as running, so it only reports it as stopped once the start operation fails.

This is similar to https://github.com/canonical/lxd/issues/13453 (although different situation) in that the running status of an instance can change during the invocation of a single start/stop command.

edlerd commented 1 month ago

I think it might be better to update the status of the instance at the end of an operation, contrary to the current implementation, which changes the status when the operation gets created. In my mind that should solve both issues.

Another approach is to introduce "stopping", "starting", etc. statuses. At the beginning of the operation the status is set to such an in-flux status. Once the operation finishes, depending on whether it failed or succeeded, the status is set to "stopped" or "running".
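The transient-status proposal above could be sketched as follows. This is just an illustration with hypothetical names (LXD does not store an instance status like this), showing how a failed start would land on "Stopped" and a failed stop on "Running":

```go
package main

import "fmt"

// Status values for the sketch; "Starting"/"Stopping" are the proposed
// in-flux states set when the operation is created.
type Status string

const (
	Stopped  Status = "Stopped"
	Starting Status = "Starting"
	Running  Status = "Running"
	Stopping Status = "Stopping"
)

// onOperationDone maps an in-flux status plus the operation outcome to
// the terminal status the instance should report.
func onOperationDone(inFlux Status, succeeded bool) Status {
	switch inFlux {
	case Starting:
		if succeeded {
			return Running
		}
		return Stopped // failed start: back to Stopped
	case Stopping:
		if succeeded {
			return Stopped
		}
		return Running // failed stop: still Running
	}
	return inFlux
}

func main() {
	fmt.Println(onOperationDone(Starting, false)) // Stopped
	fmt.Println(onOperationDone(Stopping, false)) // Running
}
```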

What do you think?

tomponline commented 1 month ago

The "...ing" statuses aren't a new idea (see https://github.com/canonical/lxd/issues/10625) and have their own complexities.

> I think it might be better to update the status of the instance at the end of an operation, contrary to the current implementation, which changes the status when the operation gets created. In my mind that should solve both issues.

One thing to note here is that there isn't a DB entry for instance status; it's always calculated live, so there is no such thing as "updating the status".

We can see the current logic here:

https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_lxc.go#L8171-L8200

Interestingly this part:

https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_lxc.go#L8175-L8177

Seems to suggest that if the operationlock (not the same as the API operation) is in the "starting" state, then the status is returned as stopped (and vice versa for stopping), which is similar to what you suggest.
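That lock-based status logic can be summarized with a small sketch. Function and parameter names here are illustrative, not LXD's actual identifiers; the point is that a held "stop" lock makes the instance report as Running regardless of the process state:

```go
package main

import "fmt"

// reportedStatus sketches the logic referenced above in driver_lxc.go:
// while an operationlock for "start" is held, the instance is still
// reported as Stopped; while a "stop" lock is held, it is still
// reported as Running. Without a lock, the live process state decides.
func reportedStatus(lockAction string, processRunning bool) string {
	switch lockAction {
	case "start":
		return "Stopped" // start in progress: report the pre-start state
	case "stop":
		return "Running" // stop in progress: report the pre-stop state
	}
	if processRunning {
		return "Running"
	}
	return "Stopped"
}

func main() {
	// The bug scenario: a "stop" lock created while cleaning up a failed
	// start makes the instance appear RUNNING even though it never booted.
	fmt.Println(reportedStatus("stop", false)) // Running
	fmt.Println(reportedStatus("", false))     // Stopped
}
```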

So perhaps the start operationlock is completing, the LXC container is effectively started, `lxc list` reports it as running, and then it stops with a failure before the API operation completes.

Will need to look into it in more detail...

edlerd commented 1 month ago

Thanks for the code pointers, I reproduced this with debug logging. See below for details; I marked the lock lines with a `>`.

I think the operation lock works while the instance is being started. On the start error, the lock is released, and another lock gets created shortly after to mark the instance as stopped. I think this 2nd lock is responsible for reporting the instance as "running" (keeping in mind the logic you pointed at above). We can see the 2nd lock takes 1 second until it gets released; I assume this is exactly the window in which the status is reported as running.

This 2nd lock should probably have special semantics marking it as "handling an instance start error", so the logic you pointed at can calculate the status correctly as "stopped" in this case.

Likewise, on an instance stop error, the status logic could get special semantics from the lock that handles the stop error and thus report the instance as "running" if an error occurs on stop.
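The flagged-lock idea could look roughly like this. A sketch with hypothetical names only (not LXD code): the cleanup stop lock carries a flag that the status calculation consults.

```go
package main

import "fmt"

// reportedStatus sketches the proposed special semantics: a "stop" lock
// created while handling a failed start carries startErrorHandling=true,
// so the status calculation keeps reporting Stopped instead of Running.
func reportedStatus(lockAction string, startErrorHandling bool) string {
	if lockAction == "stop" && startErrorHandling {
		return "Stopped" // cleanup after a failed start: never report Running
	}
	switch lockAction {
	case "start":
		return "Stopped" // start still in progress
	case "stop":
		return "Running" // ordinary stop in progress
	}
	return "Stopped"
}

func main() {
	fmt.Println(reportedStatus("stop", true))  // flagged stop lock: Stopped
	fmt.Println(reportedStatus("stop", false)) // ordinary stop lock: Running
}
```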

DEBUG  [2024-09-05T12:07:34+02:00] Start started                                 instance=u4 instanceType=container project=default stateful=false
INFO   [2024-09-05T12:07:34+02:00] Starting instance                             action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"
>DEBUG  [2024-09-05T12:07:34+02:00] Instance operation lock created               action=start instance=u4 project=default reusable=false
DEBUG  [2024-09-05T12:07:34+02:00] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/operations/734822c8-2785-4708-b3b6-15ec87061638 username=edlerd
DEBUG  [2024-09-05T12:07:34+02:00] MountInstance started                         driver=dir instance=u4 pool=local-storage project=default
DEBUG  [2024-09-05T12:07:34+02:00] MountInstance finished                        driver=dir instance=u4 pool=local-storage project=default
DEBUG  [2024-09-05T12:07:34+02:00] Starting device                               device=root instance=u4 instanceType=container project=default type=disk
DEBUG  [2024-09-05T12:07:34+02:00] UpdateInstanceBackupFile started              driver=dir instance=u4 pool=local-storage project=default
DEBUG  [2024-09-05T12:07:34+02:00] Skipping unmount as in use                    driver=dir pool=local-storage refCount=1 volName=u4
DEBUG  [2024-09-05T12:07:34+02:00] UpdateInstanceBackupFile finished             driver=dir instance=u4 pool=local-storage project=default
DEBUG  [2024-09-05T12:07:34+02:00] Handling API request                          ip=@ method=GET protocol=unix url="/internal/containers/u4/onstart?project=default" username=root
DEBUG  [2024-09-05T12:07:34+02:00] Start finished                                instance=u4 instanceType=container project=default stateful=false
>DEBUG  [2024-09-05T12:07:34+02:00] Instance operation lock finished              action=start err="Failed to run: /home/edlerd/go/bin/lxd forkstart u4 /var/lib/lxd/containers /var/log/lxd/u4/lxc.conf: exit status 1" instance=u4 project=default reusable=false
ERROR  [2024-09-05T12:07:34+02:00] Failed starting instance                      action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"
DEBUG  [2024-09-05T12:07:34+02:00] Failure for operation                         class=task description="Starting instance" err="Failed to run: /home/edlerd/go/bin/lxd forkstart u4 /var/lib/lxd/containers /var/log/lxd/u4/lxc.conf: exit status 1" operation=734822c8-2785-4708-b3b6-15ec87061638 project=default
DEBUG  [2024-09-05T12:07:34+02:00] Event listener server handler stopped         listener=dab8b617-4b72-4685-9925-bbc69b9420a3 local=/var/lib/lxd/unix.socket remote=@
DEBUG  [2024-09-05T12:07:34+02:00] Handling API request                          ip=@ method=GET protocol=unix url="/internal/containers/u4/onstopns?netns=%2Fproc%2F82656%2Ffd%2F4&project=default&target=stop" username=root
>DEBUG  [2024-09-05T12:07:34+02:00] Instance operation lock created               action=stop instance=u4 project=default reusable=false
DEBUG  [2024-09-05T12:07:34+02:00] Instance initiated stop                       action=stop instance=u4 instanceType=container project=default
DEBUG  [2024-09-05T12:07:35+02:00] Handling API request                          ip=@ method=GET protocol=unix url="/internal/containers/u4/onstop?project=default&target=stop" username=root
>DEBUG  [2024-09-05T12:07:35+02:00] Instance operation lock inherited for stop    action=stop instance=u4 instanceType=container project=default
DEBUG  [2024-09-05T12:07:35+02:00] Instance stopped, cleaning up                 instance=u4 instanceType=container project=default
DEBUG  [2024-09-05T12:07:35+02:00] Stopping device                               device=root instance=u4 instanceType=container project=default type=disk
DEBUG  [2024-09-05T12:07:35+02:00] UnmountInstance started                       driver=dir instance=u4 pool=local-storage project=default
DEBUG  [2024-09-05T12:07:35+02:00] UnmountInstance finished                      driver=dir instance=u4 pool=local-storage project=default
INFO   [2024-09-05T12:07:35+02:00] Shut down instance                            action=stop created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:07:34.499904415 +0000 UTC"
>DEBUG  [2024-09-05T12:07:35+02:00] Instance operation lock finished              action=stop err="<nil>" instance=u4 project=default reusable=false
tomponline commented 1 month ago
DEBUG  [2024-09-05T12:07:34+02:00] Start finished                                instance=u4 instanceType=container project=default stateful=false
DEBUG  [2024-09-05T12:07:34+02:00] Instance operation lock finished              action=start err="Failed to run: /home/edlerd/go/bin/lxd forkstart u4 /var/lib/lxd/containers /var/log/lxd/u4/lxc.conf: exit status 1" instance=u4 project=default reusable=false
ERROR  [2024-09-05T12:07:34+02:00] Failed starting instance                      action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"
DEBUG  [2024-09-05T12:07:34+02:00] Failure for operation

These lines are the key part in my view.

tomponline commented 1 month ago

But also this line is important:

DEBUG  [2024-09-05T12:07:34+02:00] Instance initiated stop                       action=stop instance=u4 instanceType=container project=default

As originally thought, it's the LXC instance itself that stops, not LXD asking it to stop, so it did "start" and then stopped later on.

tomponline commented 1 month ago

And it's that instance-initiated stop that causes the stop operationlock to be created, which in turn causes `lxc list` to report it as running:

DEBUG  [2024-09-05T12:07:34+02:00] Instance operation lock created               action=stop instance=u4 project=default reusable=false
tomponline commented 1 month ago

So somehow we need to resolve the conflict between the instance apparently failing to start:

ERROR  [2024-09-05T12:07:34+02:00] Failed starting instance                      action=start created="2024-09-05 10:05:00.133727164 +0000 UTC" ephemeral=false instance=u4 instanceType=container project=default stateful=false used="2024-09-05 10:05:03.240495977 +0000 UTC"

With the reality that it did start and got far enough to trigger the stop hook when it did finally stop:

DEBUG  [2024-09-05T12:07:34+02:00] Handling API request                          ip=@ method=GET protocol=unix url="/internal/containers/u4/onstopns?netns=%2Fproc%2F82656%2Ffd%2F4&project=default&target=stop" username=root
edlerd commented 1 month ago

> With the reality that it did start and got far enough to trigger the stop hook when it did finally stop:
>
> DEBUG  [2024-09-05T12:07:34+02:00] Handling API request                          ip=@ method=GET protocol=unix url="/internal/containers/u4/onstopns?netns=%2Fproc%2F82656%2Ffd%2F4&project=default&target=stop" username=root

What is triggering this API call?

I think we could add a flag to this call saying this is on error handling, so the lock gets that flag and the status calculation can react to this flag.

tomponline commented 1 month ago

> What is triggering this API call?

liblxc, but it can't tell why it's happening, or whether it's happening during startup or later.

edlerd commented 1 month ago

I see, thanks for the explanation. Now I get why it is so tricky with the lower-level barrier in between.

Another idea: we introduce a global list of instances with recent failures on start, stop, and the like. The error handler on stop/start/etc. puts an entry into that list. The entries are short-lived (say they get removed automatically after 10 seconds). An entry in this list overrides the instance status calculation. The internal endpoint to stop or start clears the entry in that list.

edlerd commented 1 month ago

After contemplating this for half a day:

As far as the LXD-UI is concerned, there is a much simpler solution. We can defer the clearing of the cache, so the instance status is refetched a few seconds after an error. This is an async operation anyway, so it's not a problem to refresh a bit later instead of immediately. We'll surface the error message to the user right away and refresh shortly after. With the delay, we avoid hitting this "reporting running while the instance is just being stopped" bug.