canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0
4.34k stars 930 forks source link

lxc start fails despite stopped state #13453

Open holmanb opened 5 months ago

holmanb commented 5 months ago

Required information

Issue description

When an instance has recently shutdown and reports a state of STOPPED, an attempt to start the instance may fail with an error: The instance is already running. I observe this on occasion while manually running commands, but I wasn't bothered too much by the behavior so I never bothered to report. I discovered a bug in our test framework that prevents us from running certain tests, so I'm reporting it now.

Steps to reproduce

I see this in an integration test that fails intermittently, and I can reproduce this locally with the test running in a loop. A trivial reproducer could be made that launches an instance, shuts it back down (our test uses guest-initiated shutdown), then waits for STOPPED state before running lxc start.

Information to attach

Note that messages about instance shutdown occur both before and after the start fails, yet the code that initiates the lxc start doesn't run until STOPPED is the reported instance state.

holmanb commented 5 months ago

@tomponline thanks for brainstorming this issue earlier, I've collected some additional information to ease debugging. I discovered a couple of more things that are relevant:

1) Like you suspected, immediately after the lxc start fails, lxc list shows the state RUNNING.

2) This does not reproduce when running lxc stop manually rather than a guest-initiated shutdown

The following script reproduces the error while collecting lxc monitor logs.

For me this fails reliability in about 5 seconds.

#!/bin/sh

NAME=me
LOG=./reproduced.log

# stop if started
lxc stop $NAME 2> /dev/null || true

while true; do

    # monitor, wipe log if repro failed
    (lxc monitor me --pretty > $LOG) &

    echo "starting instance"
    lxc start $NAME

    # wait until dbus is available (required for shutdown to work)
    lxc exec $NAME -- sh -c 'while [ ! -S /run/dbus/system_bus_socket ]; do sleep 0.1; done'

    # shutdown from the host
    echo "shutting down the instance"
    lxc exec $NAME -- shutdown -H now
    while true; do

        status=$(lxc list $NAME -cs --format csv)
        if [ $status = STOPPED ] ; then

            # this will fail sometimes
            lxc start $NAME
            rc="$?"
            if [ $rc ]; then
                echo "lxd start failed"
                rerun="$(lxc list -cs --format csv $NAME)"
                echo "lxc state immediately after failure: $rerun"
                exit $rc
            fi

            # no error, retry
            echo "did not repro, retrying"
            lxc stop $NAME
            break
        fi
    done
done
holmanb commented 5 months ago

And a monitor log using --pretty:

reproduced.log

tomponline commented 5 months ago
  • Like you suspected, immediately after the lxc start fails, lxc list shows the state RUNNING.

  • This does not reproduce when running lxc stop manually rather than a guest-initiated shutdown

Perfect thanks, so that confirms my suspicion that there is a tiny duration of time where liblxc is reporting the container's state as stopped, before it has notified LXD that the guest has self-stopped which would trigger LXD's stop cleanup operation (which is then reporting the container's status as running).