Open holmanb opened 5 months ago
@tomponline thanks for brainstorming this issue earlier, I've collected some additional information to ease debugging. I discovered a couple of more things that are relevant:
1) Like you suspected, immediately after the lxc start
fails, lxc list
shows the state RUNNING
.
2) This does not reproduce when running lxc stop
manually rather than a guest-initiated shutdown
The following script reproduces the error while collecting lxc monitor logs.
For me this fails reliability in about 5 seconds.
#!/bin/sh
NAME=me
LOG=./reproduced.log
# stop if started
lxc stop $NAME 2> /dev/null || true
while true; do
# monitor, wipe log if repro failed
(lxc monitor me --pretty > $LOG) &
echo "starting instance"
lxc start $NAME
# wait until dbus is available (required for shutdown to work)
lxc exec $NAME -- sh -c 'while [ ! -S /run/dbus/system_bus_socket ]; do sleep 0.1; done'
# shutdown from the host
echo "shutting down the instance"
lxc exec $NAME -- shutdown -H now
while true; do
status=$(lxc list $NAME -cs --format csv)
if [ $status = STOPPED ] ; then
# this will fail sometimes
lxc start $NAME
rc="$?"
if [ $rc ]; then
echo "lxd start failed"
rerun="$(lxc list -cs --format csv $NAME)"
echo "lxc state immediately after failure: $rerun"
exit $rc
fi
# no error, retry
echo "did not repro, retrying"
lxc stop $NAME
break
fi
done
done
And a monitor log using --pretty
:
Like you suspected, immediately after the
lxc start
fails,lxc list
shows the stateRUNNING
.This does not reproduce when running
lxc stop
manually rather than a guest-initiated shutdown
Perfect thanks, so that confirms my suspicion that there is a tiny duration of time where liblxc is reporting the container's state as stopped, before it has notified LXD that the guest has self-stopped which would trigger LXD's stop cleanup operation (which is then reporting the container's status as running).
Required information
Issue description
When an instance has recently shutdown and reports a state of
STOPPED
, an attempt to start the instance may fail with an error:The instance is already running
. I observe this on occasion while manually running commands, but I wasn't bothered too much by the behavior so I never bothered to report. I discovered a bug in our test framework that prevents us from running certain tests, so I'm reporting it now.Steps to reproduce
I see this in an integration test that fails intermittently, and I can reproduce this locally with the test running in a loop. A trivial reproducer could be made that launches an instance, shuts it back down (our test uses guest-initiated shutdown), then waits for
STOPPED
state before runninglxc start
.Information to attach
Note that messages about instance shutdown occur both before and after the start fails, yet the code that initiates the
lxc start
doesn't run untilSTOPPED
is the reported instance state.