The problem is a race condition between processes using the tgt-admin --update command. This command finds the first unallocated target ID (tid) and passes it to the lower-level command tgtadm. Multiple processes can pick the same tid, causing one of the colliding invocations to fail.
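To make the race concrete, here is a sketch of the interleaving (the tid value and iqn are illustrative only, not taken from actual logs):

# Both processes scan the existing targets to find the first free tid:
tgtadm --lld iscsi --op show --mode target    # both see tids 1-4 in use
# Both then try to create their new target with the same tid:
tgtadm --lld iscsi --op new --mode target --tid 5 \
    -T iqn.2013-06.example:volume             # first invocation wins
# The second "op new" fails because tid 5 is now taken.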
The simple solution is to retry the command on failure. On a working cloud, the following script was added as /usr/sbin/tgt-admin-retry.sh:
#!/bin/bash
# Retry wrapper around tgt-admin to work around the tid race condition
# described above.
function retry {
    nTrys=0
    maxTrys=3
    status=256
    until [ $status -eq 0 ] ; do
        # Quote "$@" so arguments containing spaces are passed through intact.
        /usr/sbin/tgt-admin "$@"
        status=$?
        nTrys=$(($nTrys + 1))
        if [ $nTrys -gt $maxTrys ] ; then
            echo "Number of retries exceeded. Exit code: $status"
            exit $status
        fi
        if [ $status -ne 0 ] ; then
            echo "Failed (exit code $status)... retry $nTrys"
            sleep 2
        fi
    done
}
retry "$@"
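The wrapper is a drop-in replacement, so callers pass exactly the arguments they would pass to tgt-admin itself. For example (the --update ALL form reprocesses every configured target):

chmod +x /usr/sbin/tgt-admin-retry.sh
/usr/sbin/tgt-admin-retry.sh --update ALL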
This also required two changes in the LVMBackend.py script: replacing /usr/sbin/tgt-admin with /usr/sbin/tgt-admin-retry.sh, and adding the entry 'reload_iscsi_snap':'.*', to the success_msg_pattern variable.
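The first of those edits can be applied mechanically; this is only a sketch, to be run once against the original file (re-running it would mangle the already-rewritten path):

# Point LVMBackend.py at the retry wrapper instead of tgt-admin itself.
sed -i 's|/usr/sbin/tgt-admin|/usr/sbin/tgt-admin-retry.sh|g' LVMBackend.py
# The 'reload_iscsi_snap':'.*', entry is best added to success_msg_pattern
# by hand, since the variable's surrounding context is not reproduced here.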
This has been tested by submitting 40 VMs concurrently: without the fix, between 2 and 5 failures were seen; with the fix, no failures appear.
A real solution for retries needs to be incorporated into the backend scripts.
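One possible direction, offered only as a sketch: instead of retrying after collisions, serialize the critical section so that concurrent callers can never compute the same tid. For example, with flock(1) from util-linux (the lock file path is an assumption):

#!/bin/bash
# Sketch: hold an exclusive lock for the duration of tgt-admin, so only
# one process at a time allocates a target ID.
exec flock /var/lock/tgt-admin.lock /usr/sbin/tgt-admin "$@"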
I deployed this temporary solution on our StratusLab v13.05 cloud and then ran 40 VMs simultaneously. Very few VMs failed (instead of many!) by reaching the maximum retry count, and that is easy to fix by adjusting the number of retries or the sleep time.
The backend script for LVM causes intermittent failures when creating virtual machines, with an error message appearing in the pdisk log on the server.