StratusLab / client

Command Line Interface (CLI) for StratusLab cloud services
Apache License 2.0

VM creation with LVM backend fails intermittently #149

Closed: loomis closed this issue 5 years ago

loomis commented 10 years ago

The backend script for LVM causes intermittent failures when creating virtual machines with the error message:

<<<<<<<<<<
tgtadm: this target already exists
Command: tgtadm -C 0 --lld iscsi --op new --mode target --tid 502 -T iqn.2011-01.eu.stratuslab:5e9d71ab-0548-4451-b3a8-28beafffb0a4
exited with code: 22.
>>>>>>>>>>

in the pdisk log on the server.

loomis commented 10 years ago

The problem is a race condition between processes running the tgt-admin --update command. This command picks the first unallocated target ID (tid) and passes it to the lower-level tgtadm command. Concurrent processes can pick the same tid, in which case one of the tgtadm invocations fails with the "this target already exists" error.
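The race can be illustrated with a simplified model. This is only a sketch: `first_free_tid` is a hypothetical helper that mimics "use the first unallocated target ID"; it is not how tgt-admin is actually implemented.

```shell
# Hypothetical helper: print the lowest positive integer that is not
# among the target IDs passed as arguments.
first_free_tid() {
    local tid=1
    local used=" $* "
    while [[ "$used" == *" $tid "* ]]; do
        tid=$((tid + 1))
    done
    echo "$tid"
}

# Two processes look at the same snapshot of allocated target IDs
# before either of them has created its new target.
snapshot="1 2 3"
tid_a=$(first_free_tid $snapshot)
tid_b=$(first_free_tid $snapshot)

# Both pick the same tid, so the second "new target" call would collide.
echo "process A picks tid $tid_a, process B picks tid $tid_b"
# prints: process A picks tid 4, process B picks tid 4
```

Whichever process creates its target second is the one that fails.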

The simple solution is to retry the command on failure. On a working cloud, the following script was added in /etc/sbin/tgt-admin-retry.sh:

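#!/bin/bash

# Retry tgt-admin to work around the race on target ID allocation.
function retry {
   nTrys=0
   maxTrys=3
   status=256
   until [ "$status" -eq 0 ] ; do
      # Quote "$@" so arguments containing spaces are passed through intact.
      /usr/sbin/tgt-admin "$@"
      status=$?
      if [ "$status" -ne 0 ] ; then
            nTrys=$(($nTrys + 1))
            # Give up only after a failure, once maxTrys retries have been used.
            if [ "$nTrys" -gt "$maxTrys" ] ; then
                  echo "Number of re-trys exceeded. Exit code: $status"
                  exit $status
            fi
            echo "Failed (exit code $status)... retry $nTrys"
            sleep 2
      fi
   done
}

retry "$@"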

This also required the following changes in the LVMBackend.py script:

This has been tested by submitting 40 VMs concurrently. Without the fix, between 2 and 5 failures were seen. With the fix, no failures appeared.
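A test like this can be approximated by starting many launches in parallel and counting the non-zero exit codes. The following is a minimal sketch, not the procedure actually used here: `run_concurrently` and `launch_vm` are hypothetical names, and `launch_vm` is a stand-in that would have to be replaced by the real VM submission command.

```shell
# Run N copies of a command in parallel and print how many of them failed.
run_concurrently() {
    local n=$1
    shift
    local pids=()
    local failures=0
    local i pid

    for ((i = 0; i < n; i++)); do
        "$@" &
        pids+=("$!")
    done

    # Collect each exit status; a non-zero status counts as one failure.
    for pid in "${pids[@]}"; do
        wait "$pid" || failures=$((failures + 1))
    done
    echo "$failures"
}

# Stand-in for the real VM launch command (hypothetical); always succeeds.
launch_vm() { true; }

echo "failures: $(run_concurrently 40 launch_vm)"
```

With a real launch command in place of the stand-in, the printed count can be compared with and without the retry wrapper.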

loomis commented 10 years ago

A real solution for retries needs to be incorporated into the backend scripts.

cgauthey commented 10 years ago

I deployed this temporary solution on our StratusLab v13.05 cloud and then launched 40 VMs simultaneously. Only a few VMs failed (instead of many!), each after reaching the maximum number of retries. That is easy to fix by adjusting the number of retries or the sleep time.
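To make that tuning possible without editing the script, the limits could be read from the environment. This is a sketch under assumptions: `RETRY_MAX` and `RETRY_SLEEP` are hypothetical variable names (not StratusLab settings), and `flaky` is a stand-in for tgt-admin that fails twice with exit code 22 before succeeding.

```shell
# Retry a command; the retry count and delay come from the environment.
# RETRY_MAX and RETRY_SLEEP are hypothetical names with the script's defaults.
retry() {
    local max_retries=${RETRY_MAX:-3}
    local delay=${RETRY_SLEEP:-2}
    local attempt=0
    local status

    while true; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "Retries exhausted (exit code $status)" >&2
            return "$status"
        fi
        attempt=$((attempt + 1))
        echo "Failed (exit code $status)... retry $attempt of $max_retries" >&2
        sleep "$delay"
    done
}

# Stand-in for tgt-admin: fails twice with code 22, then succeeds.
calls=0
flaky() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ] || return 22
}

RETRY_MAX=5 RETRY_SLEEP=0 retry flaky
echo "status=$? after $calls calls"
# prints: status=0 after 3 calls
```

In a real deployment the wrapped command would be `/usr/sbin/tgt-admin "$@"`, and a site seeing residual failures could raise `RETRY_MAX` or `RETRY_SLEEP` as needed.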