SVL-PSU / crete-dev

CRETE under development

VMNode deadloops when QEMU fails to start #30

Open moralismercatus opened 7 years ago

moralismercatus commented 7 years ago

Problem Statement

When QEMU fails to start, an exception is thrown (https://github.com/SVL-PSU/crete-dev/blob/2124206a7dea46842582e1da43161b891a9465c7/lib/cluster/vm_node_fsm.cpp#L725). VMNode's recovery mechanism then attempts a recovery, tries to start QEMU again with the same result, and repeats this indefinitely.

Note that this deadloop does not occur when QEMU terminates after VMNodeFSM has consumed a test case, because the test cases are eventually exhausted. In the failure described here, however, no test case has been consumed yet, so nothing ever ends the retry cycle.

We have encountered this in two scenarios.

  1. The GUI for QEMU is having technical difficulties (as can occur with Xming and PuTTY).
  2. The QEMU image has somehow been corrupted.
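
To make the difference between the two failure modes concrete, here is a minimal, simplified model of the retry cycle. It is not CRETE's actual code: `simulate_recovery`, `LoopResult`, and the `cap` parameter are hypothetical names introduced only for illustration, and the cap exists solely so the model itself cannot spin forever.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, simplified model of the recovery cycle (not CRETE's code).
struct LoopResult {
    int attempts;     // number of start attempts made
    bool terminated;  // true if the loop ended on its own, false if the cap was hit
};

// start_ok:      whether QEMU starts and consumes one test case before failing.
// pending_tests: test cases still waiting to be run.
// cap:           safety limit for this model only; the real loop has no such limit.
LoopResult simulate_recovery(bool start_ok, int pending_tests, int cap) {
    assert(cap > 0);
    int attempts = 0;
    while (pending_tests > 0) {
        if (attempts == cap) {
            return {attempts, false};  // would have looped forever
        }
        ++attempts;
        try {
            if (!start_ok) {
                // Failure before any test case is consumed: no progress is made.
                throw std::runtime_error("QEMU failed to start");
            }
            --pending_tests;  // a test case was consumed before the failure
            throw std::runtime_error("QEMU terminated during a test");
        } catch (const std::runtime_error&) {
            // Recovery: fall through and try again.
        }
    }
    return {attempts, true};
}
```

With `start_ok == true`, every failed attempt still removes one test case, so the loop ends after `pending_tests` attempts. With `start_ok == false`, the set of pending test cases never shrinks and only the artificial cap stops the loop.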

Solution

One solution is to throw a special exception indicating that the failure originated while starting the VM, so that recovery is not attempted.

See https://github.com/moralismercatus/crete-dev/commit/13c966ad64b89fc9b4fafc83224a48f0f78f40ff

In essence, I added a new exception, VMNoRecoveryException, that is thrown from start_vm and causes a transition to the Terminate state instead of the Error state. This way, VMNode does not attempt to reboot the VM.

This is not a complete solution: the deadloop no longer occurs, but for some reason CRETE still does not terminate. Another issue is that errors are not propagated back to Dispatch when the Terminate state is used. A more thorough fix is needed.
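
The idea of routing a start-up failure to Terminate rather than Error could be sketched roughly as follows. This is an illustrative sketch, not the code from the linked commit: only the names VMNoRecoveryException, start_vm, Terminate, and Error come from the description above, while the `State` enum, the `Running` state, the `next_state` function, and all signatures are assumptions made for the example.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical state set; the real FSM has more states.
enum class State { Running, Error, Terminate };

// Same name as the exception described above; this definition is an assumption.
struct VMNoRecoveryException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for start_vm: throws when QEMU cannot be launched.
void start_vm(bool qemu_launches) {
    if (!qemu_launches) {
        throw VMNoRecoveryException("QEMU failed to start");
    }
}

// Decide the next state based on where the failure originated.
State next_state(bool qemu_launches, bool fails_during_test) {
    try {
        start_vm(qemu_launches);
        if (fails_during_test) {
            throw std::runtime_error("QEMU terminated during a test");
        }
        return State::Running;
    } catch (const VMNoRecoveryException&) {
        // Start-up failure: retrying would fail the same way, so do not recover.
        return State::Terminate;
    } catch (const std::exception&) {
        // Any other failure: go to Error, where recovery is attempted.
        return State::Error;
    }
}
```

The more specific handler has to come first, since VMNoRecoveryException derives from std::runtime_error here and would otherwise be caught by the general handler. The sketch does not address the two open problems mentioned above (CRETE not terminating, and errors not reaching Dispatch).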