MorphoCloud / MorphoCloudWorkflow

Reusable GitHub Workflows to manage JetStream2 backed on-demand virtual machines
BSD 2-Clause "Simplified" License

Improper Success Reporting After Control Command Polling Timeout #20

Closed (jcfr closed this issue 1 month ago)

jcfr commented 1 month ago

When polling for a control command such as shelve times out, subsequent commands may incorrectly report success.

For instance, polling for the shelve command timed out before the server status changed from ACTIVE to SHELVED_OFFLOADED.

This issue can be observed in the following links:

After the timeout, an attempt was made to unshelve the instance via the /unshelve comment. Because the shelve transition had not completed, the server status was still ACTIVE, which is also the status expected after a successful unshelve, so the unshelve command incorrectly reported success. This happened because the workflow does not currently check the "OS-EXT-STS:task_state" server property, which exposes intermediate states while a status transition is in progress.
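
The false positive described above can be illustrated with a minimal sketch. The function names and the sample server record are hypothetical and are not taken from the workflow; `status` and `OS-EXT-STS:task_state` are the server properties named in this issue, and the `shelving` value is assumed here only as an example of an in-progress task state.

```python
# Hypothetical helpers illustrating the bug; not the workflow's actual code.

def naive_unshelve_succeeded(server: dict) -> bool:
    """Status-only check: treats any ACTIVE server as successfully unshelved."""
    return server.get("status") == "ACTIVE"


def unshelve_succeeded(server: dict) -> bool:
    """Stricter check: ACTIVE only counts when no transition is pending."""
    return (
        server.get("status") == "ACTIVE"
        and server.get("OS-EXT-STS:task_state") is None
    )


# Assumed snapshot of a server whose shelve request is still being processed:
# the status has not changed yet, but a task state is set.
mid_shelve = {"status": "ACTIVE", "OS-EXT-STS:task_state": "shelving"}

print(naive_unshelve_succeeded(mid_shelve))
print(unshelve_succeeded(mid_shelve))
```

The status-only check accepts the mid-shelve snapshot, while the stricter check rejects it because a task state is still set.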

For more details on task states and server statuses, please refer to the following documentation:

Proposed Solution:

  1. Check for a pending transition (a non-empty "OS-EXT-STS:task_state") before reporting a command as successful, so that the server status has actually been updated before subsequent commands proceed.
  2. Consider increasing the polling timeout for control commands.
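
The two proposals could be combined along the following lines. This is a sketch under stated assumptions, not the fix that was committed: `wait_for_status`, `TransitionTimeout`, the `fetch_server` callable, and the default timeout and interval values are all hypothetical. `fetch_server` stands in for whatever call returns the server's current properties as a dictionary.

```python
import time
from typing import Callable


class TransitionTimeout(Exception):
    """Raised when the server does not settle in the expected status in time."""


def wait_for_status(
    fetch_server: Callable[[], dict],
    expected_status: str,
    timeout_s: float = 600.0,   # assumed default; tune for the deployment
    interval_s: float = 10.0,   # assumed default polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the server has the expected status and no pending task.

    A timeout raises instead of returning, so a caller cannot mistake an
    unfinished transition for success.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        server = fetch_server()
        status = server.get("status")
        task_state = server.get("OS-EXT-STS:task_state")
        # Success requires both the target status and no transition in flight.
        if status == expected_status and task_state is None:
            return server
        if time.monotonic() >= deadline:
            raise TransitionTimeout(
                f"expected {expected_status}, got status={status!r} "
                f"task_state={task_state!r}"
            )
        sleep(interval_s)
```

Raising on timeout is a deliberate choice in this sketch: the calling step fails visibly, and a later command can inspect the task state itself before deciding whether to proceed.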

jcfr commented 1 month ago

Fixed in https://github.com/MorphoCloud/MorphoCloudWorkflow/commit/e2bbe0cea143116dc707f4a686fce91ce2d1859d