MorphoCloud / MorphoCloudWorkflow

Reusable GitHub Workflows to manage JetStream2 backed on-demand virtual machines
BSD 2-Clause "Simplified" License

Make the actions more robust #50

Open muratmaga opened 3 weeks ago

muratmaga commented 3 weeks ago

If an action (creation, unshelving, shelving) times out, the instance often ends up in an errored state in OpenStack and requires manual intervention (usually deleting and then recreating it).

Instance setup times seem to vary drastically during the day, based on the load on JS2. Sometimes an instance is online within 2 minutes of a fresh /create command, and sometimes /unshelve takes more than 10 minutes and then times out.

Not sure what the most robust solution would be, but here are a couple of options to explore:

  1. Increase the timeout to a much larger value (20-30 minutes). But if other actions are waiting in the queue, I don't know what the consequences would be.
  2. Modify the actions so that they do not run from beginning to end, but instead initiate the task and then periodically check for its completion while running other queued actions (if any) in the meantime. I am not sure this is even possible with GH actions, and it seems too complicated to implement.
  3. Give users the authority to delete the instance, so that if they see an error state they can keep deleting and recreating it. Of course, there is no guarantee that the next attempt will finish faster.

I am open to other suggestions as well...
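
To make the polling idea in option 2 concrete, here is a minimal Python sketch of a wait loop with a deadline. It is an illustration only, not how the workflows currently work: `wait_for_status` and its `get_status` callable are hypothetical, and `get_status` is assumed to return the instance status string reported by OpenStack (for example `BUILD`, `ACTIVE`, or `ERROR`). The sleep and clock functions are injected so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "ACTIVE",
    timeout_s: float = 1800.0,   # 30 minutes, the upper bound floated in option 1
    interval_s: float = 15.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the instance reaches `target`, errors out, or the deadline passes.

    Returns "ready", "error", or "timeout" so the caller can decide what to do
    next instead of leaving the workflow in an unknown state.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()  # hypothetical wrapper around the OpenStack API/CLI
        if status == target:
            return "ready"
        if status == "ERROR":
            return "error"
        if clock() >= deadline:
            return "timeout"
        sleep(interval_s)
```

Returning an explicit outcome, rather than raising on timeout, would let a later step choose between retrying, deleting the instance, or relabeling the issue.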

muratmaga commented 2 weeks ago

@jcfr can we implement something on the instances (much like your session script) that sends home a signal when the instance is up and ready (like a boot script)? Only after that would we consider the action (create/unshelve) completed. If we don't receive the signal within the timeout period, the workflow deletes the instance and recreates it.

The main problem is not so much that the instance doesn't launch, but that the workflow stops and we need to intervene manually.

Obviously, we don't want to create an infinite loop either (e.g., because of hypervisor problems). So maybe try 3-4 times, then give up and notify explicitly (at a minimum, the label should be something like "errored state" or "problem"; I think it currently just says "offline").
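
The bounded retry described above could look roughly like the following Python sketch. Everything here is an assumption for illustration: `create_with_retries` and the four callables are hypothetical stand-ins for whatever the workflow uses to create an instance, wait for its ready signal, delete it, and set the issue label. The `"online"` and `"errored"` label names are placeholders too.

```python
from typing import Callable


def create_with_retries(
    create: Callable[[], None],
    wait_ready: Callable[[], bool],    # True once the ready signal arrives in time
    delete: Callable[[], None],
    set_label: Callable[[str], None],
    max_attempts: int = 3,             # "try 3-4 times then give up"
) -> bool:
    """Create an instance, recreating it after each failed attempt.

    Gives up after `max_attempts` and sets an explicit error label, so a
    persistent problem (e.g. on the hypervisor) cannot cause an infinite loop.
    """
    for _ in range(max_attempts):
        create()
        if wait_ready():
            set_label("online")   # placeholder label name
            return True
        delete()                  # clean up the failed instance before retrying
    set_label("errored")          # placeholder label name; explicit, not "offline"
    return False
```

In this sketch the failed instance is deleted even after the last attempt, so nothing is left behind in an errored state. Whether that is desirable, or whether the last failed instance should be kept for debugging, is a design choice.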