Open muratmaga opened 3 weeks ago
@jcfr can we implement something on the instances (much like your session script) that sends home a signal that it is up and ready (like a boot script)? Only after receiving that signal would we consider the action (create/unshelve) completed. If we don't receive the signal within the timeout period, we delete the instance and recreate it.
The main problem is not so much that the instance doesn't launch, but that the workflow stops and we need to intervene manually.
Obviously, we don't want to create an infinite loop either (e.g., with hypervisor problems). So maybe try 3-4 times, then give up and notify explicitly. At a minimum, the label should be something like "errored state" or "problem"; I think currently it simply says offline.
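To make the idea concrete, here is a minimal sketch of the retry loop in Python. Everything in it is an assumption for illustration: `launch_with_retries` and the `start`, `is_ready`, and `delete` callbacks are hypothetical names, not existing workflow code or OpenStack calls, and the default timeout and attempt count are placeholders.

```python
import time
from typing import Callable, Optional


def launch_with_retries(
    start: Callable[[], str],          # hypothetical: create/unshelve, returns an instance id
    is_ready: Callable[[str], bool],   # hypothetical: True once the ready signal has arrived
    delete: Callable[[str], None],     # hypothetical: deletes a stuck or errored instance
    timeout_s: float = 600.0,          # placeholder timeout per attempt
    poll_s: float = 10.0,              # placeholder polling interval
    max_attempts: int = 3,             # bounded, so there is no infinite loop
) -> Optional[str]:
    """Return the instance id once it reports ready, or None if every attempt fails."""
    for _ in range(max_attempts):
        instance_id = start()
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_ready(instance_id):
                return instance_id
            time.sleep(poll_s)
        # No ready signal within the timeout: clean up before the next attempt.
        delete(instance_id)
    # All attempts exhausted: the caller should label the instance as errored and notify.
    return None
```

A `None` return is the point where the workflow would set an explicit "errored" label and send a notification instead of leaving the instance marked offline.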
If an action (creation, unshelving, shelving) times out, the instance often ends up in an errored state in OpenStack and requires manual intervention (which usually means deleting and then recreating it).
Instance setup times seem to vary drastically during the day, based on the load on JS2. Sometimes an instance is online within 2 minutes of a fresh /create command, and sometimes /unshelve takes more than 10 minutes and then times out.
Not sure what the most robust solution would be, but here are a couple of options to explore:
I am open to other suggestions as well...