alces-software / flight-appliance-templates

Orchestration templates for Alces Flight appliances
www.alces-flight.com
GNU Affero General Public License v3.0

Scaling group problems in a non-spot non-scaling cluster #10

Open vlj91 opened 7 years ago

vlj91 commented 7 years ago

From @ste78 on May 22, 2016 18:29

The CF templates still create a scaling group with a minimum size of 1 when using a non-spot, non-scaling cluster. If you then shut down all nodes (because you want to bring them back later), the scaling group panics and brings up others, then randomly terminates one of the others. Ideally we want to do something to prevent this in a cluster where the user has requested no scaling, because the nodes should be seen as non-ephemeral.

Copied from original issue: alces-software/flight-aws-marketplace#65

vlj91 commented 7 years ago

From @ste78 on May 22, 2016 18:53

Hmm, closer investigation shows this isn't the only problem. I now have min nodes set to 0 and the problem still happens. It looks like the built-in 'instance is unhealthy' trigger fires when an instance is in the 'stopped' state :( Cause: At 2016-05-22T18:47:40Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped
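For anyone wanting to confirm the same cause on their own cluster, here is a minimal sketch that pulls the recent scaling activities and keeps the ones whose Cause mentions a health check. It assumes a boto3 `autoscaling` client is passed in; the function name and the group name in the usage comment are hypothetical, not something from the templates.

```python
def health_check_removals(autoscaling, group_name, max_records=20):
    """Return the Cause text of recent scaling activities that were
    triggered by a health check.

    `autoscaling` is assumed to be a boto3 Auto Scaling client, e.g.
    boto3.client("autoscaling"). `group_name` is the scaling group name.
    """
    response = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=group_name,
        MaxRecords=max_records,
    )
    return [
        activity["Cause"]
        for activity in response.get("Activities", [])
        if "health check" in activity.get("Cause", "")
    ]


# Usage (hypothetical group name):
#   import boto3
#   for cause in health_check_removals(boto3.client("autoscaling"), "my-cluster-asg"):
#       print(cause)
```

If stopped nodes are being replaced for the reason quoted above, the returned strings should look like the "taken out of service in response to a EC2 health check" message.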

vlj91 commented 7 years ago

From @ste78 on May 22, 2016 19:34

This is going to need a similar solution to #63. We're going to need a 'prepare cluster for shutdown' script. The manual process to put nodes into 'stopped' mode is to edit the autoscaling group and set all nodes to the 'Standby' state (this seems to pause the instance health checks); then you can run shutdown -h on the instances. Bring them back later and set the instances' state in the autoscaling group back to 'InService'.
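A rough sketch of what such a script could look like, following the manual process above. This is not an existing Flight tool. It assumes boto3 clients for `autoscaling` and `ec2` are passed in, and it swaps the in-instance `shutdown -h` for the EC2 stop call so everything can be driven from one place. All function names are hypothetical. It also assumes the group's minimum size is low enough (for example 0) for the desired capacity to be decremented, otherwise entering standby is expected to be rejected.

```python
import time


def chunks(items, size=20):
    """Yield batches; standby calls accept a limited number of IDs per request."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def instance_ids(autoscaling, group_name, state):
    """List instance IDs in the group that are in the given lifecycle state."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    return [
        instance["InstanceId"]
        for group in response["AutoScalingGroups"]
        for instance in group["Instances"]
        if instance["LifecycleState"] == state
    ]


def wait_for_state(autoscaling, group_name, ids, state, timeout=300, poll=5):
    """Poll until every ID in `ids` reports `state`, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while set(ids) - set(instance_ids(autoscaling, group_name, state)):
        if time.monotonic() > deadline:
            raise TimeoutError(f"instances did not reach {state}")
        time.sleep(poll)


def prepare_for_shutdown(autoscaling, ec2, group_name):
    """Move all InService nodes to Standby, then stop them."""
    ids = instance_ids(autoscaling, group_name, "InService")
    for batch in chunks(ids):
        autoscaling.enter_standby(
            InstanceIds=batch,
            AutoScalingGroupName=group_name,
            # Decrement so the group does not launch replacements.
            ShouldDecrementDesiredCapacity=True,
        )
    if ids:
        wait_for_state(autoscaling, group_name, ids, "Standby")
        ec2.stop_instances(InstanceIds=ids)
    return ids


def resume_cluster(autoscaling, ec2, group_name):
    """Start the Standby nodes again and return them to InService."""
    ids = instance_ids(autoscaling, group_name, "Standby")
    if ids:
        ec2.start_instances(InstanceIds=ids)
        ec2.get_waiter("instance_running").wait(InstanceIds=ids)
        for batch in chunks(ids):
            autoscaling.exit_standby(
                InstanceIds=batch,
                AutoScalingGroupName=group_name,
            )
    return ids
```

Usage would be along the lines of `prepare_for_shutdown(boto3.client("autoscaling"), boto3.client("ec2"), group)` before going home, and `resume_cluster(...)` with the same arguments when the nodes are wanted back. Waiting for 'Standby' before stopping is deliberate: stopping a node that is still 'InService' is exactly what triggers the replacement behaviour described earlier in this issue.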