gabegorelick commented 4 years ago

GitHub issue

Issue type

Feature idea

Build number

master

Summary

By default, when AutoSpotting gets a CloudWatch Event signaling that a spot instance is due to be terminated soon (a "2 minute warning"), it detaches the instance from the ASG. This ensures that the ASG will start bringing up a replacement instance before the spot instance stops. The spot instance stays running (but unattached to the ASG) so that it can still (hopefully) do useful work in the meantime.

However, detaching the instance before its replacement is online means diminished capacity in the ASG. For example, if the spot instance is attached to a load balancer, then that unattached instance is not serving traffic, so you can potentially have an extended period with fewer usable instances.

When an ASG has a lifecycle hook, AutoSpotting terminates the instance instead of detaching it. But this can lead to similar downtime. AutoSpotting assumes that the lifecycle hooks will block termination of the instance until new capacity is online. In practice, I think most lifecycle hooks simply drain work off the terminating instances. But if there isn't enough capacity to shift that work onto existing instances, then you'd suffer downtime. I guess the lifecycle hook could launch a new instance, but that seems like it would cause a lot of problems (e.g. ASG updates that launch a new instance then terminate the old one wouldn't work).

All this can be extra dangerous if a significant portion of your spot instances get interrupted at the same time: AutoSpotting can terminate all your instances at once.

A similar issue happens when AutoSpotting detects the ratio of spot to on-demand instances is too high. It terminates a random spot instance and lets the ASG bring up a replacement afterwards.

Steps to reproduce

1. Have an ASG with a spot instance.
2. Somehow get the spot instance interrupted.
3. Notice that the ASG size is below the desired count for the interval between the spot instance warning and the launch of the on-demand instance.

Expected results

AutoSpotting launches on-demand instance before terminating or detaching spot instance.

Actual results

AutoSpotting terminates or detaches spot instance before replacement is online.
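
To make the requested ordering concrete, here is a minimal sketch of "replace first, detach second". Everything in it is an assumption made for illustration: the `Fleet` interface, `ReplaceThenDetach`, and the fake implementation are hypothetical names, not AutoSpotting's actual code or the AWS SDK. It also assumes that if the replacement fails we still detach, which falls back to the current behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Fleet is a hypothetical abstraction over the ASG/EC2 calls involved.
// It is NOT AutoSpotting's real interface, only an illustration.
type Fleet interface {
	LaunchReplacement() (id string, err error)
	WaitInService(id string) error
	Detach(id string) error
}

// errNoCapacity is a stand-in error used by the fake below.
var errNoCapacity = errors.New("insufficient capacity")

// ReplaceThenDetach launches a replacement and waits for it to be in
// service before detaching the interrupted spot instance. If the
// replacement fails, it still detaches (assumption: the spot instance is
// going away anyway, so the ASG should start its own replacement).
func ReplaceThenDetach(f Fleet, interrupted string) error {
	id, launchErr := f.LaunchReplacement()
	if launchErr == nil {
		launchErr = f.WaitInService(id)
	}
	if err := f.Detach(interrupted); err != nil {
		return fmt.Errorf("detach %s: %w", interrupted, err)
	}
	return launchErr
}

// fakeFleet records the order of calls so the sequencing can be checked.
type fakeFleet struct {
	calls     []string
	launchErr error
}

func (f *fakeFleet) LaunchReplacement() (string, error) {
	f.calls = append(f.calls, "launch")
	if f.launchErr != nil {
		return "", f.launchErr
	}
	return "i-replacement", nil
}

func (f *fakeFleet) WaitInService(id string) error {
	f.calls = append(f.calls, "wait:"+id)
	return nil
}

func (f *fakeFleet) Detach(id string) error {
	f.calls = append(f.calls, "detach:"+id)
	return nil
}

func main() {
	f := &fakeFleet{}
	err := ReplaceThenDetach(f, "i-spot")
	fmt.Println(f.calls, err) // [launch wait:i-replacement detach:i-spot] <nil>
}
```

The point of the sketch is only the call order: the detach happens after the replacement is in service, rather than before, so the ASG is not short an instance during the warning window.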

gabegorelick commented 4 years ago

> I guess the lifecycle hook could launch a new instance, but that seems like it would cause a lot of problems (e.g. ASG updates that launch a new instance then terminate the old one wouldn't work).

It may be possible to get this to work. I'm curious if anyone has done that.

cristim commented 4 years ago

Thanks for reporting this. I was actually contemplating launching a new spot instance when handling the termination event, with a fallback to an on-demand instance, even one of a different (potentially more expensive) instance type. That on-demand instance would then have to be the first one we attempt to replace with spot afterwards.

It should not be too hard to implement, but I would like to do this after merging the event-based replacement and porting the spot termination handling to the model used for handling the other events.

The benefit of this would be especially visible when handling ICE (insufficient capacity error) events, which have been reported a few times in the past.

gabegorelick commented 4 years ago

> I was actually contemplating launching a new spot instance when handling the termination event, with a fallback to an on-demand instance, even one of a different (potentially more expensive) instance type.

Why would we do that instead of launching an on-demand instance and then letting a subsequent invocation of AutoSpotting switch it out for a spot instance? Wouldn't that make it more likely that no instance is brought online in time? Or can we determine fairly quickly that the spot request has failed and then fall back to on-demand well within 2 minutes?

cristim commented 4 years ago

> Why would we do that instead of launching an on-demand instance and then letting a subsequent invocation of AutoSpotting switch it out for a spot-instance?

To reduce the period of diminished capacity (or downtime) and the instance churn.

> Wouldn't that make it more likely that no instance is brought online in time? Or can we fairly quickly determine that the spot request fails and then fallback to on-demand well within 2 minutes?

The RunInstances API call that we use to launch spot instances fails fairly quickly when there is insufficient capacity. If I remember correctly, it was a matter of seconds, so we have time to iterate over multiple instance types.
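
A rough sketch of that iteration, assuming each launch attempt returns quickly: try several spot instance types in order, then fall back to on-demand, all under a time budget that fits inside the 2 minute warning. The names `Candidate`, `LaunchFunc`, `LaunchWithFallback`, and `fakeLaunch` are hypothetical, the 90 second budget is an arbitrary assumption, and the launcher is a fake standing in for the real RunInstances call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Candidate is one launch option to try (hypothetical type).
type Candidate struct {
	InstanceType string
	Spot         bool
}

// LaunchFunc stands in for a RunInstances call; it is assumed to fail
// within seconds when there is no capacity.
type LaunchFunc func(ctx context.Context, c Candidate) (string, error)

var errInsufficientCapacity = errors.New("insufficient capacity")

// LaunchWithFallback tries each spot type in order and finally the
// on-demand type, stopping early if the time budget in ctx runs out.
func LaunchWithFallback(ctx context.Context, launch LaunchFunc, spotTypes []string, onDemandType string) (string, Candidate, error) {
	candidates := make([]Candidate, 0, len(spotTypes)+1)
	for _, t := range spotTypes {
		candidates = append(candidates, Candidate{InstanceType: t, Spot: true})
	}
	candidates = append(candidates, Candidate{InstanceType: onDemandType, Spot: false})

	var lastErr error
	for _, c := range candidates {
		if err := ctx.Err(); err != nil {
			return "", Candidate{}, fmt.Errorf("out of time before trying %s: %w", c.InstanceType, err)
		}
		id, err := launch(ctx, c)
		if err == nil {
			return id, c, nil
		}
		lastErr = err // remember the failure and move on to the next candidate
	}
	return "", Candidate{}, fmt.Errorf("all candidates failed: %w", lastErr)
}

// fakeLaunch pretends spot capacity exists only for m5.large, while
// on-demand launches always succeed.
func fakeLaunch(ctx context.Context, c Candidate) (string, error) {
	if c.Spot && c.InstanceType != "m5.large" {
		return "", errInsufficientCapacity
	}
	return "i-" + c.InstanceType, nil
}

// demo runs the fallback with a 90s budget (an assumed value).
func demo(spotTypes []string, onDemandType string) (string, Candidate, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
	defer cancel()
	return LaunchWithFallback(ctx, fakeLaunch, spotTypes, onDemandType)
}

func main() {
	id, c, err := demo([]string{"c5.large", "m5.large"}, "c5.large")
	fmt.Println(id, c.Spot, err) // i-m5.large true <nil>
}
```

With this shape, the cost of trying spot first is bounded by how fast each failed attempt returns, which is the property the approach depends on.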

gabegorelick commented 4 years ago

> The RunInstances API call that we use to launch spot instances fails fairly quickly when there is insufficient capacity. If I remember correctly, it was a matter of seconds, so we have time to iterate over multiple instance types.

Awesome! If that's the case, then your approach makes sense.

Thanks.

LeanerCloud / AutoSpotting

AutoSpotting should bring replacement instance online before detaching or terminating spot instance #401
