Open gabegorelick opened 3 years ago
Thanks for reporting this.
At the moment I don't have any time to look into this but by all means please try to test it and report back your findings, preferably in a pull request to update the documentation.
If you don't like the way AutoSpotting handles this, pull requests to change it for the better are always welcome 😁
> The main difference seems to be that AutoSpotting launches OnDemand instances and then tries to replace them with spot instances, while Capacity Rebalancing seems to only attempt to launch spot instances. In theory, it's possible that AutoSpotting can do a better job at launching an OnDemand instance than Capacity Rebalancing can do in finding a spot instance, but it seems like AWS's service should be pretty good at finding spare capacity (feel free to chime in if anyone has empirical data on this).
The OnDemand instances are not launched by AutoSpotting, but by the ASG itself. When the event arrives (regardless of whether it's a termination or rebalancing event, as they're handled the same way), AutoSpotting will currently either:
1) proactively detach the terminating Spot instance from the ASG and leave it running outside the ASG for up to 14 minutes (we have a 15min Lambda timeout), then terminate it if it wasn't already terminated by EC2 Spot. Spot will terminate the instance after 2 minutes if it was a termination notification, but rebalancing events may not always result in terminations, and that's why we terminate it ourselves.
Then the ASG will notice it is running with reduced capacity and will attempt to launch an OnDemand instance to recover the desired capacity. Within seconds of launch, this new OnDemand instance will be replaced by a new Spot instance and terminated, so the new Spot instance boots up inside the ASG.
or...
2) terminate the instance while it's still in the ASG, telling the ASG to replace it immediately with a new OnDemand instance, which AutoSpotting will then replace with a Spot instance in the same way as described at the end of option 1.
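To make the two paths easier to compare, here is a toy simulation of the sequence described above. It is only a sketch: the `Asg` class, `handle_event` and the log strings are made up for illustration and are not AutoSpotting's actual code, and no AWS calls are involved.

```python
from dataclasses import dataclass, field


@dataclass
class Asg:
    """Toy model of an Auto Scaling group: a desired size and a member list."""
    desired: int
    members: list = field(default_factory=list)

    def reconcile(self, log):
        # The ASG itself (not AutoSpotting) launches OnDemand instances
        # whenever it runs below its desired capacity.
        while len(self.members) < self.desired:
            self.members.append("on-demand")
            log.append("asg: launch on-demand instance")


def replace_on_demand_with_spot(asg, log):
    # AutoSpotting swaps each freshly launched OnDemand instance for Spot.
    for i, kind in enumerate(asg.members):
        if kind == "on-demand":
            asg.members[i] = "spot"
            log.append("autospotting: replace on-demand with spot")


def handle_event(asg, action):
    """Simulate one termination/rebalancing event for a Spot instance."""
    if action not in ("detach", "terminate"):
        raise ValueError(f"unknown action: {action!r}")
    log = []
    # In both paths the affected Spot instance leaves the group.
    asg.members.remove("spot")
    if action == "detach":
        log.append("autospotting: detach spot instance, keep it running outside the ASG")
    else:
        log.append("autospotting: terminate spot instance inside the ASG")
    asg.reconcile(log)
    replace_on_demand_with_spot(asg, log)
    if action == "detach":
        # Happens up to 14 minutes later, unless EC2 Spot got there first.
        log.append("autospotting: terminate detached instance if still running")
    return log


if __name__ == "__main__":
    for action in ("detach", "terminate"):
        print(action, handle_event(Asg(desired=2, members=["spot", "spot"]), action))
```

In both cases the group ends up with the same number of Spot instances; the difference is only whether the old instance keeps running outside the group while its replacement boots.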
The default behavior depends on whether the ASG has Lifecycle Hooks configured.
There is also a configuration flag that can enforce either of the above behaviors regardless of whether the ASG has Lifecycle Hooks or not, as you can see in the CloudFormation stack parameters:
TerminationNotificationAction:
  AllowedValues:
    - "auto"
    - "detach"
    - "terminate"
  Default: "auto"
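If you set this parameter from a script, a small helper can validate the value before it reaches CloudFormation. This is a minimal sketch: the helper name `termination_action_parameter` is hypothetical, and only the parameter name and allowed values come from the snippet above. The returned dict uses the `ParameterKey`/`ParameterValue` shape that boto3's CloudFormation `update_stack` call expects in its `Parameters` list.

```python
# Allowed values copied from the CloudFormation parameter definition above.
ALLOWED_ACTIONS = ("auto", "detach", "terminate")


def termination_action_parameter(action="auto"):
    """Build a CloudFormation parameter override for the flag (hypothetical helper)."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(
            f"TerminationNotificationAction must be one of {ALLOWED_ACTIONS}, got {action!r}"
        )
    return {
        "ParameterKey": "TerminationNotificationAction",
        "ParameterValue": action,
    }


if __name__ == "__main__":
    print(termination_action_parameter("detach"))
```

The stack name and the rest of the update call depend on how you deployed AutoSpotting, so they are left out here.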
> Are there any other differences between AutoSpotting and native autoscaling that should be documented?
Yes, with Capacity Rebalancing the ASG won't run any temporary OnDemand capacity. It will first attempt to launch the replacement Spot instance, and only terminates the instance that received the rebalancing event after the new Spot instance is ready and passes the EC2/ELB health checks.
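The launch-before-terminate ordering can be sketched like this. It is a toy model, not AWS's implementation: the `rebalance` function, the `is_healthy` callback and the instance IDs are made up for illustration, and keeping the flagged instance when the replacement is unhealthy is a simplifying assumption.

```python
def rebalance(members, flagged, is_healthy):
    """Launch a replacement first; drop the flagged instance only once it is healthy."""
    replacement = f"{flagged}-replacement"
    # Step 1: the replacement Spot instance joins the group.
    members = members + [replacement]
    # Step 2: the flagged instance is terminated only after the replacement
    # passes its health checks (assumption: otherwise it is left in place).
    if is_healthy(replacement):
        members = [m for m in members if m != flagged]
    return members


if __name__ == "__main__":
    print(rebalance(["i-a", "i-b"], "i-a", lambda _: True))
    print(rebalance(["i-a", "i-b"], "i-a", lambda _: False))
```

Note that the group briefly runs one instance above its previous size, which is the opposite of the AutoSpotting flow where capacity dips until the replacement boots.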
I've been working on a similar implementation in #475 but it's not ready yet. In addition, this will also fall back to OnDemand capacity, with fallback across instance types, if we fail to launch Spot across all the suitable Spot instance types from the AZ.
I'm looking for people who can help me test/refine #475 to get it merged.
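For anyone wanting to reason about the fallback order mentioned for #475, here is a rough sketch of one way to read it: try Spot across every suitable instance type first, then OnDemand across the same types. This is an assumption about the intended ordering, not the code in #475; `launch_with_fallback` and the `try_launch` callback are hypothetical names.

```python
def launch_with_fallback(instance_types, try_launch):
    """Return the first (market, instance_type) that launches, or None."""
    # Exhaust Spot across all suitable types before touching OnDemand.
    for market in ("spot", "on-demand"):
        for instance_type in instance_types:
            if try_launch(market, instance_type):
                return market, instance_type
    return None


if __name__ == "__main__":
    # Pretend Spot is unavailable everywhere and only one OnDemand type works.
    available = {("on-demand", "m5.large")}
    result = launch_with_fallback(
        ["c5.large", "m5.large"],
        lambda market, itype: (market, itype) in available,
    )
    print(result)
```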
GitHub issue
Issue type
Summary
As of https://github.com/AutoSpotting/AutoSpotting/issues/448, AutoSpotting now responds to instance rebalance recommendation notifications. But AWS also has a native solution called "EC2 Auto Scaling Capacity Rebalancing" that responds to these events in a similar manner to AutoSpotting. It would be nice to highlight the differences between these two solutions in AutoSpotting's docs.
How it works:
The main difference seems to be that AutoSpotting launches on-demand instances and then tries to replace them with spot instances, while Capacity Rebalancing seems to only attempt to launch spot instances. In theory, it's possible that AutoSpotting can do a better job at launching an on-demand instance than Capacity Rebalancing can do in finding a spot instance, but it seems like AWS's service should be pretty good at finding spare capacity (feel free to chime in if anyone has empirical data on this).
Are there any other differences between AutoSpotting and native autoscaling that should be documented?