Open cristim opened 7 years ago
Never mind, I misunderstood how the termination works. There is only a two-minute grace period before shutdown.
@deinspanjer I don't think I understand this fully, please explain a bit more.
The last hour's worth of costs is simply subtracted from the total cost of running that spot instance. This is a billing detail that we don't really need to care about.
The replacement instance will be launched only after the termination notification is received, which is 2 minutes before the outbid spot instance is terminated.
I don't think we need to store anything. We can implement it by firing another event that launches the function with some parameters, and everything should be handled by the logic implemented in the function's code.
Yep, this sounds exactly right, I was just working off of very wrong assumptions in my initial question. :)
I'm not able to contribute to this project at the moment, but I am very interested in it and if things get a little more sane at work, I would be happy to help out with pieces of this.
Considering the recent exchanges, I took the liberty of editing your issue to add a note about the run frequency, in case someone else implements it.
@xlr-8 I don't think we need to change or even consider the frequency. This could be implemented as another event generator for the Lambda function: the component running on the terminating instance would immediately call the Lambda function and tell it to detach the current instance and launch a new spot instance for that group.
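As a rough illustration of the on-instance component described above, here is a minimal Python sketch, not the project's implementation. It relies on the EC2 instance metadata path `spot/termination-time`, which returns 404 until a termination is scheduled. The function names, the polling interval, and the `notify` callback (standing in for whatever eventually calls the Lambda function) are assumptions made for this example. Instances that enforce IMDSv2 would also need a session token, which is omitted here.

```python
import urllib.error
import urllib.request

# Instance metadata path that returns 404 until a termination is scheduled.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url=TERMINATION_URL, timeout=1.0):
    """Return the scheduled termination timestamp, or None if there is none yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no termination notice has been issued
        raise


def watch(fetch, notify, sleep, interval=5.0, max_polls=None):
    """Poll until a termination time appears, then call notify exactly once.

    fetch, notify and sleep are injected so the loop can run without EC2;
    on a real instance they would be fetch_termination_time, a hypothetical
    function that triggers the Lambda, and time.sleep.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        when = fetch()
        if when is not None:
            notify(when)
            return when
        polls += 1
        sleep(interval)
    return None
```

On an instance this could be started as `watch(fetch_termination_time, notify, time.sleep)`, where `notify` is the hypothetical call into the trigger discussed in this thread.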
We would need to create a new trigger that can run the function. One option is a REST endpoint implemented with an API Gateway, so that we don't need any additional IAM permissions. An SNS topic might also work, but this needs to be investigated.
But at the end of the day, in many cases the new billing-per-second feature makes this additional complexity harder to justify. The additional cost savings would be relatively small; the main benefit is that we'd have fewer workload transitions before the group converges back to a spot configuration.
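To make the two trigger options more concrete, here is a hedged sketch of how the function could read the terminating instance's ID from either kind of event. The SNS envelope (`Records[].Sns.Message`) and the API Gateway proxy `body` field are the standard Lambda event shapes; the JSON payload with an `instance_id` field is purely an assumption for illustration, since nothing in this thread defines the message format.

```python
import json


def extract_instance_id(event):
    """Return the instance ID carried by an SNS or API Gateway proxy event."""
    records = event.get("Records")
    if records:
        # SNS delivers the published message as a string under Records[].Sns.Message.
        raw = records[0]["Sns"]["Message"]
    else:
        # API Gateway proxy integration passes the request body as a string.
        raw = event.get("body") or "{}"
    payload = json.loads(raw)
    # "instance_id" is a hypothetical field name chosen for this sketch.
    instance_id = payload.get("instance_id")
    if not instance_id:
        raise ValueError("event does not carry an instance_id")
    return instance_id
```

Keeping the parsing in one small function would let the same detach-and-replace logic sit behind either trigger while the choice between them is still being investigated.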
This is now relatively easy to do using the infrastructure we already have in place to listen for instance terminations. I'll work on this next.
Once done, it should also help with #156, #284, #332 and #343.
Feature idea
There should be an option to handle the spot termination signal from an agent component running on the spot instances, and to use that signal to replace the terminating instance with another spot instance without running any temporary on-demand instances.
The terminated spot instance would be decoupled from the group and a new spot instance would be launched and added to the group to compensate for the drop in capacity.
This should be configurable on a per-group level using a dedicated tag.
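A per-group opt-in check might look like the sketch below. The tag name is hypothetical, since the issue does not specify one, and the input is assumed to be a list of `{"Key": ..., "Value": ...}` dictionaries, which is the shape in which Auto Scaling group tags are commonly returned by the AWS API.

```python
# Hypothetical tag name; the issue does not define the actual key.
TERMINATION_TAG = "autospotting_handle_spot_termination"


def handles_termination(tags, tag_key=TERMINATION_TAG):
    """Return True if the group's tags opt in to spot termination handling."""
    for tag in tags:
        if tag.get("Key") == tag_key:
            # Treat only an explicit "true" (any case) as enabled.
            return tag.get("Value", "").strip().lower() == "true"
    return False  # groups without the tag keep the current behaviour
```

Defaulting to `False` keeps the feature strictly opt-in, which matches the idea of enabling it per group.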
Note: One thing to take into consideration is that the termination notice arrives 2 minutes before the termination. This means autospotting has to run at least every two minutes, in which case it might still miss some notices, or preferably every minute. The CloudFormation setup already does this, but the Terraform one does not.
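The timing argument in this note can be checked with a small calculation. This is only a sketch under a simplifying assumption: the notice appears immediately after a scheduled run, so it is first seen one full interval later. The function name is made up for this example.

```python
NOTICE_SECONDS = 120  # the termination notice arrives 2 minutes ahead


def worst_case_seconds_left(run_interval_seconds, notice_seconds=NOTICE_SECONDS):
    """Seconds left to react if the notice appears right after a scheduled run."""
    return max(0, notice_seconds - run_interval_seconds)
```

Under that assumption, a one-minute schedule leaves about a minute to react in the worst case, while a two-minute schedule can leave no time at all, which is why it may miss some notices.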