Open vlerenc opened 6 years ago
GKE enabled support for Pre-emptible VMs: https://cloud.google.com/kubernetes-engine/docs/concepts/preemptible-vm
Yes, I saw that quite some time ago. That's why I said in one of our syncs that we won't be the first anymore. It really does make a lot of sense, too. On the other hand, our priorities are right. We know we'd like to have it eventually, but we can't do everything at the same time.
Funny, today I even saw this (thanks @afritzler): https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-global-availability.html
Maybe we can beat GKE with preemptible TPUs in Kubernetes clusters then? ;-) Just kidding, but TPU support is definitely also interesting and somewhat different from how AWS handles GPU support (that already works, because MCM doesn't care, but TPUs must be assigned, as @afritzler and @rfranzke told me a couple of days ago).
Once we have the time to work on this one (GKE and others support that, too - just saw it with Banzai as well), we might leverage this here: org:banzaicloud repo:spot-price-exporter.
@vlerenc Are there any updates regarding node pools with hotspot instances? We are looking forward to this!
Best, Samed
No, no update. So far, nobody has even contacted us with a concrete need. You are the first. Most workloads can't cope with that kind of infrastructure. Can you elaborate on your use case a bit?
cc @hardikdr @prashanth26 @amshuman-kr @juergenschneider
We are planning to have node pools with hotspot instances
Best, Samed
We have a similar scenario. We need to dynamically create a pool of preemptible nodes for running partitioned CPU/GPU computing tasks (each task may last from 10 minutes to hours), and we expect some management, such as provisioning a new node once one is reclaimed.
This is currently enabled on AWS with https://github.com/gardener/machine-controller-manager/pull/481. However, integration with Gardener is yet to be done.
Hi, we have a similar scenario: we would like to run a stage/test cluster using AWS Spot Fleet to reduce costs. Is there any plan to deliver it in the near future? Thanks in advance for your response :)
It is certainly part of the roadmap, but we have a couple of other things (like improving draining of pods across cloud providers) in the plan before we can get to it.
Is there any update on the ability to support 'spot' instances on Azure and GCP? It would be useful for cost savings.
There were quite some updates: E.g. AWS, Azure, and GCP now all support spot instances with dynamic prices (Azure and GCP deprecated their old models in favour of the new ones that are all called spot VMs). GCP doesn't support a threshold though, which is less than optimal (you can always look up the price though and act accordingly). Grace periods vary (AWS 120s, Azure and GCP 30s), but all notify and we could use that for immediate drain.
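To make the "immediate drain" idea concrete, here is a minimal sketch of how the drain budget could be derived from the grace periods quoted above. The function name `drain_deadline` and the `safety_margin_seconds` parameter are hypothetical and not part of MCM or Gardener; the grace period values are simply the ones mentioned in this comment and may change.

```python
from datetime import datetime, timedelta, timezone

# Grace periods in seconds as quoted in this thread (assumption: still current).
GRACE_PERIOD_SECONDS = {"aws": 120, "azure": 30, "gcp": 30}


def drain_deadline(provider, notified_at, safety_margin_seconds=5):
    """Return the latest time by which a node drain should have finished.

    Hypothetical helper: subtracts a small safety margin from the provider's
    grace period so the drain ends before the VM is actually reclaimed.
    """
    grace = GRACE_PERIOD_SECONDS[provider]
    budget = max(grace - safety_margin_seconds, 0)
    return notified_at + timedelta(seconds=budget)


notified = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
aws_deadline = drain_deadline("aws", notified)
gcp_deadline = drain_deadline("gcp", notified)
```

With only 30s on Azure and GCP, pod termination grace periods would probably have to be overridden during such a drain, which is a design decision this sketch leaves open.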
I also looked up auto-scaling groups: Now they all support multiple zones, but only AWS and Azure support mixing on-demand and spot instances. AWS' feature seems strange though, because, unlike with Azure and GCP, the spot price may even exceed the on-demand price. When the user sets a limit, e.g. at the regular on-demand price, AWS won't add capacity and you are left with the on-demand baseline, but Azure fulfils the request, capped at the on-demand price, so you still get your machines. That's at least how I understood the docs.
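The difference described above can be illustrated with a toy model. It only encodes the reading of the docs given in this comment, not verified provider behaviour, and every name in it (`fulfilled_capacity` and its parameters) is hypothetical.

```python
def fulfilled_capacity(provider, desired, on_demand_base,
                       spot_price, on_demand_price, max_price):
    """Toy model of a mixed on-demand/spot group under a user price limit.

    Assumption (from this thread only): on AWS the spot price may exceed the
    limit, in which case no spot capacity is added; on Azure the spot price
    is capped at the on-demand price.
    """
    spot_wanted = max(desired - on_demand_base, 0)
    if provider == "aws":
        spot_granted = spot_wanted if spot_price <= max_price else 0
    elif provider == "azure":
        effective_price = min(spot_price, on_demand_price)
        spot_granted = spot_wanted if effective_price <= max_price else 0
    else:
        raise ValueError(f"unsupported provider: {provider}")
    return on_demand_base + spot_granted


# Spot price spikes above on-demand; the user limit sits at the on-demand price.
aws_nodes = fulfilled_capacity("aws", 10, 2, 1.2, 1.0, 1.0)
azure_nodes = fulfilled_capacity("azure", 10, 2, 1.2, 1.0, 1.0)
```

In this scenario the AWS group stays at its on-demand baseline of 2 nodes, while the Azure group reaches the desired 10.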
Rebalancing is another open point, e.g. never, always, grace_period and/or cost_gap maybe?
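One possible reading of these four options, as a sketch only: the thread does not define their semantics, so the interpretation below (rebalancing meaning a move from fallback capacity back to spot) and all parameter names are assumptions.

```python
def should_rebalance(policy, seconds_on_fallback=0.0, fallback_price=0.0,
                     spot_price=0.0, grace_period_seconds=600.0,
                     min_cost_gap=0.0):
    """Decide whether to move a node from fallback capacity back to spot.

    Assumed semantics (not defined in this issue):
      never        - stay on the fallback capacity
      always       - move back as soon as spot is available
      grace_period - move back once the node has run long enough on fallback
      cost_gap     - move back once the price difference is large enough
    """
    if policy == "never":
        return False
    if policy == "always":
        return True
    if policy == "grace_period":
        return seconds_on_fallback >= grace_period_seconds
    if policy == "cost_gap":
        return (fallback_price - spot_price) >= min_cost_gap
    raise ValueError(f"unknown rebalancing policy: {policy}")


decision_never = should_rebalance("never")
decision_grace = should_rebalance("grace_period", seconds_on_fallback=700)
decision_gap = should_rebalance("cost_gap", fallback_price=1.0,
                                spot_price=0.4, min_cost_gap=0.5)
```

The "and/or" in the comment suggests that grace_period and cost_gap might also be combined; that combination is left out here.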
Stories
Motivation
Money, sure, but also some form of chaos monkey that should help train the application developers that all resources will eventually fail.
Acceptance Criteria
spot bid/price, most likely)
Remarks
Looks like Bosh had the same idea (well, everybody can if they have cattle VMs).
Enhancement/Implementation Proposal (optional)
Ideally, link to EP, e.g. a GEP in Gardener (https://github.com/gardener/gardener/tree/master/docs/proposals), alternatively prose here.
Challenges