Dozens of small server pods (e.g. 2 CPU each) are currently stranded on oversized host instances (e.g. 96 CPU). The resulting underutilisation of these large instances adds an avoidable cost overhead of roughly 50%.
The issue arises whenever, in the 5-10 minutes between a privileged user finishing with a large server and the cluster shutting down the corresponding EC2 instance, a normal small server pod gets scheduled onto the vacated host. Further new pods can then keep landing on that host, blocking its release for weeks.
The solution is to taint the larger hosts with a taint that standard pods do not tolerate, or at least to label hosts consistently and give singleuser pods a strong node affinity for appropriately sized nodes.
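As a rough sketch of the taint-based option (the taint key `node-size.example.com/large`, the node label, the pod name, and the image are illustrative assumptions, not existing cluster conventions): the large node group carries a `NoSchedule` taint, and only the privileged users' server pods carry the matching toleration, so ordinary 2 CPU singleuser pods can never be scheduled onto a vacated large host.

```yaml
# Hypothetical taint applied to the large (96 CPU) node group, e.g. via
#   kubectl taint nodes <node> node-size.example.com/large=true:NoSchedule
# or the node group definition. All key/value names below are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: privileged-user-server            # example spec for a privileged user's server pod
spec:
  tolerations:
    - key: node-size.example.com/large    # only privileged pods carry this toleration,
      operator: Equal                     # so standard singleuser pods are repelled
      value: "true"
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-size.example.com/size   # hypothetical node label for the fallback
                operator: In                      # labelling/affinity approach
                values: ["large"]
  containers:
    - name: notebook
      image: jupyter/base-notebook        # placeholder image
      resources:
        requests:
          cpu: "90"
```

With something like this in place, the vacated 96 CPU host stays empty once the privileged pod terminates, and the cluster can reclaim the EC2 instance within its normal scale-down window instead of weeks later.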