soulshake opened this issue 3 years ago
Bump? This seems even more important given that processes don't always seem to be properly terminated when they've finished or crashed. (I think @ericphanson and @haberdashPI encountered this recently, and I've observed other apparent cases as well.)
FYI my case was unrelated to `safe-to-evict` (or `julia_pod`; I wasn't using it), and was an issue in how I was running the code (I wasn't releasing workers when they were done with their work, ref https://github.com/beacon-biosignals/K8sClusterManagers.jl/issues/87).
Huh, I guess I don't have a very clear idea about where K8sClusterManagers.jl ends and `julia_pod` begins.
IIUC: if `safe-to-evict` is not set, then one pod that's not using many resources might still get evicted during scale-down in some cases, like if some CPU usage threshold isn't being met... but I'm not 100% sure on the details. On the other hand, if it's set to `false`, then it will prevent a scale-down, even if the pod isn't doing anything.
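For reference, here's a minimal sketch of where the annotation lives in a pod spec (the pod name and image are made up; only the annotation key is real):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: julia-worker          # hypothetical pod name
  annotations:
    # "false" tells the cluster-autoscaler it must not evict this pod,
    # which can block scale-down of the whole node it runs on.
    # "true" (or leaving the annotation unset) allows eviction.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  containers:
    - name: julia
      image: julia:1.9        # hypothetical image
```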
`julia_pod` is basically a big shell script to build a docker image, spin up a pod, and sync logs and files between the two, so you can work interactively on a pod. It doesn't require a local Julia install (it dumps you into a Julia session on the pod you spin up) and doesn't use K8sClusterManagers itself, but it installs the package on the pod for you to use there if you want.
K8sClusterManagers lets you spin up more pods from a pod in a Julia-friendly way (i.e. using the Distributed stdlib, which was made for managing remote workers etc., though often on a university HPC cluster rather than a cloud or k8s cluster). It can be used interactively or non-interactively, from a `julia_pod`-powered session or from any other Julia session that happens to be running on a k8s pod. I often spin up a "manager" pod just using `kubectl create ...` (no `julia_pod`) and have that run a script that uses K8sClusterManagers to spin up more pods and send them work to do, etc.
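A manager-pod script along those lines might look something like this sketch (the worker count and the exact `K8sClusterManager` constructor call are illustrative; check the package README for the real keywords):

```julia
using Distributed, K8sClusterManagers

# Ask the manager for 2 worker pods (illustrative call; see the
# K8sClusterManagers.jl README for the actual constructor signature).
addprocs(K8sClusterManager(2))

# Farm out work using the Distributed stdlib.
results = pmap(x -> x^2, 1:100)

# Release the workers when done, so their pods terminate and the
# autoscaler can reclaim the nodes (this was the gotcha in issue #87).
rmprocs(workers())
```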
As a followup to https://github.com/beacon-biosignals/julia_pod/pull/12 I'd like to suggest that the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation be set to `"true"` by default (or unset), while giving the user the possibility to override it. This is because:

As always, this choice is a trade-off worth discussing. If the monetary savings aren't worth the consequences of pods occasionally being evicted, then no change is needed. Unfortunately, I don't know of a straightforward way to get numbers that would tell us exactly how much compute is being wasted, but I can say that if people are setting `safe-to-evict` to `"false"` by default, then we will probably need to switch to smaller instance types on the projects cluster, so that a single pod can't cause a huge metal instance to run for days+, for example. (I regularly see stale pods hanging around, so I'm concerned that this could be :money_with_wings: :money_with_wings: :money_with_wings:)

(See also here)