This PR adds a toleration to the Kubernetes job submission process. The toleration lets jobs schedule onto the tainted large nodes in the `ml-cluster-dev` Kubernetes cluster. The taint keeps specific `kube-system` pods off the large nodes so that those nodes can spin down properly once jobs complete. Previously, some of those pods would stay up indefinitely on the large nodes, which blocked spin-down and was needlessly expensive.
I considered formalizing this as a configuration argument in `KubernetesConfig` that would let the user specify multiple tolerations. Since we currently have only one simulation cluster, and that cluster (along with its taint) will change relatively infrequently, I don't think it's necessary yet.
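
For reviewers less familiar with tolerations, here is a minimal sketch of what adding one to a Job manifest looks like. It is illustrative only and is not the code in this PR: the taint key, value, and effect (`dedicated=large-node:NoSchedule`), the `add_toleration` helper, and the sample job are all assumptions, since the actual taint on the large nodes is not spelled out here.

```python
import copy

# Hypothetical toleration; the real key/value/effect must match the taint
# actually applied to the large nodes, which this description does not state.
LARGE_NODE_TOLERATION = {
    "key": "dedicated",
    "operator": "Equal",
    "value": "large-node",
    "effect": "NoSchedule",
}


def add_toleration(job_manifest, toleration=LARGE_NODE_TOLERATION):
    """Return a copy of a batch/v1 Job manifest with the toleration in its pod spec."""
    patched = copy.deepcopy(job_manifest)
    pod_spec = patched["spec"]["template"]["spec"]
    tolerations = pod_spec.setdefault("tolerations", [])
    # Avoid appending the same toleration twice if the job is patched again.
    if toleration not in tolerations:
        tolerations.append(dict(toleration))
    return patched


# Hypothetical job used only to exercise the helper.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-simulation"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "sim", "image": "example/sim:latest"}],
            }
        }
    },
}

patched_job = add_toleration(example_job)
```

Pods without a matching toleration (such as the `kube-system` pods mentioned above) are not scheduled onto nodes carrying a `NoSchedule` taint, while jobs patched this way are allowed to land there. If multiple tolerations are ever needed, the same helper could take a list read from `KubernetesConfig`.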