ai2cm / fv3config

Manipulate FV3GFS run directories
Apache License 2.0

Add node pool taint toleration to kubernetes submission #43

Closed frodre closed 4 years ago

frodre commented 4 years ago

This PR adds a toleration to the Kubernetes job submission process. The toleration is keyed so that jobs can run on the large nodes of the ml-cluster-dev Kubernetes cluster. Those nodes are tainted to keep specific kube-system pods off them, so that the nodes spin down properly once jobs complete. Previously, some of those pods would stay up indefinitely, which was needlessly expensive.
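As a rough illustration of the mechanism (not the code in this PR), the sketch below appends a toleration to the pod spec of a Job manifest represented as a plain dictionary. The field names (`key`, `operator`, `value`, `effect`) follow the Kubernetes toleration schema, but the taint key and value, the job name, the image, and the `add_toleration` helper are all hypothetical, since the description above does not state the actual ones.

```python
import copy

# Hypothetical taint key/value: the PR description does not give the real ones.
LARGE_NODE_TOLERATION = {
    "key": "dedicated",
    "operator": "Equal",
    "value": "large-sim-pool",
    "effect": "NoSchedule",
}


def add_toleration(job_manifest, toleration):
    """Return a copy of a batch/v1 Job manifest with the toleration appended.

    The input manifest is left unmodified, and the toleration is only added
    if an identical one is not already present.
    """
    manifest = copy.deepcopy(job_manifest)
    pod_spec = manifest["spec"]["template"]["spec"]
    tolerations = pod_spec.setdefault("tolerations", [])
    if toleration not in tolerations:
        tolerations.append(dict(toleration))
    return manifest


# Illustrative Job manifest; name and image are placeholders.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "fv3gfs-run"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "fv3", "image": "example/fv3gfs:latest"}],
                "restartPolicy": "Never",
            }
        }
    },
}

tolerant_job = add_toleration(job, LARGE_NODE_TOLERATION)
```

A toleration only permits scheduling onto the tainted nodes; it does not force pods there. Pods without a matching toleration, such as the kube-system pods mentioned above, are kept off nodes tainted with `NoSchedule`.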

I considered formalizing this as a configuration argument in KubernetesConfig that would let the user specify multiple tolerations. Since we currently have only one simulation cluster, and that cluster (with its taint) will change relatively infrequently, I don't think it's necessary yet.

frodre commented 4 years ago

Update Terraform documentation and automate VM kubectl configuration for proxy cluster access.