bdh-generalization / requirements


Support for running nodes on Kubernetes #6

Open jvsoest opened 1 year ago

jvsoest commented 1 year ago

This means having the node start pods/deployments itself, which is difficult with mounts (volume claims). Alternatively, a container/deployment/pod could run as a service for the whole duration of the learning job, with REST calls between the node and the job container instead of volume mounts.
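To illustrate the second option: the node could build a pod manifest for the job container and submit it through the Kubernetes API (e.g. with the official `kubernetes` Python client's `create_namespaced_pod`), then talk to the pod over REST. A minimal sketch, assuming a REST-based job container; the image name, labels, and port here are illustrative, not part of the actual design:

```python
import json


def build_job_pod_manifest(job_id: str, image: str) -> dict:
    """Build a pod manifest for a learning-job container.

    Instead of sharing data through volume claims, the pod exposes a
    REST endpoint (port 8080 here) that the node calls directly.
    All names (labels, port, image) are illustrative assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"learning-job-{job_id}",
            "labels": {"app": "node-job", "job-id": job_id},
        },
        "spec": {
            # Run once per task; the node deletes the pod when the job is done.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "job",
                    "image": image,
                    # REST port the node talks to, replacing volume mounts.
                    "ports": [{"containerPort": 8080}],
                }
            ],
        },
    }


manifest = build_job_pod_manifest("42", "example.org/algorithm:latest")
print(json.dumps(manifest, indent=2))
```

The node would submit this manifest once per task and tear the pod down afterwards, so no persistent volume claim is needed.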

dsmits commented 1 year ago

I guess this ties into the issue that a node needs to be online all the time, right? You would like to use computational resources just for the duration of the task and then shut them down again.

jvsoest commented 12 months ago

Could be, but it is related to #7 as it would make deployment easier on various platforms. For example, UM has a Data Science Research Infrastructure (DSRI), which is a Kubernetes (OpenShift) cluster. Running multiple nodes for various projects in such a cluster would be beneficial.

Indeed, some cluster users would then also get a GPU assigned for a designated time window. However, that does not mean GPU resources need to be added/removed dynamically; I would open a separate feature request for that if it turns out to be needed.