dhmlops / mlops

ClearML. Configure ClearML queues to run on a fixed set of nodes so that large queues will not be starved of nodes to execute on #6

Closed: mantaphytoplankton closed this issue 2 years ago

mantaphytoplankton commented 2 years ago

Currently, ClearML queues can run on any of the K8S nodes. As a result, pods from the 4-GPU queues cannot start a job on any node when every node already has at least one GPU in use.
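To illustrate the starvation described above, here is a minimal sketch. It assumes three worker nodes with 4 GPUs each (the node count and GPU count are illustrative assumptions, not taken from the cluster), and `can_schedule` is a hypothetical helper, not part of ClearML or Kubernetes.

```python
def can_schedule(free_gpus_per_node, requested):
    """Return the index of the first node with enough free GPUs, or None."""
    for index, free_gpus in enumerate(free_gpus_per_node):
        if free_gpus >= requested:
            return index
    return None


# Assumption: three worker nodes with 4 GPUs each (sizes are illustrative).
GPUS_PER_NODE = 4
free = [GPUS_PER_NODE] * 3

# One 1-GPU job lands on each node, as can happen when queues share all nodes.
for node in range(len(free)):
    free[node] -= 1

print(sum(free))              # 9 GPUs are still free cluster-wide
print(can_schedule(free, 4))  # None: no single node can host a 4-GPU pod
print(can_schedule(free, 2))  # 0: smaller jobs still fit on the first node
```

Although most of the GPUs are idle, the 4-GPU pod stays pending because a pod must fit on a single node.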

As @jax79sg proposed, we need to configure the affinity of ClearML queues to specific worker nodes.

Discuss configuration for affinity of pods to nodes.
Label K8S worker nodes.
Configure nodeSelector/node affinity and redeploy ClearML glues.
Test affinity.
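The steps above could be sketched roughly as follows. The label key `clearml-queue-group`, the group names, and the node names are all hypothetical placeholders; the real mapping is in the configuration spreadsheet linked below. How the pod template is handed to the ClearML glue depends on the deployment and is not shown here.

```python
import json

# Hypothetical label key, group names and node names (assumptions only).
LABEL_KEY = "clearml-queue-group"
NODE_GROUPS = {
    "4gpu": ["worker-1", "worker-2"],
    "small": ["worker-3", "worker-4"],
}


def label_commands(node_groups, key=LABEL_KEY):
    """Build one `kubectl label nodes` command per node."""
    return [
        f"kubectl label nodes {node} {key}={group} --overwrite"
        for group, nodes in node_groups.items()
        for node in nodes
    ]


def pod_template(group, key=LABEL_KEY):
    """Pod spec fragment that pins pods to nodes labelled with `group`."""
    return {"spec": {"nodeSelector": {key: group}}}


def misplaced_pods(placements, node_groups):
    """Return pods running on a node outside their group.

    `placements` maps pod name -> (group, node name); useful for the
    "test affinity" step once the observed placements are collected.
    """
    return [
        pod
        for pod, (group, node) in placements.items()
        if node not in node_groups[group]
    ]


for command in label_commands(NODE_GROUPS):
    print(command)
print(json.dumps(pod_template("4gpu"), indent=2))
```

A plain `nodeSelector` is used here because it is the simplest way to pin pods to labelled nodes; node affinity rules would be needed only if softer preferences or more complex matching are wanted.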

Configuration: https://docs.google.com/spreadsheets/d/1DESbljncKSuIzZ0osirxbLh3tcXklWcuoHpWG2Ly00Q/edit?usp=sharing

jax79sg commented 2 years ago

Please be advised that 4GPU node affinity has been implemented.

4-gpu jobs will now run exclusively on 5 worker nodes.
The rest of the worker nodes will handle the 1-gpu and 2-gpu jobs.