determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0
2.92k stars 348 forks

💡[feat] Automatically release resources on a timeout or when GPU utilization falls below a certain threshold #9555

Open KyanChen opened 2 weeks ago

KyanChen commented 2 weeks ago

Describe the problem

Please implement functionality, in both the command prompt (cmd) and the shell environment, that automatically releases resources when a timeout is reached or when GPU utilization is too low.

Describe the solution you'd like

Please implement functionality, in both the command prompt (cmd) and the shell environment, that automatically releases resources when a timeout is reached or when GPU utilization is too low.

Describe alternatives you've considered

No response

Additional context

No response

ioga commented 1 week ago

Hello,

We do not have this as a built-in feature. One of our engineers developed an unofficial script that can terminate a process if it is not utilizing the GPU. You can integrate it into your CMDs to do what you want. Doing that with interactive shells or notebooks would probably be a bit trickier.
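For illustration, here is a minimal sketch of what wrapping a command in an idle-GPU watchdog could look like. It is not the unofficial script mentioned above. It assumes an NVIDIA node with `nvidia-smi` on the PATH, and the function names (`read_gpu_utilization`, `run_with_idle_watchdog`) and all default thresholds are hypothetical.

```python
import subprocess
import time


def read_gpu_utilization():
    """Return per-GPU utilization percentages reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    values = []
    for line in out.splitlines():
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue  # skip blank lines and non-numeric readings
    return values


def run_with_idle_watchdog(cmd, threshold=5.0, idle_limit=600.0, interval=10.0,
                           sample=read_gpu_utilization):
    """Run cmd and terminate it once every GPU has stayed below
    `threshold` percent for at least `idle_limit` seconds.

    All defaults are illustrative assumptions, not recommended values.
    """
    proc = subprocess.Popen(cmd)
    idle_since = None
    while proc.poll() is None:
        time.sleep(interval)
        # The busiest GPU decides; no readings at all counts as idle.
        busy = max(sample(), default=0.0) >= threshold
        now = time.monotonic()
        if busy:
            idle_since = None
        elif idle_since is None:
            idle_since = now
        elif now - idle_since >= idle_limit:
            proc.terminate()  # ask politely first
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()  # force it if the process ignores the request
                proc.wait()
            break
    return proc.returncode
```

A command entrypoint could then call something like `run_with_idle_watchdog(["python", "train.py"])`, so that the task exits and its resources are released when the process is stopped. The `sample` parameter exists only so the GPU reading can be swapped out, for example in tests.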

KyanChen commented 1 week ago

Can you offer an implementation?

KyanChen commented 1 week ago

I have developed a Python script.
The master node runs: https://github.com/KyanChen/GPUClusterConfig/blob/dev/gpu_parser.py
The worker node runs: https://github.com/KyanChen/GPUClusterConfig/blob/dev/gpu_monitor.py

ioga commented 1 week ago

I cannot offer a reference implementation besides the script I've already shared. Your code looks fine at a quick glance. It does not seem like it is going to handle the case where you have multiple GPUs per node. But hey, if it works for you, I see no problem with you using it.
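One possible way to cover the multi-GPU case is to track idleness per GPU index rather than per node. The sketch below is illustrative only and is not taken from either script linked in this thread. It assumes readings in the format produced by `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`; the names `parse_utilization` and `PerGpuIdleTracker`, the threshold, and the window size are hypothetical.

```python
from collections import defaultdict, deque


def parse_utilization(csv_text):
    """Parse lines like "0, 97" into {gpu_index: utilization_percent}."""
    readings = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            readings[int(parts[0])] = float(parts[1])
        except ValueError:
            continue  # skip non-numeric readings
    return readings


class PerGpuIdleTracker:
    """Keep the last `window` samples per GPU and report which GPUs
    stayed below `threshold` percent for the whole window."""

    def __init__(self, threshold=5.0, window=3):
        self.threshold = threshold
        self.window = window
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, readings):
        """Record one round of readings and return the idle GPU indices."""
        for index, util in readings.items():
            self.history[index].append(util)
        return sorted(
            index for index, samples in self.history.items()
            if len(samples) == self.window
            and all(u < self.threshold for u in samples)
        )
```

The monitor on each worker would call `update` once per polling interval and act only on the returned indices, so a busy GPU on the same node does not hide an idle one. Mapping an idle GPU index back to the task that holds it is left out here, since that depends on how the cluster assigns devices.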