allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Dynamic GPU/Queue Allocation for Workers in ClearML #1283

Open jabuarab opened 3 weeks ago

jabuarab commented 3 weeks ago

Description:

I am looking for guidance on:

Whether it's possible to configure multiple workers on the same machine with specific GPU assignments.
How to set up conditional queues that check GPU availability.
Suggestions or best practices for implementing a service task that manages GPU allocation dynamically.

I am currently using ClearML on a machine with two GPUs. I have configured a single worker that utilizes both GPUs, with one queue assigned specifically to this worker. However, I want to improve the flexibility of resource allocation and task management. Specifically, I would like to achieve the following setup:

Three Workers on the Same Machine:

One worker assigned to GPU 0.
One worker assigned to GPU 1.
One worker that can utilize both GPUs.
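For reference, this is roughly how I imagine the three workers could be started on the same machine (the queue names gpu0_queue, gpu1_queue and dual_gpu_queue are placeholders I would create first; the equivalent clearml-agent daemon commands could of course also be run directly from a shell):

```python
# Minimal sketch: start three ClearML agents on one machine, each pinned to a
# different GPU set via --gpus. The queue names are placeholders, not defaults.
import subprocess

AGENTS = [
    ("gpu0_queue", "0"),        # worker that only sees GPU 0
    ("gpu1_queue", "1"),        # worker that only sees GPU 1
    ("dual_gpu_queue", "0,1"),  # worker that can use both GPUs
]

for queue, gpus in AGENTS:
    subprocess.run(
        ["clearml-agent", "daemon",
         "--queue", queue,
         "--gpus", gpus,
         "--detached"],          # run each agent in the background
        check=True,
    )
```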

Conditional Queues:

Use queues that can check GPU availability, or whether a task is already running on the GPUs, before enqueuing new tasks.

Additionally, I have considered creating a service task that checks GPU availability before enqueuing tasks into the appropriate queues, potentially managing this with a fourth queue.
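This is a rough sketch of the service-task idea, assuming the tasks to be dispatched are created as drafts and tagged "dispatch" (the project name, tag, queue names, polling interval and 1 GiB memory threshold are all assumptions on my side), with GPU availability approximated from nvidia-smi memory usage:

```python
# Hypothetical dispatcher service task. "DevOps", the "dispatch" tag, the
# queue names and the memory threshold are placeholder assumptions.
import subprocess
import time

from clearml import Task

PROJECT = "DevOps"
GPU0_QUEUE, GPU1_QUEUE, DUAL_QUEUE = "gpu0_queue", "gpu1_queue", "dual_gpu_queue"


def free_gpus(threshold_mb=1024):
    """Return indices of GPUs whose used memory is below threshold_mb (treated as idle)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    idle = set()
    for line in out.strip().splitlines():
        index, mem_used = (int(v.strip()) for v in line.split(","))
        if mem_used < threshold_mb:
            idle.add(index)
    return idle


def pick_queue(idle):
    """Route to the dual-GPU queue only when both GPUs look idle."""
    if idle >= {0, 1}:
        return DUAL_QUEUE
    if 0 in idle:
        return GPU0_QUEUE
    if 1 in idle:
        return GPU1_QUEUE
    return None  # both GPUs busy: try again on the next poll


def main():
    # Register the dispatcher itself as a service task.
    Task.init(project_name=PROJECT, task_name="gpu dispatcher",
              task_type=Task.TaskTypes.service)
    while True:
        # Draft tasks tagged "dispatch" are the ones waiting for a GPU.
        for task_id in Task.query_tasks(project_name=PROJECT, tags=["dispatch"]):
            task = Task.get_task(task_id=task_id)
            if task.get_status() != "created":
                continue  # already queued, running, or finished
            target = pick_queue(free_gpus())
            if target:
                Task.enqueue(task, queue_name=target)
        time.sleep(30)


if __name__ == "__main__":
    main()
```

What this sketch does not yet handle is preventing the single-GPU workers from picking up new work while a dual-GPU task is waiting.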

Steps to Reproduce

1. Set up a machine with 2 GPUs.
2. Install and configure ClearML with one worker that uses both GPUs.
3. Attempt to create three separate workers with the described GPU assignments.
4. Explore the possibility of setting up conditional queues or a service task to manage GPU availability.

Expected Behavior

The system should be able to dynamically assign tasks to the appropriate worker based on GPU availability:

If GPU 0 is free, assign the task to the worker using GPU 0.
If GPU 1 is free, assign the task to the worker using GPU 1.
If both GPUs are free, assign the task to the worker using both GPUs.
If a task is enqueued on the both-GPUs queue while the other workers are running tasks, wait for them to finish and make sure they don't start a new task.
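For that last rule, the only building block I can think of is polling the workers API, roughly as below; the worker-ID substrings and the presence of a task attribute on busy workers are assumptions about my setup and about the workers.get_all response, so I may well be missing a better way:

```python
# Sketch: check whether the per-GPU workers are currently executing anything.
# The worker-id substrings ("gpu0", "gpu1") and the `task` attribute on busy
# workers are assumptions, not something I have confirmed against the API docs.
from clearml.backend_api.session.client import APIClient


def single_gpu_workers_idle(id_substrings=("gpu0", "gpu1")) -> bool:
    client = APIClient()
    for worker in client.workers.get_all():
        if any(s in worker.id for s in id_substrings):
            if getattr(worker, "task", None):  # worker is busy with a task
                return False
    return True
```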

Actual Behavior

Currently, there is only one worker that utilizes both GPUs, which limits the flexibility in task management and GPU utilization.

Environment

OS: Ubuntu 22.04.4 LTS
ClearML Version: WebApp: 1.15.0-472 • Server: 1.15.0-472 • API: 2.29
GPU: 2 x Tesla V100-FHHL-16GB

Thank you for your assistance!