allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0
229 stars · 89 forks

Unable to Create Multiple Agents on Specified GPU #207

Open konstantinator opened 1 month ago

konstantinator commented 1 month ago

Hi!

I'm trying to create multiple agents on different GPUs. When I don’t specify a particular GPU, I can create many agents as shown below:

clearml-agent daemon --detached --queue hello_queue --docker hello_image
{worker created with worker ID "agent:0"}
clearml-agent daemon --detached --queue hello_queue --docker hello_image
{worker created with worker ID "agent:1"}

This creates two worker agents with different worker IDs.

However, both workers end up using the first of my two available GPUs.

But when I try to create multiple agents on a specific GPU like this:

clearml-agent daemon --detached --queue hello_queue --docker hello_image --gpus 1 # creating the first worker
{worker created with worker ID "agent:gpu1"}
clearml-agent daemon --detached --queue hello_queue --docker hello_image --gpus 1 # creating the second worker

I encounter an error while creating the second worker; the system reports that the worker ID "agent:gpu1" is already in use. Is there a way to automatically assign indices to such workers, perhaps like "agent:0-gpu1"?

Or is there another way to create multiple agents using a specific GPU?
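For example, something along these lines, assuming the worker ID can be overridden explicitly via the `CLEARML_WORKER_ID` environment variable (this is my assumption based on the agent docs, not something I have verified end to end):

```shell
# Possible workaround (assumes CLEARML_WORKER_ID overrides the
# auto-generated worker ID -- check the ClearML Agent docs):
CLEARML_WORKER_ID="agent:0-gpu1" clearml-agent daemon --detached --queue hello_queue --docker hello_image --gpus 1
CLEARML_WORKER_ID="agent:1-gpu1" clearml-agent daemon --detached --queue hello_queue --docker hello_image --gpus 1
```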

ainoam commented 1 month ago

@konstantinator can you elaborate on the use case that leads you to running multiple agents concurrently on the same GPU?

konstantinator commented 1 month ago

> @konstantinator can you elaborate on the use case that leads you to running multiple agents concurrently on the same GPU?

Hi @ainoam! I have 2 GPUs, and my current training configuration uses just under half of the video memory of one GPU. By my estimate, I can run 4 workers concurrently to tune training hyperparameters (such as the learning rate).
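To make the capacity estimate above concrete, here is a tiny sketch with illustrative numbers (the memory figures are assumptions for the example, not measurements from the actual setup):

```python
# Illustrative capacity estimate: how many training workers fit across
# the available GPUs, given per-job memory use of "just under half" a GPU.
gpu_count = 2
gpu_memory_gb = 24   # assumed memory per GPU
job_memory_gb = 11   # assumed per-job usage, just under half of one GPU

workers_per_gpu = gpu_memory_gb // job_memory_gb  # 2 jobs fit per GPU
total_workers = gpu_count * workers_per_gpu       # 4 workers overall
print(total_workers)  # -> 4
```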