Select which gpu to train on

Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.

https://www.supergradients.com

Apache License 2.0

Select which gpu to train on #913

Closed eroshacinas closed 1 year ago

eroshacinas commented 1 year ago

Is your feature request related to a problem? Please describe.

I am trying to train YOLO-NAS in an environment with multiple GPUs (device=0 and 1), but I only want to use one of them. Is there a way to train the model specifically on the 2nd GPU (device=1)?

Describe the solution you'd like

Train the model using only the 2nd GPU.

Describe alternatives you've considered

Nothing really.

BloodAxe commented 1 year ago

export CUDA_VISIBLE_DEVICES=1
python -m super_gradients.train_from_recipe ...

eroshacinas commented 1 year ago

Hi. Is there a way to do it without using a recipe or through trainer.train?

BloodAxe commented 1 year ago

You can always pass the target device to the Trainer instance manually if you want: trainer = Trainer(..., device="cuda:1")

eroshacinas commented 1 year ago

Already tried, but it's giving me a KeyError: "Trainer does not accept anymore 'device' and 'multi_gpu' as argument. Both should instead be passed to super_gradients.setup_device(device=..., multi_gpu=..., num_gpus=...)"

Louis-Dupont commented 1 year ago

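You can set it with an environment variable at the beginning of your script:

import os

# Set this before importing anything that may initialize CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = "1"  # expose only the 2nd GPU

from super_gradients import Trainer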

# Unchanged
trainer = Trainer(...)
trainer.train(...)
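
If you want something a bit more defensive, here is a minimal sketch of the same idea wrapped in a helper. The name `pin_gpu` is hypothetical (it is not part of super-gradients). It assumes that `CUDA_VISIBLE_DEVICES` only takes effect if it is set before CUDA is initialized in the process, so it refuses to run when PyTorch has already initialized CUDA. It also sets `CUDA_DEVICE_ORDER=PCI_BUS_ID`, a standard CUDA environment variable, so that the index should match the ordering shown by `nvidia-smi`.

```python
import os
import sys


def pin_gpu(index: int) -> str:
    """Expose a single physical GPU to this process (hypothetical helper).

    Must be called before anything initializes CUDA.
    """
    if index < 0:
        raise ValueError("GPU index must be non-negative")

    # If torch is already imported AND has initialized CUDA, changing the
    # environment variable now would have no effect on this process.
    torch = sys.modules.get("torch")
    if torch is not None and torch.cuda.is_initialized():
        raise RuntimeError("CUDA is already initialized; call pin_gpu() earlier")

    # Enumerate devices by PCI bus id so indices line up with nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


pin_gpu(1)  # expose only the 2nd GPU, then import and build the Trainer as usual
```

Call it at the very top of the script, before importing super_gradients or torch, and leave the Trainer code unchanged.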

eroshacinas commented 1 year ago

os.environ['CUDA_VISIBLE_DEVICES'] = "1"

This solved it, thank you!

Louis-Dupont commented 1 year ago

Already tried, but it's giving me a KeyError: "Trainer does not accept anymore 'device' and 'multi_gpu' as argument. Both should instead be passed to super_gradients.setup_device(device=..., multi_gpu=..., num_gpus=...)"

Trainer(device=...) is legacy and will soon be removed from SG.

The same can be done with

from super_gradients.training.utils.distributed_training_utils import setup_device

setup_device(device=...)

But this is only useful if you want to train on CPU or with multiple GPUs (DP/DDP). The default behavior is single GPU, so you don't need to call it in your case.
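
One thing that can be confusing with the `CUDA_VISIBLE_DEVICES` approach: CUDA renumbers the visible devices starting from 0, so with `CUDA_VISIBLE_DEVICES="1"` the 2nd physical GPU shows up inside the process as `cuda:0`. The sketch below only illustrates that renumbering. The function name `logical_to_physical` is hypothetical, and the parsing is simplified (it does not validate the entries the way the CUDA runtime does).

```python
def logical_to_physical(env_value: str) -> dict:
    """Map in-process device names to the ids listed in CUDA_VISIBLE_DEVICES.

    Simplified illustration: entries are taken as-is, in order.
    """
    ids = [token.strip() for token in env_value.split(",") if token.strip()]
    return {f"cuda:{i}": physical for i, physical in enumerate(ids)}


print(logical_to_physical("1"))    # {'cuda:0': '1'}
print(logical_to_physical("1,0"))  # {'cuda:0': '1', 'cuda:1': '0'}
```

So a single-GPU run that lands on `cuda:0` after setting `CUDA_VISIBLE_DEVICES="1"` is in fact using the 2nd physical GPU.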