Open · sakemin opened this issue 1 year ago
Any help here? Trying to do inference with multiple GPUs...
And "dora run -d" not work I have 8 GPUs and my script is as following:
dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music
however, it only ever finds one worker:
[08-17 10:28:15][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-17 10:28:13][dora.distrib][INFO] - world_size is 1, skipping init.
[08-17 10:28:13][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 4284c302
So distributed inference is not supported. Distributed training should work out of the box with dora run -d. Can you check in Python:
import torch
print(torch.cuda.device_count())
Multi-node training is supported with SLURM, but without SLURM it is a bit more complex...
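A slightly extended version of that check can help rule out a common culprit (this is a generic sanity check, not something specific to Dora):

import os
import torch

# dora run -d spawns one worker per GPU that PyTorch can see.
print(torch.cuda.device_count())
# If this is set to a single index (e.g. "0"), only one GPU is visible and world_size stays 1.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))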
Same issue here when trying to train MusicGen on multiple GPUs with
dora run -d solver=musicgen/musicgen_base
and got
[08-30 12:17:47][dora.distrib][INFO] - world_size is 1, skipping init.
but I actually have 2 GPUs:
>>> import torch
>>> print(torch.cuda.device_count())
2
@yawnzh did you ever figure out a workaround for this? I want to train MusicGen on cloud GPUs that don't have SLURM set up.
And "dora run -d" not work I have 8 GPUs and my script is as following:
dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music
however, it always can only find one workers:
[�[36m08-17 10:28:15�[0m][�[34mroot�[0m][�[32mINFO�[0m] - Getting pretrained compression model from HF facebook/encodec_32khz�[0m [�[36m08-17 10:28:13�[0m][�[34mdora.distrib�[0m][�[32mINFO�[0m] - world_size is 1, skipping init.�[0m [�[36m08-17 10:28:13�[0m][�[34mflashy.solver�[0m][�[32mINFO�[0m] - Instantiating solver MusicGenSolver for XP 4284c302�[0m
same issue here when trying to train musicgen with multiple GPUs with
dora run -d solver=musicgen/musicgen_base
and got[08-30 12:17:47][dora.distrib][INFO] - world_size is 1, skipping init.
but actually I have 2 gpus>>> import torch >>> print(torch.cuda.device_count()) 2
Almost the same here. Did you solve this problem? @Maggione @yawnzh
Setting CUDA_VISIBLE_DEVICES before the dora command may work.
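For example, with 8 GPUs, something like the following (this just makes all devices explicitly visible to the dora process; it is not an official fix):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music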
Hello,
I have 8 A40 (48 GB) GPUs, and I want to use them all for training and inference.
But I can't find any multi-GPU handling such as DataParallel or DistributedDataParallel in the train.py code; maybe it is wrapped inside Dora.
And for inference, I used the code from MUSICGEN.md below, with some tweaks. But it seems the MusicGen model is not a subclass of nn.Module; it has an lm model inside it, so if I wrap it as model = nn.DataParallel(model), it doesn't seem to use multiple GPUs.

Should I wrap model.lm as nn.DataParallel(model.lm) instead? I wonder whether the code would still work, since it currently calls lm.generate(); maybe that would have to become lm.module.generate(). Is there any pre-existing multi-GPU code in the repo?
Thanks.
Best, Sake
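For reference, since MusicGen is a plain Python wrapper rather than an nn.Module, one workaround for multi-GPU inference is process-level data parallelism: load an independent MusicGen instance on each GPU and split the prompts across them, instead of wrapping anything in nn.DataParallel. Below is a minimal sketch, assuming the public MusicGen.get_pretrained / set_generation_params / generate API and torch.multiprocessing; it is not an official multi-GPU path in the repo.

import torch
import torch.multiprocessing as mp
from audiocraft.models import MusicGen


def worker(gpu_id, prompts, out_queue):
    # One independent MusicGen instance per GPU; no nn.DataParallel wrapping needed.
    model = MusicGen.get_pretrained('facebook/musicgen-small', device=f'cuda:{gpu_id}')
    model.set_generation_params(duration=8)
    wav = model.generate(prompts)  # tensor of shape [batch, channels, time] on this GPU
    out_queue.put((gpu_id, wav.cpu()))


if __name__ == '__main__':
    prompts = ['lofi hip hop beat', 'orchestral film score',
               'acid techno loop', 'acoustic folk guitar']
    n_gpus = torch.cuda.device_count()
    # Round-robin the prompts across the available GPUs.
    shards = [prompts[i::n_gpus] for i in range(n_gpus)]

    ctx = mp.get_context('spawn')
    out_queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, shard, out_queue))
             for i, shard in enumerate(shards) if shard]
    for p in procs:
        p.start()
    # Collect results before joining so the queue does not block the children.
    results = [out_queue.get() for _ in procs]
    for p in procs:
        p.join()

Each process holds its own copy of the weights, so GPU memory usage scales with the number of processes, but the shards generate fully in parallel and nothing inside lm.generate() has to be modified.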