Closed · amejri closed this issue 4 years ago
can you please provide more details? (As in bug template).
But this error can occur if you are using a non-DDP backend when training on multiple GPUs.
You are right, the code works fine with a single GPU. How do I run it on multiple GPUs?
Hi, I have fine-tuned the NeMo model and am trying to save it using torch.save(quartznet, 'fine_tuned'), but I end up with the same error. Do you have any idea about it?
Hello, to save the model I just used this command: quartznet.save_to(path)
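For anyone hitting the torch.save pickling error mentioned above: NeMo models are meant to be serialized through their own save_to / restore_from methods rather than torch.save. A minimal sketch, assuming quartznet is the fine-tuned model from this thread and the file name is a placeholder:

```python
import nemo.collections.asr as nemo_asr

# Save the fine-tuned model as a .nemo archive (weights + config together),
# which avoids pickling the lambdas inside FilterbankFeatures.
quartznet.save_to("fine_tuned.nemo")

# Restore it later via the model class's restore_from method.
model = nemo_asr.models.EncDecCTCModel.restore_from("fine_tuned.nemo")
```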
Hi @amejri, how did you fix the error when training on multiple GPUs?
I also ran into this problem. Checking the PyTorch Lightning docs, I found that it can be solved by passing accelerator='ddp' to the Trainer. The code then works fine on multiple GPUs, since the docs explain that "('ddp') is DistributedDataParallel (each gpu on each node trains, and syncs grads)": https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html
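For reference, a minimal sketch of that fix with the older (pre-2.0) PyTorch Lightning API; the epoch count and GPU count here are placeholder values:

```python
import pytorch_lightning as pl

# Older Lightning API: 'accelerator' selects the distributed backend.
# 'ddp' launches one process per GPU and synchronizes gradients.
trainer = pl.Trainer(
    gpus=2,              # number of GPUs to train on
    accelerator="ddp",   # DistributedDataParallel backend
    max_epochs=50,
)
# trainer.fit(model)     # model: your NeMo / Lightning module
```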
That is it!! thanks
As per the PyTorch Lightning docs there is no 'ddp' option for accelerator: https://pytorch-lightning.readthedocs.io/en/latest/accelerators/gpu_basic.html I have also tried passing accelerator="gpu" and devices=4, but I get the same error.
PyTorch Lightning added another argument for this: trainer.strategy="ddp". All of these flags are set in our configs; please refer to them.
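With recent PyTorch Lightning versions the equivalent sketch looks like this; the device and epoch counts are placeholders:

```python
import pytorch_lightning as pl

# Newer Lightning API: 'accelerator' names the hardware and 'strategy'
# names the distributed backend, so multi-GPU DDP training becomes:
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",    # DistributedDataParallel, one process per GPU
    max_epochs=50,
)
# trainer.fit(model)
```

In NeMo these map onto the trainer section of the experiment config, which is why the reply above points at the shipped config files rather than hand-written Trainer code.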
I'm facing the same error with FastPitch TTS, any help? NeMo image: 22.12
Hi guys, I am trying to train an ASR with nemo. I preprocessed data as shown on the tutorial. but when I tried to train the model, I have the following error : AttributeError: Can't pickle local object 'FilterbankFeatures.init..'.
Can someone help me ?
Thanks in advance.