This is not the correct way to use distributed training in NeMo. We use PyTorch Lightning, which sets up DDP and torch.distributed for you. Simply call your script with python ABC.py --config-path xyz/ --config-name something.yaml.
You can look at the NeMo tutorials to see how we set up training with our scripts.
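For example (a sketch using the placeholder names above; the exact keys, e.g. trainer.devices, trainer.accelerator, or trainer.gpus on older releases, depend on the NeMo/Lightning version), a single-node multi-GPU run is requested through a config override rather than an external launcher:
python ABC.py --config-path xyz/ --config-name something.yaml trainer.devices=2 trainer.accelerator=gpu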
Thanks for your reply. I have already trained on one node with multiple GPUs. @titu1994 Here's another question: how do I set up multi-node information such as MASTER_ADDR or MASTER_PORT with PyTorch Lightning? Are there any available tutorials? Thanks a lot.
It would be better to ask the folks at PyTorch Lightning, since they might have the solution.
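In case it helps anyone finding this later, a minimal multi-node sketch (assuming a recent PyTorch Lightning; the environment variables below are the ones Lightning's default cluster environment reads, and the trainer.* keys may differ by NeMo version) is to export the rendezvous variables on every node and pass the node count to the Trainer via the Hydra config:
# on every node (example values)
export MASTER_ADDR=10.0.0.1      # IP of the rank-0 node
export MASTER_PORT=29500         # any free port, same on all nodes
export NODE_RANK=0               # 0 on the first node, 1 on the second, ...
python ABC.py --config-path xyz/ --config-name something.yaml trainer.num_nodes=2 trainer.devices=8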
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
The following is my command to launch distributed training in NeMo:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnode=1 examples/asr/asr_ctc/speech_to_text_ctc.py --config-path=examples/asr/conf/quartznet/ --config-name=quartznet_15x5
torch version: 2.0.1+cu118, GPU: A100
Then an error occurs due to an unrecognized argument: --local-rank.
How do I change the source code from "--local_rank" to "--local-rank"?
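Following the earlier advice in this thread, one way to sidestep the unrecognized --local-rank argument is to drop torch.distributed.launch entirely and let Lightning spawn the per-GPU processes itself (a sketch; the trainer key may be trainer.gpus instead of trainer.devices on older NeMo releases):
python3 examples/asr/asr_ctc/speech_to_text_ctc.py --config-path=examples/asr/conf/quartznet/ --config-name=quartznet_15x5 trainer.devices=2 trainer.num_nodes=1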