This is not the correct way to use distributed training in NeMo. We use PyTorch Lightning, which sets up DDP and torch.distributed for you. Simply call your script with python ABC.py --config-path xyz/ --config-name something.yaml.
You can look at the NeMo tutorials to see how we set up training with our scripts.
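For example (a sketch using the placeholder names above; the exact keys, e.g. trainer.devices, trainer.accelerator, or trainer.gpus on older releases, depend on the NeMo/Lightning version), a single-node multi-GPU run is requested through a config override rather than an external launcher:
python ABC.py --config-path xyz/ --config-name something.yaml trainer.devices=2 trainer.accelerator=gpu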
Thanks for your reply. I have already trained on one node with multiple GPUs. @titu1994 Here's another question: how do I set up multi-node information such as MASTER_ADDR or MASTER_PORT with PyTorch Lightning? Are there any available tutorials? Thanks a lot.
It would be better to ask the folks at PyTorch Lightning, since they might have the solution.
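In case it helps anyone finding this later, a minimal multi-node sketch (assuming a recent PyTorch Lightning; the environment variables below are the ones Lightning's default cluster environment reads, and the trainer.* keys may differ by NeMo version) is to export the rendezvous variables on every node and pass the node count to the Trainer via the Hydra config:
# on every node (example values)
export MASTER_ADDR=10.0.0.1      # IP of the rank-0 node
export MASTER_PORT=29500         # any free port, same on all nodes
export NODE_RANK=0               # 0 on the first node, 1 on the second, ...
python ABC.py --config-path xyz/ --config-name something.yaml trainer.num_nodes=2 trainer.devices=8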
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
The following is my command to launch distributed training in NeMo:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnode=1 examples/asr/asr_ctc/speech_to_text_ctc.py --config-path=examples/asr/conf/quartznet/ --config-name=quartznet_15x5
torch version: 2.0.1+cu118, GPU: A100
Then an error occurs due to an unrecognized argument: --local-rank.
How do I change the source code from "--local_rank" to "--local-rank"?
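Following the earlier advice in this thread, one way to sidestep the unrecognized --local-rank argument is to drop torch.distributed.launch entirely and let Lightning spawn the per-GPU processes itself (a sketch; the trainer key may be trainer.gpus instead of trainer.devices on older NeMo releases):
python3 examples/asr/asr_ctc/speech_to_text_ctc.py --config-path=examples/asr/conf/quartznet/ --config-name=quartznet_15x5 trainer.devices=2 trainer.num_nodes=1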