646312715 opened this issue 2 years ago
That is because the original command is, by default, meant to be run on many GPUs.
See the distributed_training section of the config you used, fairseq/examples/hubert/config/pretrain/hubert_base_librispeech.yaml. It sets:
distributed_world_size: 32
distributed_port: 29671
nprocs_per_node: 8
With recent versions of fairseq, you can simply delete those entries and it should run fine (fairseq detects your environment automatically).
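For example, after deleting those three entries the distributed_training block would reduce to something like the sketch below (a sketch only, keeping the other keys as they are quoted later in this thread; your copy of hubert_base_librispeech.yaml may contain additional settings):

distributed_training:
  ddp_backend: c10d
  find_unused_parameters: true
  # distributed_world_size, distributed_port and nprocs_per_node removed,
  # so fairseq can detect the local GPU setup automatically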
Oh! Thank you for reminding me. I was actually looking for where the multi-GPU settings are configured; now I know they are in the YAML file. Thank you again.
distributed_training:
  ddp_backend: c10d
  find_unused_parameters: true
  distributed_world_size: 2  # number of GPUs
  distributed_port: 29671
  nprocs_per_node: 2  # number of GPUs
If I want to use two GPUs, how should this part be modified?
Thanks. Have you solved it? I am running into the same problem. iguohm@163.com
❓ Questions and Help
What is your question?
When I followed the instructions to run HuBERT pre-training, a runtime error occurred:
Code
Traceback (most recent call last):
  File "fairseq_cli/hydra_train.py", line 27, in hydra_main
    _hydra_main(cfg)
  File "fairseq_cli/hydra_train.py", line 56, in _hydra_main
    distributed_utils.call_main(cfg, pre_main, **kwargs)
  File "/home/oem/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/oem/fairseq/fairseq_cli/train.py", line 147, in main
    trainer = Trainer(cfg, task, model, criterion, quantizer)
  File "/home/oem/fairseq/fairseq/trainer.py", line 164, in __init__
    if self.data_parallel_rank == 0:
  File "/home/oem/fairseq/fairseq/trainer.py", line 197, in data_parallel_rank
    return distributed_utils.get_data_parallel_rank()
  File "/home/oem/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
    return get_rank(get_data_parallel_group())
  File "/home/oem/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
    return dist.get_rank(group=group)
  File "/home/oem/.conda/envs/squidTorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 844, in get_rank
    default_pg = _get_default_group()
  File "/home/oem/.conda/envs/squidTorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
What have you tried?
I found some answers online saying that this is caused by the distributed-training settings. My machine has only one GPU, so I don't know where to modify the config. Or there may be another cause; I hope to get an answer.
What's your environment?
How you installed fairseq (pip, source): git