facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

A runtime error occurred while running Hubert. Default process group has not been initialized, please make sure to call init_process_group. #4602

Open 646312715 opened 2 years ago

646312715 commented 2 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

When I followed the instructions to run Hubert pre-training, a runtime error occurred:

Code

Traceback (most recent call last):
  File "fairseq_cli/hydra_train.py", line 27, in hydra_main
    _hydra_main(cfg)
  File "fairseq_cli/hydra_train.py", line 56, in _hydra_main
    distributed_utils.call_main(cfg, pre_main, **kwargs)
  File "/home/oem/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/oem/fairseq/fairseq_cli/train.py", line 147, in main
    trainer = Trainer(cfg, task, model, criterion, quantizer)
  File "/home/oem/fairseq/fairseq/trainer.py", line 164, in __init__
    if self.data_parallel_rank == 0:
  File "/home/oem/fairseq/fairseq/trainer.py", line 197, in data_parallel_rank
    return distributed_utils.get_data_parallel_rank()
  File "/home/oem/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
    return get_rank(get_data_parallel_group())
  File "/home/oem/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
    return dist.get_rank(group=group)
  File "/home/oem/.conda/envs/squidTorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 844, in get_rank
    default_pg = _get_default_group()
  File "/home/oem/.conda/envs/squidTorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
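
For illustration (not part of the original report): the last frames show torch.distributed.get_rank() being reached before any default process group exists, which is exactly what PyTorch complains about. A minimal standalone sketch of the same behaviour, assuming a single machine and the gloo backend:

import torch.distributed as dist

# Without init_process_group(), asking for a rank raises the error above.
try:
    dist.get_rank()
except RuntimeError as err:
    print(err)  # "Default process group has not been initialized ..."

# Once a (here single-process) group is initialized, the call succeeds.
dist.init_process_group(
    backend="gloo",                        # CPU-friendly backend for this sketch
    init_method="tcp://127.0.0.1:29500",   # arbitrary free port, illustration only
    rank=0,
    world_size=1,
)
print(dist.get_rank())  # prints 0
dist.destroy_process_group()

fairseq is expected to perform this initialization itself (in fairseq/distributed/utils.py) when the distributed settings match the hardware it is actually launched on.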

What have you tried?

I found some answers on the Internet saying that it is caused by the distributed-training settings. My computer has only one GPU, so I don't know what to modify. Or there may be other reasons; I hope to get an answer.

What's your environment?

gmryu commented 2 years ago

It is because the original command is, by default, meant to be used with many GPUs. See the distributed_training section of the config you used, fairseq/examples/hubert/config/pretrain/hubert_base_librispeech.yaml. It says:

  distributed_world_size: 32
  distributed_port: 29671
  nprocs_per_node: 8

For fairseq these days, you can just delete those lines and it should run fine (it detects your environment automatically).
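
As a sketch of that suggestion (my own illustration, not quoted from the config file), the distributed_training section of hubert_base_librispeech.yaml would keep its remaining keys and simply lose the three multi-GPU values, so fairseq can detect the single local GPU on its own:

distributed_training:
  find_unused_parameters: true     # keep the section's other keys as they are
  # distributed_world_size: 32     # removed: let fairseq infer the world size
  # distributed_port: 29671        # removed: not needed on a single machine
  # nprocs_per_node: 8             # removed: only one local GPU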

646312715 commented 2 years ago

Ohh! Thank you for reminding me. I was actually looking for where the multi-GPU settings live; now I know they are in the YAML config. Thank you again.

Peter-SungwooCho commented 5 months ago

distributed_training:
  ddp_backend: c10d
  find_unused_parameters: true
  distributed_world_size: 2 # num of gpu
  distributed_port: 29671
  nprocs_per_node: 2 # num of gpu

If I want to use two GPUs, how should this part be modified?
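
Pulling together the earlier advice in this thread (my own summary, not an authoritative answer): for two GPUs on one machine, either set both GPU-count fields to 2 as above, or delete the three values and let fairseq detect the two local GPUs automatically. A sketch:

distributed_training:
  ddp_backend: c10d
  find_unused_parameters: true
  distributed_world_size: 2   # total number of GPUs across all nodes
  nprocs_per_node: 2          # GPUs per node; 2 for a single two-GPU machine
  # distributed_port can usually be dropped for a single-machine run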