bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

The given group does not exist pytorch #379

Open germanjke opened 1 year ago

germanjke commented 1 year ago

Do you know why I get this problem with pretrain_gpt_single_node.sh? I set N_GPUS=1 and got:

File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 191, in _get_group_rank
    raise RuntimeError("The given group does not exist")
RuntimeError: The given group does not exist

It is raised from:

Megatron-DeepSpeed/megatron/training.py", line 400, in setup_model_and_optimizer
    model = get_model(model_provider_func)

I'm using the NGC Docker image with PyTorch and Apex, plus DeepSpeed and the other packages installed from your requirements.txt.

My setup is 2x 3090.

LYF915 commented 1 year ago

I also encountered this problem. Did you solve it?

zql022 commented 10 months ago

Me too. How did you solve this problem?