RuntimeError: Timed out initializing process group in store based barrier on rank 2

LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.

https://open-assistant.io

Apache License 2.0


RuntimeError: Timed out initializing process group in store based barrier on rank 2 #3626

Open SingL3 opened 1 year ago

SingL3 commented 1 year ago

I am trying to run pretraining for LLaMA 30B. Here is the command I am running:

deepspeed trainer_sft.py --configs defaults llama-30b-pretrain pretrain --cache_dir $DATA_PATH --output_dir $MODEL_PATH/llama-30b-pre --deepspeed

After the model was loaded, the run hung for a long time (I think about 30 minutes, since the default PyTorch timeout is 30 minutes). Then this error was raised:

RuntimeError: Timed out initializing process group in store based barrier on rank 2 # same on all ranks

Any solutions?

andreaskoepf commented 1 year ago

We have not seen this error during our training runs. Could you try smaller/different models first? Are you using the latest version of deepspeed? Which GPU and cuda version are you using? Do you have access to a different machine on which you could cross-check?

SingL3 commented 1 year ago

@andreaskoepf Yes, I am on the latest version (at least as of last week), including deepspeed. I am using 8x A100 (80G) with CUDA 11.7. I tried reducing the pretrain datasets (keeping only alpaca_gpt4) and the run succeeded, so I don't think the model is the cause.
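
Since the hang matches the 30-minute default timeout mentioned above, one thing that might be worth trying is a longer process group timeout, to see whether the slower ranks eventually reach the barrier. The following is only a minimal sketch, not something taken from trainer_sft.py: the helper names `build_init_kwargs` and `init_with_longer_timeout` are hypothetical, the 120-minute value is an arbitrary assumption, and it is also an assumption that the process group is created by your own code rather than by deepspeed or the trainer. It relies on the `timeout` argument of `torch.distributed.init_process_group`, which takes a `datetime.timedelta`.

```python
import datetime


def build_init_kwargs(timeout_minutes=120, backend="nccl"):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: the 120-minute default is an arbitrary choice
    that is simply longer than the 30-minute PyTorch default.
    """
    return {
        "backend": backend,
        "timeout": datetime.timedelta(minutes=timeout_minutes),
    }


def init_with_longer_timeout(timeout_minutes=120):
    """Initialise the default process group with a longer timeout.

    Assumes the launcher has already set RANK, WORLD_SIZE, MASTER_ADDR
    and MASTER_PORT in the environment (the env:// init method).
    """
    # Imported lazily so the helper above can be used without torch.
    import torch.distributed as dist

    if not dist.is_initialized():
        dist.init_process_group(**build_init_kwargs(timeout_minutes))
```

If something like this is used, `init_with_longer_timeout()` would need to run before the training framework initialises distributed state on its own. A longer timeout does not explain why the ranks get out of step; it only helps to tell a slow step, such as dataset preparation on one rank, apart from a real deadlock.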