Open SingL3 opened 1 year ago
We have not seen this error during our training runs. Could you try smaller or different models first? Are you using the latest version of DeepSpeed? Which GPU and CUDA version are you using? Do you have access to a different machine on which you could cross-check?
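For the version questions, a small script along these lines could gather the relevant details in one go. This is only a sketch: `package_version` and `collect_env_info` are hypothetical helper names, and it assumes the packages are installed under the distribution names `torch` and `deepspeed`. Anything that is missing is reported as "not installed" rather than raising.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_env_info():
    """Collect Python, PyTorch, DeepSpeed, CUDA and GPU details (hypothetical helper)."""
    info = {
        "python": platform.python_version(),
        "torch": package_version("torch"),
        "deepspeed": package_version("deepspeed"),
        "cuda": "unknown",
        "gpus": [],
    }
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot query CUDA or the GPUs.
        return info
    # torch.version.cuda is None on CPU-only builds.
    info["cuda"] = torch.version.cuda or "none (CPU-only build)"
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

DeepSpeed's own `ds_report` command prints similar information and may be enough on its own.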
I am trying to pretrain LLaMA 30B. Here is the command I am running:
After the model was loaded, it got stuck for a long time (about 30 minutes, I think, since the default PyTorch timeout is 30 minutes). Then this error was raised:
Any solutions?
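One thing that might help narrow down where the run hangs is turning on more verbose distributed logging before launching. The sketch below only builds an environment with `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` set. Those are existing NCCL and PyTorch environment variables, but `debug_env` is a hypothetical helper name, and whether the extra logs reveal the cause here is an assumption.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with verbose distributed logging enabled."""
    # Copy so the caller's mapping (or os.environ) is not modified.
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"                 # NCCL prints setup and communicator info
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # PyTorch adds extra collective checks and logs
    return env


if __name__ == "__main__":
    env = debug_env()
    print(env["NCCL_DEBUG"], env["TORCH_DISTRIBUTED_DEBUG"])
```

The returned mapping can be passed as `env=` to `subprocess.run` together with the usual launch command. On multi-node setups the variables may need to be propagated to the other nodes separately.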