CannonWWW closed this issue 3 weeks ago
Hello, we checked the same configuration on 8 NVIDIA RTX A100 GPUs and it works fine. Could you please pull our latest code and run it again? Feel free to report back. Thank you!
Thank you for your comment. The issue I encountered might be related to hardware failure. After carefully reviewing the code, I have successfully run the project.
I set the environment variables as follows in train_dist.sh in the gpt_hf folder:
I used a cluster with 8 NVIDIA RTX A6000 GPUs, and the CUDA version is release 11.8, V11.8.89. Then I encountered the following error:
Could you please help me resolve this issue? Or could you provide some possible solutions? Thank you for your help and support!