Open ShuJackson opened 3 years ago
@TheAtticusProject
This is a very low-level issue. Unfortunately, "NCCL Error 1: unhandled cuda error" only says that a CUDA call failed inside NCCL, so the message itself reveals little about the cause. I can only suggest updating your drivers or checking whether a more detailed error log is available, but even then this is most likely a CUDA or hardware issue.
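One way to get a more detailed log is to rerun the script with NCCL's and CUDA's diagnostic environment variables set. The sketch below is only an illustration: `debug_env` and `run_with_debug` are hypothetical helper names, and it assumes the training is launched through `./run.sh` as in the report. `NCCL_DEBUG` and `CUDA_LAUNCH_BLOCKING` are standard NCCL and CUDA environment variables.

```python
import os
import subprocess


def debug_env(base=None):
    """Return a copy of the environment with verbose NCCL/CUDA diagnostics enabled."""
    env = dict(os.environ if base is None else base)
    # Ask NCCL to print initialization details and more context around failures.
    env["NCCL_DEBUG"] = "INFO"
    # Make CUDA kernel launches synchronous so an error is reported
    # at the call that caused it rather than at some later point.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_debug(cmd):
    """Run cmd (a list of arguments) with diagnostics enabled; return its exit code."""
    return subprocess.run(cmd, env=debug_env()).returncode


# Example (assumes ./run.sh exists in the current directory):
# run_with_debug(["./run.sh"])
```

The extra output will not fix the problem by itself, but it may show which device or call fails, which helps tell a driver problem from a hardware one.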
How do I run the script? Which files need to be modified, and how do I execute the code? Could you give me a few pointers?
When I run the training script (./run.sh), it aborts with an instance of 'std::runtime_error', what(): NCCL Error 1: unhandled cuda error
This happens every time in the Evaluation step of the train.py script - after the 'convert squad examples to features' step completes successfully and right after 'Evaluating: 0%' is printed.
I have made sure torch can pick up the cuda info: