TheAtticusProject / cuad

CUAD (NeurIPS 2021)
https://www.atticusprojectai.org/cuad

NCCL Error 1: unhandled cuda error #9

Open ShuJackson opened 3 years ago

ShuJackson commented 3 years ago

When I run the training script (./run.sh), it fails with an instance of 'std::runtime_error' what(): NCCL Error 1: unhandled cuda error

This happens every time during the evaluation step of the train.py script: the 'convert squad examples to features' step completes successfully, and the error appears right after 'Evaluating: 0%' is printed.

I have made sure torch can see CUDA:

print(torch.cuda.is_available())  # True
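`torch.cuda.is_available()` only confirms that a CUDA runtime and at least one device are visible. A slightly broader check, sketched below, lists every visible GPU and runs a tiny operation on each one, which may help show whether a single device fails on its own. The function name `cuda_report` and the layout of the returned dictionary are illustrative assumptions, not part of this repository.

```python
def cuda_report():
    """Collect basic CUDA/NCCL facts; returns a plain dict (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment, so nothing else can be checked
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "devices": [],
    }
    # NCCL is only present in builds with distributed support
    if torch.distributed.is_available():
        report["nccl_available"] = torch.distributed.is_nccl_available()

    for i in range(report["device_count"]):
        entry = {"index": i, "name": torch.cuda.get_device_name(i)}
        try:
            # A tiny allocation plus reduction forces a real kernel launch on device i
            entry["ok"] = torch.ones(4, device=f"cuda:{i}").sum().item() == 4.0
        except RuntimeError as exc:
            entry["ok"] = False
            entry["error"] = str(exc)
        report["devices"].append(entry)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If every device reports `ok: True` here but the multi-GPU run still fails, the problem is more likely in inter-GPU communication than in any single device.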


ShuJackson commented 3 years ago

@TheAtticusProject

hendrycks commented 3 years ago

This is a very low-level issue, and unfortunately "NCCL Error 1: unhandled cuda error" means that even CUDA does not know what it is. I could only suggest updating drivers or seeing if there is a more detailed error log, but even then this would be a CUDA or hardware issue.
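One way to look for the more detailed error log mentioned above is to rerun the script with NCCL's own logging enabled. The sketch below is only an assumption about how one might do that: it sets the `NCCL_DEBUG=INFO` and `CUDA_LAUNCH_BLOCKING=1` environment variables and captures all output to a file. The helper names and the log file name `nccl_debug.log` are hypothetical, and `./run.sh` is simply the command from the original report.

```python
import os
import subprocess
import sys


def nccl_debug_env(base=None):
    """Return a copy of the environment with verbose NCCL/CUDA diagnostics enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"          # make NCCL log its setup and failures
    env["CUDA_LAUNCH_BLOCKING"] = "1"   # report CUDA errors at the failing call
    return env


def run_with_debug(cmd, log_path="nccl_debug.log"):
    """Run cmd with the debug environment, writing stdout and stderr to log_path."""
    with open(log_path, "w") as log:
        proc = subprocess.run(
            cmd, env=nccl_debug_env(), stdout=log, stderr=subprocess.STDOUT
        )
    return proc.returncode


if __name__ == "__main__":
    # Example: python nccl_debug.py ./run.sh
    sys.exit(run_with_debug(sys.argv[1:] or ["./run.sh"]))
```

The resulting log usually contains lines prefixed with `NCCL INFO` or `NCCL WARN` shortly before the crash, and these tend to be more specific than the final "unhandled cuda error" message.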

Mei0211 commented 2 years ago

How do I run the script? Which files do I need to modify, and how do I execute the code? Could you give me some pointers?