HKUDS / GraphGPT

[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
https://arxiv.org/abs/2310.13023
Apache License 2.0

`graphgpt_stage1_lightning` hits an NCCL timeout error after the first training epoch finishes. #44

Closed smurf-1119 closed 4 months ago

smurf-1119 commented 6 months ago

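While running stage-1 training of the lightweight model you provide, I get an NCCL timeout error after the first epoch finishes. Is there any way to fix this? I am training on eight 4090 GPUs, and batchsize_per_device is 2 for both training and testing.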

[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.

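
For readers trying to interpret the watchdog message in the log above, here is a minimal sketch that pulls out its fields with the Python standard library. The helper name `parse_watchdog` and the returned keys are hypothetical, chosen only for illustration. The fields show a single-element ALLREDUCE that waited for the full 30-minute timeout, which may mean that some rank never reached the same collective, although the log alone does not confirm the cause.

```python
import re

# Watchdog message copied from the log above, rejoined into one line.
message = (
    "[Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, "
    "Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out."
)


def parse_watchdog(msg):
    """Extract rank, collective type, element count, and timing (hypothetical helper)."""
    rank = int(re.search(r"\[Rank (\d+)\]", msg).group(1))
    # Collect key=value pairs such as OpType=ALLREDUCE and Timeout(ms)=1800000.
    fields = dict(re.findall(r"(\w+(?:\(ms\))?)=(\w+)", msg))
    ran_ms = int(re.search(r"ran for (\d+) milliseconds", msg).group(1))
    timeout_ms = int(fields["Timeout(ms)"])
    return {
        "rank": rank,
        "op_type": fields["OpType"],
        "numel_in": int(fields["NumelIn"]),
        "timeout_minutes": timeout_ms / 60000,
        "overrun_ms": ran_ms - timeout_ms,
    }


info = parse_watchdog(message)
print(info)
```
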
tjb-tech commented 4 months ago

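Hello, from the error output you provided we cannot yet determine what the problem is. Could you share more of the error output? Alternatively, we suggest rerunning the training; this may be an intermittent failure that has little to do with the code.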
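
One way to collect the additional error output requested above is to turn on verbose logging before rerunning. The sketch below is an assumption about how this could be done, not something taken from the GraphGPT repository: the helper `enable_distributed_debug_logging` is hypothetical, while `NCCL_DEBUG`, `TORCH_DISTRIBUTED_DEBUG`, and `TORCH_CPP_LOG_LEVEL` are the environment variables documented by NCCL and PyTorch. They need to be set before the training process initializes its process group, for example at the very top of the training script or in the launching shell.

```python
import os

# Documented NCCL / PyTorch debug variables; the chosen levels are a suggestion.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",                 # NCCL prints setup and error details
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra consistency checks on collectives
    "TORCH_CPP_LOG_LEVEL": "INFO",        # surfaces C++ side distributed logs
}


def enable_distributed_debug_logging(env=os.environ):
    """Write the debug variables into env and return what was applied (hypothetical helper)."""
    for key, value in DEBUG_ENV.items():
        env[key] = value
    return {key: env[key] for key in DEBUG_ENV}


# Call this before importing or initializing anything that creates the process group.
applied = enable_distributed_debug_logging()
print(applied)
```

With these set, the per-rank logs from a rerun would be the natural thing to attach to the issue, since the original trace only shows the output of Rank 1.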