[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out.
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations
might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the
entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.
While running stage-1 training of the lightweight model you provided, I hit this NCCL timeout error right after the first epoch finished. Is there any way to fix it? I am training on eight 4090s, with batchsize_per_device set to 2 for both training and testing.
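For what it's worth, the log shows rank 1's ALLREDUCE hitting PyTorch's default 30-minute NCCL timeout (Timeout(ms)=1800000), which often means one rank is stuck or much slower than the others at an epoch boundary (e.g. during validation or checkpointing on rank 0 while other ranks wait). A common first step is to turn on NCCL diagnostics and make the failure surface cleanly instead of hanging; the sketch below is a config fragment, assuming a torchrun launch and a hypothetical train.py entry script:

```shell
# Verbose NCCL logging to locate which collective/rank stalls (assumption:
# you can rerun the failing job; the script name train.py is a placeholder).
export NCCL_DEBUG=INFO

# Fail fast with a Python-level error instead of a hard watchdog abort.
# On recent PyTorch the variable is TORCH_NCCL_* ; older releases used
# NCCL_ASYNC_ERROR_HANDLING / NCCL_BLOCKING_WAIT — set both to be safe.
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_ASYNC_ERROR_HANDLING=1

torchrun --nproc_per_node=8 train.py
```

If the stall is just a genuinely long per-rank step (e.g. single-rank evaluation at epoch end), you can also raise the timeout itself by passing `timeout=datetime.timedelta(hours=...)` to `torch.distributed.init_process_group`, though that hides rather than fixes a real desynchronization, so it's worth confirming with the `NCCL_DEBUG=INFO` output first.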