RulinShao / LightSeq

Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

Error when using Hugging Face Trainer #9

Open linyubupa opened 4 months ago

linyubupa commented 4 months ago

Is this project incompatible with Hugging Face's Trainer? When I use Trainer, the run hangs at "Initializing global memoery buffer." and then fails with the error below:

[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=133, OpType=_ALLGATHER_BASE, NumelIn=86511616, NumelOut=173023232, Timeout(ms)=600000) ran for 600752 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
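The fields in the watchdog line can be read off programmatically, which may help when comparing timeouts across ranks. Below is a minimal sketch that assumes the log format shown in this report; `parse_watchdog` is a hypothetical helper and not part of LightSeq or PyTorch. For an all-gather, the output size is normally the input size multiplied by the number of ranks in the group, so the ratio of `NumelOut` to `NumelIn` hints at how many ranks took part in the stalled collective.

```python
import re

# Watchdog line copied from the report above.
LOG = (
    "[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective "
    "operation timeout: WorkNCCL(SeqNum=133, OpType=_ALLGATHER_BASE, "
    "NumelIn=86511616, NumelOut=173023232, Timeout(ms)=600000) "
    "ran for 600752 milliseconds before timing out."
)


def parse_watchdog(line):
    """Hypothetical helper: extract the WorkNCCL(...) fields and elapsed time."""
    match = re.search(r"WorkNCCL\((.*?)\) ran for (\d+) milliseconds", line)
    if match is None:
        raise ValueError("line does not look like an NCCL watchdog timeout")
    # Split "key=value" pairs; values stay as strings.
    fields = dict(item.split("=", 1) for item in match.group(1).split(", "))
    fields["ElapsedMs"] = match.group(2)
    return fields


info = parse_watchdog(LOG)

# Ratio of gathered output to per-rank input (assumed to equal the group size).
group_size = int(info["NumelOut"]) // int(info["NumelIn"])

# Configured timeout converted from milliseconds to minutes.
timeout_min = int(info["Timeout(ms)"]) / 60000

print(info["OpType"], group_size, timeout_min)
```

Here the ratio is 2 and the timeout is 10 minutes, so the collective appears to have waited the full timeout for a peer in a two-rank group that never joined the all-gather.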