Hi, I think I know the solution for this problem; it's only tangentially related to #319. I suspect the cause here is one rank completing make_experience ahead of the others, due to the filtering of empty responses.
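To illustrate the failure mode (a hypothetical sketch, not the actual trlx code): if each rank drops its own empty responses locally and then issues one collective call per kept sample, the ranks end up making different numbers of NCCL calls, and the ranks with more samples block until the watchdog timeout.

```python
# Hypothetical sketch of the failure mode, not the actual trlx code.
# If one rank filters out all of its (empty) responses, it issues fewer
# collective calls than its peers, and the peers block in all_reduce
# until the NCCL watchdog timeout fires.
import os
import torch
import torch.distributed as dist

def make_experience(responses):
    # Drop empty responses locally -- the per-rank sample count now diverges.
    kept = [r for r in responses if len(r) > 0]
    stats = []
    for r in kept:
        t = torch.tensor(float(len(r)), device="cuda")
        # One collective per kept sample: a rank that kept fewer samples
        # issues fewer all_reduce calls, so the other ranks hang here.
        dist.all_reduce(t)
        stats.append(t.item())
    return stats

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # Suppose rank 0 happens to generate only empty responses.
    responses = ["", ""] if dist.get_rank() == 0 else ["ok", "fine"]
    make_experience(responses)  # mismatched collectives -> watchdog timeout
    dist.destroy_process_group()
```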
Thanks! That sounds reasonable. So is this a bug, or do I need to make some changes to the PPO training to solve this problem?
@agave233 It is a bug; however, it occurs under rather stochastic conditions and will not be triggered if the model doesn't collapse to empty outputs. You could reduce the learning rate or increase the batch size, if possible, to remedy that.
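For example (a sketch only; the exact TRLConfig fields may differ between trlx versions), you could tweak the PPO config along these lines:

```python
# Hedged sketch: exact TRLConfig field names may differ between trlx versions.
from trlx.data.configs import TRLConfig

config = TRLConfig.load_yaml("configs/ppo_config.yml")
config.optimizer.kwargs["lr"] = 1.0e-6  # lower learning rate
config.train.batch_size = 64            # larger batch size, if memory allows
```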
Thanks for your suggestion. I have tried that, but it does not work.
I'm a bit curious why one process can complete make_experience before the others. Is there no inter-process synchronization mechanism during experience generation?
Hi @agave233, could you post the script you've used and the git commit, so I can reproduce this particular bug? I'm closing in on a fix for it.
Is there no inter-process synchronization mechanism during experience generation?
There was no need for it before, except apparently for corner cases like yours.
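One possible guard (a sketch, not necessarily the fix that landed in trlx) is to agree on a common sample count across ranks before any further collectives, so every rank issues the same number of NCCL calls:

```python
# Sketch of one possible guard, not necessarily the fix that landed in trlx:
# agree on a common per-rank sample count before any further collectives,
# so that every rank issues the same number of NCCL calls.
import torch
import torch.distributed as dist

def synchronized_keep(samples):
    # Assumes the default process group is already initialized.
    n = torch.tensor(len(samples), device="cuda")
    dist.all_reduce(n, op=dist.ReduceOp.MIN)
    # Every rank keeps only as many samples as the rank that kept the fewest,
    # so later collectives stay in lockstep.
    return samples[: int(n.item())]
```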
The timeout problem is resolved with the latest code. Thanks 👍
I am facing the same problem, but the model does not even start training; it seems to time out in some reduce operation. I am trying to train the 1B model with --num_processes 3, using the latest code. Any idea what could be going wrong?
Trace below:
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802544 milliseconds before timing out.
privsec0:2247378:2247601 [0] NCCL INFO comm 0x4e9f0990 rank 2 nranks 3 cudaDev 2 busId 41000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
privsec0:2247377:2247604 [0] NCCL INFO comm 0x4f31ffa0 rank 1 nranks 3 cudaDev 1 busId 23000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802503 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802544 milliseconds before timing out.
[00:13:40] WARNING Sending process 2247376 closing signal SIGTERM api.py:698
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808417 milliseconds before timing out.
privsec0:2247376:2248280 [0] NCCL INFO [Service thread] Connection closed by localRank 0
privsec0:2247376:2248253 [0] NCCL INFO comm 0x45978320 rank 0 nranks 3 cudaDev 0 busId 1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808417 milliseconds before timing out.
@javirandor Hm, it may be that it hangs on the first barrier (given SeqNum=1) here: https://github.com/CarperAI/trlx/blob/9bc08369ca9ec83342c4d7755205dab1a7723006/trlx/trainer/accelerate_base_trainer.py#L65-L66 Try commenting out those lines and giving it another attempt. Also, have you tried running an unmodified existing example on your setup, or does it also fail with the same error?
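As a quick way to separate a setup problem from a trlx problem (a minimal sketch, assuming the same launcher and the usual LOCAL_RANK environment variable), you could also check that a bare all_reduce works across your 3 processes:

```python
# Minimal NCCL sanity check, independent of trlx: run it with the same
# launcher (e.g. accelerate launch / torchrun with 3 processes). If even
# this hangs, the problem is the distributed setup, not make_experience.
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # SeqNum=1 -- the same collective the watchdog reports
print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")
dist.destroy_process_group()
```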
Hello,
I have successfully run the summarize_rlhf code with small SFT and RM models (bloom1b). However, when I try to run a larger model (7B), a timeout error is raised, which is a similar problem to the one described in issue #319, but I cannot find a solution.
My environment: trlx 0.5.0, accelerate 0.17.1, torch 1.13.1.
The error is as follows: