Closed: Jiaxin-Wen closed this issue 1 year ago
I see [rollout 134 / 128]: : 134it [08:45, 3.92s/it]
in the logging output, which is kind of strange. Is this reasonable?
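Not something stated in this thread, but one plausible reason the counter can run past 128: if rollouts are collected chunk by chunk and the stopping condition is only re-checked after a whole chunk has been appended (and a chunk may contribute a variable number of samples), the final count overshoots num_rollouts. A minimal, hypothetical sketch of that pattern (gather_rollouts and collect_chunk are stand-ins, not trlX's actual code):

# Hypothetical sketch: a chunked rollout loop that can overshoot its target.
# collect_chunk is a stand-in for whatever produces one batch of rollouts.
from typing import Callable, List


def gather_rollouts(collect_chunk: Callable[[], List[dict]], num_rollouts: int) -> List[dict]:
    rollouts: List[dict] = []
    while len(rollouts) < num_rollouts:
        # The whole chunk is appended before the condition is re-checked,
        # so the final count can exceed num_rollouts (e.g. 134 vs. 128).
        rollouts.extend(collect_chunk())
    return rollouts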
Which accelerate version and config have you used here? I want to reproduce this.
accelerate version: 0.16.0
accelerate config:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: configs/ds_config_trlx_gptj_summarize.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
deepspeed config:
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true,
    "min_loss_scale": 0.5,
    "fp16_scale_tolerance": 0.25,
    "opt_level": "O2"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_param": {
      "device": "cpu"
    },
    "offload_optimizer": {
      "device": "cpu"
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}
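As a side note (my own arithmetic, not taken from the report): DeepSpeed derives the effective train batch size from these fields, so with the num_processes: 4 from the accelerate config above it works out as follows:

# Effective (global) batch size implied by the DeepSpeed config above,
# assuming the num_processes: 4 from the accelerate config.
train_micro_batch_size_per_gpu = 2
gradient_accumulation_steps = 4
num_processes = 4

train_batch_size = (
    train_micro_batch_size_per_gpu
    * gradient_accumulation_steps
    * num_processes
)
print(train_batch_size)  # 32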
ppo config:
train:
  seq_length: 550
  epochs: 50
  total_steps: 100000
  batch_size: 8
  checkpoint_interval: 10000
  eval_interval: 200
  pipeline: "PromptPipeline"
  trainer: "AcceleratePPOTrainer"

model:
  model_path: "sft/gptj-supervised-summarize-checkpoint"
  num_layers_unfrozen: 8

tokenizer:
  tokenizer_path: "gpt2"
  truncation_side: "right"

optimizer:
  name: "adamw"
  kwargs:
    lr: 5.0e-6
    betas: [0.9, 0.999]
    eps: 1.0e-8
    weight_decay: 0.01

scheduler:
  name: "cosine_annealing"
  kwargs:
    T_max: 100000
    eta_min: 5.0e-6

method:
  name: "ppoconfig"
  num_rollouts: 128
  chunk_size: 16
  ppo_epochs: 4
  init_kl_coef: 0.1
  target: 6
  horizon: 10000
  gamma: 1
  lam: 0.95
  cliprange: 0.2
  cliprange_value: 0.2
  vf_coef: 0.2
  scale_reward: False
  ref_mean: null
  ref_std: null
  cliprange_reward: 10
  gen_kwargs:
    max_new_tokens: 50
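For context, a rough sketch of how a config like this is usually fed into trlX in the summarize example; the YAML path, prompts, and reward_fn below are placeholders rather than the repo's actual code:

# Hedged sketch: loading the PPO config and launching training with trlx.train.
# The config path, prompts, and reward_fn are placeholders, not the repo's code.
import trlx
from trlx.data.configs import TRLConfig

config = TRLConfig.load_yaml("configs/ppo_config_summ_gptj.yml")  # assumed path


def reward_fn(samples, **kwargs):
    # Placeholder reward: the real example scores summaries with a reward model.
    return [float(len(sample)) for sample in samples]


trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=["POST: ... TL;DR:"],       # placeholder training prompts
    eval_prompts=["POST: ... TL;DR:"],  # placeholder eval prompts
    config=config,
)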
Oops, I think I found the reason. I updated accelerate_base_trainer.py
to the latest version (according to #315).
🐛 Describe the bug
I am running example/summarize_rlhf. I had successfully run the code a few days ago. However, after syncing with the latest version (main branch), I find that the PPO training hangs and raises a timeout error. I haven't found the root cause of this issue, but one modification I am aware of is that the make_experience function used to live in orchestrator/ppo_orchestrator.
Which trlX version are you using?
main (latest)
Additional system and package information
torch 1.13.1