CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
MIT License

Training stuck generating rollouts #485

Closed: javirandor closed this issue 1 year ago

javirandor commented 1 year ago

๐Ÿ› Describe the bug

This is a follow-up from issue #399.

I am facing this same issue even with the updated code.

I am trying to reproduce the HH fine-tuning example on Alpaca.

trlx.train(
    prompts=prompts,
    eval_prompts=eval_prompts,
    reward_fn=reward_fn,
    config=config,
    stop_sequences=["Human:", "human:", "Assistant:", "assistant:"]
)

My code gets stuck while generating the second batch of rollouts (16/64).

I am running the code on a cluster with 8xA100s (80GB) and using a custom reward model. Providing a minimal reproducible example is a bit hard for my current setup. Do you have any pointers that can help me debug this issue?
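
For context, a simplified sketch of the shape of my reward_fn, following the signature trlx passes to reward functions (samples, prompts, outputs); my_reward_model is a placeholder for the actual custom model:

from typing import List

def reward_fn(samples: List[str], prompts: List[str], outputs: List[str], **kwargs) -> List[float]:
    # my_reward_model is a placeholder for the custom reward model
    scores = my_reward_model.score(outputs)
    return [float(s) for s in scores]  # one scalar reward per sample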

My accelerate config, as taken from the repo:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: no
dynamo_config: {}
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false

Which trlX version are you using?

Installed from source at commit 355c974

Additional system and package information

Python 3.9, transformers==4.28.1, torch==2.0.0, accelerate==0.18.0, deepspeed==0.9.1

javirandor commented 1 year ago

The code gets stuck in the gather_dict function in utils/modeling.py, called from make_experience. More specifically, it hangs on the line torch.distributed.all_gather_object(objs, obj).
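
For anyone hitting the same hang: all_gather_object is a collective, so it only returns once every rank in the process group has entered the same call. A minimal sketch of the pattern, assuming an already-initialized process group; if any rank skips the call or is stuck elsewhere, the ranks that did reach it block until the watchdog fires:

import torch.distributed as dist

# Collective call: every rank must reach this line,
# otherwise the ranks that did reach it wait indefinitely.
obj = {"rank": dist.get_rank(), "meta": "..."}
objs = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(objs, obj)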

javirandor commented 1 year ago

I could work around the problem by commenting out the calls to the gather_dict function, since they only build metadata that is not useful for my use case. I would be curious to see whether there is a cleaner solution to this issue. It seems to be a native torch problem, but the solution suggested there did not solve it for me.
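
In case it helps, a slightly cleaner variant than deleting the calls might be to gate them behind a flag that every rank evaluates identically, so either all ranks enter the collective or none does. A sketch, assuming a hypothetical gather_metadata option in the config:

# gather_metadata is a hypothetical flag read identically on every rank,
# so the collective either runs everywhere or is skipped everywhere
if gather_metadata:
    metadata = gather_dict(metadata)
# otherwise keep the local, ungathered metadata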

javirandor commented 1 year ago

I am now facing a timeout in the training loop. The call self.accelerator.backward(loss) in accelerate_base_trainer.py times out; trace below. I checked that all generations are non-empty and that a reward was computed for each of them.

[rollout 64 / 64]: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 64/64 [02:23<00:00,  2.24s/it]
[RANK 0] Starting training
[RANK 0] Evaluating model
[generation sweep 1/1 | eval batch 8/8]: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 8/8 [00:50<00:00,  6.30s/it]
[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation
                                            Evaluation #0 reward/mean: -1.82                                             
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ prompt                                                โ”ƒ output                                               โ”ƒ reward โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
...
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  0%|                                                                                           | 0/6000 [00:00<?, ?it/s]

[2] NCCL INFO Using network Socket
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=316, OpType=REDUCE, Timeout(ms)=1800000) ran for 1804999 milliseconds before timing out.
[0] NCCL INFO comm 0x5604f6165540 rank 1 nranks 2 cudaDev 1 busId 81000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=316, OpType=REDUCE, Timeout(ms)=1800000) ran for 1804999 milliseconds before timing out.
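
For reference, the 1800000 ms above is the default 30-minute collective timeout. With Accelerate it can be raised when the Accelerator is constructed, although trlx builds its Accelerator internally, so this would mean patching accelerate_base_trainer.py, and it only delays the watchdog rather than fixing whichever rank is stuck. A sketch of the Accelerate-level knob:

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the collective timeout from the default 30 minutes to 2 hours
kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[kwargs])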

If I run on a single process with 2 GPUs (one for the trainable model and one for the reward model), I get the following error. I thought it could be useful for identifying the problem.

โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Traceback (most recent call last) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ trlx_training.py:153 in <module>                      โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   150 )                                                                                          โ”‚
โ”‚   151                                                                                            โ”‚
โ”‚   152 print("Launching training")                                                                โ”‚
โ”‚ โฑ 153 trlx.train(                                                                                โ”‚
โ”‚   154 โ”‚   prompts=prompts,                                                                       โ”‚
โ”‚   155 โ”‚   eval_prompts=eval_prompts,                                                             โ”‚
โ”‚   156 โ”‚   reward_fn=reward_fn,                                                                   โ”‚
โ”‚                                                                                                  โ”‚
โ”‚ /trlx/trlx/trlx.py:128 in train                                                      โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   125 โ”‚   )                                                                                      โ”‚
โ”‚   126 โ”‚   trainer.add_eval_pipeline(eval_pipeline)                                               โ”‚
โ”‚   127 โ”‚                                                                                          โ”‚
โ”‚ โฑ 128 โ”‚   trainer.learn()                                                                        โ”‚
โ”‚   129 โ”‚   return trainer                                                                         โ”‚
โ”‚   130                                                                                            โ”‚
โ”‚                                                                                                  โ”‚
โ”‚ /trlx/trlx/trainer/accelerate_base_trainer.py:546 in learn                           โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   543 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   forward_time += time()                                         โ”‚
โ”‚   544 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   backward_time -= time()                                        โ”‚
โ”‚   545 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   print("going to backward", os.environ["RANK"])                 โ”‚
โ”‚ โฑ 546 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   self.accelerator.backward(loss)                                โ”‚
โ”‚   547 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   print("loss backwarded")                                       โ”‚
โ”‚   548 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   backward_time += time()                                        โ”‚
โ”‚   549 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   stats_accum.append(stats)                                      โ”‚
โ”‚                                                                                                  โ”‚
โ”‚/miniconda3/envs/trlx/lib/python3.9/site-packages/accelerate/accelerator.py:16 โ”‚
โ”‚ 77 in backward                                                                                   โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   1674 โ”‚   โ”‚   โ”‚   # deepspeed handles loss scaling by gradient_accumulation_steps in its `back  โ”‚
โ”‚   1675 โ”‚   โ”‚   โ”‚   loss = loss / self.gradient_accumulation_steps                                โ”‚
โ”‚   1676 โ”‚   โ”‚   if self.distributed_type == DistributedType.DEEPSPEED:                            โ”‚
โ”‚ โฑ 1677 โ”‚   โ”‚   โ”‚   self.deepspeed_engine_wrapped.backward(loss, **kwargs)                        โ”‚
โ”‚   1678 โ”‚   โ”‚   elif self.distributed_type == DistributedType.MEGATRON_LM:                        โ”‚
โ”‚   1679 โ”‚   โ”‚   โ”‚   return                                                                        โ”‚
โ”‚   1680 โ”‚   โ”‚   elif self.scaler is not None:                                                     โ”‚
โ”‚                                                                                                  โ”‚
โ”‚ /miniconda3/envs/trlx/lib/python3.9/site-packages/accelerate/utils/deepspeed.p โ”‚
โ”‚ y:176 in backward                                                                                โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   173 โ”‚   โ”‚   # - zero grad                                                                      โ”‚
โ”‚   174 โ”‚   โ”‚   # - checking overflow                                                              โ”‚
โ”‚   175 โ”‚   โ”‚   # - lr_scheduler step (only if engine.lr_scheduler is not None)                    โ”‚
โ”‚ โฑ 176 โ”‚   โ”‚   self.engine.step()                                                                 โ”‚
โ”‚   177 โ”‚   โ”‚   # and this plugin overrides the above calls with no-ops when Accelerate runs und   โ”‚
โ”‚   178 โ”‚   โ”‚   # Deepspeed, but allows normal functionality for non-Deepspeed cases thus enabli   โ”‚
โ”‚   179 โ”‚   โ”‚   # training loop that works transparently under many training regimes.              โ”‚
โ”‚                                                                                                  โ”‚
โ”‚/miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/engine.py: โ”‚
โ”‚ 1988 in step                                                                                     โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   1985 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   and self.quantizer.any_precision_switch()):                           โ”‚
โ”‚   1986 โ”‚   โ”‚   โ”‚   โ”‚   self._take_model_step(lr_kwargs, self.block_eigenvalue)                   โ”‚
โ”‚   1987 โ”‚   โ”‚   โ”‚   else:                                                                         โ”‚
โ”‚ โฑ 1988 โ”‚   โ”‚   โ”‚   โ”‚   self._take_model_step(lr_kwargs)                                          โ”‚
โ”‚   1989 โ”‚   โ”‚   โ”‚                                                                                 โ”‚
โ”‚   1990 โ”‚   โ”‚   โ”‚   report_progress = self.global_rank == 0 if self.global_rank else True         โ”‚
โ”‚   1991                                                                                           โ”‚
โ”‚                                                                                                  โ”‚
โ”‚ /miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/engine.py: โ”‚
โ”‚ 1895 in _take_model_step                                                                         โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   1892 โ”‚   โ”‚   โ”‚   โ”‚   # https://nvidia.github.io/apex/advanced.html#gradient-clipping           โ”‚
โ”‚   1893 โ”‚   โ”‚   โ”‚   โ”‚   master_params = amp.master_params(self.optimizer)                         โ”‚
โ”‚   1894 โ”‚   โ”‚   โ”‚   โ”‚   clip_grad_norm_(parameters=master_params, max_norm=self.gradient_clippin  โ”‚
โ”‚ โฑ 1895 โ”‚   โ”‚   self.optimizer.step()                                                             โ”‚
โ”‚   1896 โ”‚   โ”‚                                                                                     โ”‚
โ”‚   1897 โ”‚   โ”‚   if hasattr(self.optimizer, '_global_grad_norm'):                                  โ”‚
โ”‚   1898 โ”‚   โ”‚   โ”‚   self._global_grad_norm = self.optimizer._global_grad_norm                     โ”‚
โ”‚                                                                                                  โ”‚
โ”‚/miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/zero/stage โ”‚
โ”‚ _1_and_2.py:1702 in step                                                                         โ”‚
โ”‚                                                                                                  โ”‚
โ”‚   1699 โ”‚   โ”‚   โ”‚   โ”‚   # create a flat gradients for parameters updated by this process          โ”‚
โ”‚   1700 โ”‚   โ”‚   โ”‚   โ”‚   # If we are last partition, ensure we have same size grads and partition  โ”‚
โ”‚   1701 โ”‚   โ”‚   โ”‚   โ”‚   if partition_id == dist.get_world_size(group=self.real_dp_process_group[  โ”‚
โ”‚ โฑ 1702 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   single_grad_partition = self.flatten_dense_tensors_aligned(           โ”‚
โ”‚   1703 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   self.averaged_gradients[i],                                       โ”‚
โ”‚   1704 โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   โ”‚   int(self.partition_size[i])).to(self.single_partition_of_fp32_gr  โ”‚
โ”‚   1705 โ”‚   โ”‚   โ”‚   โ”‚   else:                                                                     โ”‚
โ”‚                                                                                                  โ”‚
โ”‚ /miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/zero/stage โ”‚
โ”‚ _1_and_2.py:824 in flatten_dense_tensors_aligned                                                 โ”‚
โ”‚                                                                                                  โ”‚
โ”‚    821 โ”‚                                                                                         โ”‚
โ”‚    822 โ”‚   # create a flat tensor aligned at the alignment boundary                              โ”‚
โ”‚    823 โ”‚   def flatten_dense_tensors_aligned(self, tensor_list, alignment):                      โ”‚
โ”‚ โฑ  824 โ”‚   โ”‚   return self.flatten(align_dense_tensors(tensor_list, alignment))                  โ”‚
โ”‚    825 โ”‚                                                                                         โ”‚
โ”‚    826 โ”‚   ############### Independent Partition Gradient ########################               โ”‚
โ”‚    827 โ”‚   def reduce_independent_p_g_buckets_and_remove_grads(self, param, i):                  โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when 
checking argument for argument tensors in method wrapper_CUDA_cat)
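
One thing I would rule out in this two-GPU setup is tensors from the reward model's device leaking into the training side; returning detached, device-agnostic floats from the reward path avoids that. A minimal sketch, with my_reward_model as a placeholder for my custom model:

import torch

with torch.no_grad():
    scores = my_reward_model(outputs)  # placeholder call; tensor lives on cuda:1
rewards = [float(s) for s in scores.detach().cpu()]  # hand trlx plain floats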
maxreciprocate commented 1 year ago

That's a peculiar issue. Do these errors also occur when using the example scripts and/or the example reward models and base models? Also, if you could share your launching script, that might be helpful.

Dahoas commented 1 year ago

@javirandor Any update on this? Does this also happen with example scripts?

maxreciprocate commented 1 year ago

I was not able to reproduce the error, but please do retry with the most recent code if the issue is still relevant.