LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Multi-GPU RL training #2140

Closed ghtaro closed 1 year ago

ghtaro commented 1 year ago

Hi,

I succeeded in running SFT and RM training in a multi-GPU environment.

With the two trained models, I then tried to run RL training, also in a multi-GPU setup, with the following script:

deepspeed --include=localhost:0,1,2,3 --master_port 61000 trainer_rl.py \
--configs defaults_rlhf \
--rank_model $REWARD_MODEL \
--sft_model $SFT_MODEL

I modified config_rl.yaml as follows:

defaults_rlhf:
  dataset:
  eval_size: 4
  rank_model: TODO
  sft_model: TODO
  eval_prompts:
  batch_size: 64
  epochs: 10
  datasets:
    - webgpt:
        val_split: 0.0
        fraction: 1
  cache_dir: .cache
  quantization: false
  seq2seqmodel: false
  output_dir: output
  reward_model_batch_size: 32

debug_rlhf:
  rank_model: /local/home/sanagnos/general/Open-Assistant/model/reward/instructor/facebook/galactica-125m-finetuned/checkpoint-500/
  sft_model: /local/home/sanagnos/general/Open-Assistant/model/model_training/EleutherAI/pythia-70m-deduped-base-finetuned/checkpoint-20/
  batch_size: 2

I also modified ppo_config.yaml, just to add the wandb tracker:

train:
  seq_length: 1024
  epochs: 100
  total_steps: 10000
  batch_size: 1

  checkpoint_interval: 10000
  eval_interval: 100

  pipeline: "PromptPipeline"
  trainer: "AcceleratePPOTrainer"

  tracker: "wandb"

model:
  model_path: "lvwerra/gpt2-imdb"
  num_layers_unfrozen: 2

tokenizer:
  tokenizer_path: "gpt2"
  truncation_side: "right"

optimizer:
  name: "adamw"
  kwargs:
    lr: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-8
    weight_decay: 1.0e-6

scheduler:
  name: "cosine_annealing"
  kwargs:
    T_max: 10000 # train.total_steps
    eta_min: 1.0e-4

method:
  name: "ppoconfig"
  num_rollouts: 16
  chunk_size: 16
  ppo_epochs: 4
  init_kl_coef: 0.05
  target: 6
  horizon: 10000
  gamma: 1
  lam: 0.95
  cliprange: 0.2
  cliprange_value: 0.2
  vf_coef: 1
  scale_reward: False
  ref_mean: null
  ref_std: null
  cliprange_reward: 10
  gen_kwargs:
    max_new_tokens: 40
    top_k: 0
    top_p: 1.0
    do_sample: True

Then I got the following error message. It looks like the eval_prompts are not generated properly, so evaluation fails miserably... (A defensive sketch of the reward function from the traceback follows the log.)

[rollout 16 / 16]:   0%|          | 0/16 [00:20<?, ?it/s]
[rollout 16 / 16]: 100%|██████████| 16/16 [00:20<00:00,  1.27s/it]
[rollout 16 / 16]: 100%|██████████| 16/16 [00:20<00:00,  1.27s/it]
[RANK 0] Starting training
[RANK 0] Evaluating model

[generation sweep 0/1 | eval batch 0/1]:   0%|          | 0/1 [00:00<?, ?it/s]
[generation sweep 1/1 | eval batch 1/1]:   0%|          | 0/1 [00:00<?, ?it/s]
[generation sweep 1/1 | eval batch 1/1]: 100%|██████████| 1/1 [00:00<00:00,  1.93it/s]
[generation sweep 1/1 | eval batch 1/1]: 100%|██████████| 1/1 [00:00<00:00,  1.93it/s]
[RANK 0] Computing rewards
Traceback (most recent call last):
  File "trainer_rl.py", line 95, in <module>
    trainer = trlx.train(
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py", line 119, in train
    trainer.learn()
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 455, in learn
    results = self.evaluate()
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 357, in evaluate
    self.reward_fn(
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "trainer_rl.py", line 69, in rank_model_fn
    inputs = rank_tokenizer(samples, return_tensors="pt", padding=True)
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2523, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2609, in _call_one
    return self.batch_encode_plus(
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2800, in batch_encode_plus
    return self._batch_encode_plus(
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 462, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /dbfs/FileStore/tables/JDD/user/imoto/chatgpt2/model/model_training/trainer_ │
│ rl.py:95 in <module>                                                         │
│                                                                              │
│    92 │   trlx_config.model.model_path = training_conf.sft_model             │
│    93 │   trlx_config.train.batch_size = training_conf.batch_size            │
│    94 │                                                                      │
│ ❱  95 │   trainer = trlx.train(                                              │
│    96 │   │   training_conf.sft_model,                                       │
│    97 │   │   reward_fn=rank_model_fn,                                       │
│    98 │   │   prompts=prompts,                                               │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py:119  │
│ in train                                                                     │
│                                                                              │
│   116 │   eval_pipeline = get_pipeline(config.train.pipeline)(eval_prompts,  │
│   117 │   trainer.add_eval_pipeline(eval_pipeline)                           │
│   118 │                                                                      │
│ ❱ 119 │   trainer.learn()                                                    │
│   120 │   return trainer                                                     │
│   121                                                                        │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/acce │
│ lerate_base_trainer.py:455 in learn                                          │
│                                                                              │
│   452 │   │   │   │   │   │   state = json.load(f)                           │
│   453 │   │   │   │   │   │   self.iter_count = state["iter_count"]          │
│   454 │   │   else:                                                          │
│ ❱ 455 │   │   │   results = self.evaluate()                                  │
│   456 │   │   │   self.accelerator.log(results, step=self.iter_count)        │
│   457 │   │                                                                  │
│   458 │   │   tbar = logging.tqdm(                                           │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/acce │
│ lerate_base_trainer.py:357 in evaluate                                       │
│                                                                              │
│   354 │   │   │   │   if self.reward_fn:                                     │
│   355 │   │   │   │   │   logger.info("Computing rewards")                   │
│   356 │   │   │   │   │   rewards = torch.tensor(                            │
│ ❱ 357 │   │   │   │   │   │   self.reward_fn(                                │
│   358 │   │   │   │   │   │   │   samples=str_samples,                       │
│   359 │   │   │   │   │   │   │   prompts=str_prompts,                       │
│   360 │   │   │   │   │   │   │   outputs=str_outputs,                       │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/autograd/gr │
│ ad_mode.py:27 in decorate_context                                            │
│                                                                              │
│    24 │   │   @functools.wraps(func)                                         │
│    25 │   │   def decorate_context(*args, **kwargs):                         │
│    26 │   │   │   with self.clone():                                         │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                           │
│    28 │   │   return cast(F, decorate_context)                               │
│    29 │                                                                      │
│    30 │   def _wrap_generator(self, func):                                   │
│                                                                              │
│ /dbfs/FileStore/tables/JDD/user/imoto/chatgpt2/model/model_training/trainer_ │
│ rl.py:69 in rank_model_fn                                                    │
│                                                                              │
│    66 │   # TODO sync with reward modelling team on how to do this more tran │
│    67 │   @torch.no_grad()                                                   │
│    68 │   def rank_model_fn(samples, **kwargs):                              │
│ ❱  69 │   │   inputs = rank_tokenizer(samples, return_tensors="pt", padding= │
│    70 │   │   del inputs["token_type_ids"]                                   │
│    71 │   │   return rank_model(**inputs).logits[:, 0].detach().cpu()        │
│    72                                                                        │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/toke │
│ nization_utils_base.py:2523 in __call__                                      │
│                                                                              │
│   2520 │   │   │   # input mode in this case.                                │
│   2521 │   │   │   if not self._in_target_context_manager:                   │
│   2522 │   │   │   │   self._switch_to_input_mode()                          │
│ ❱ 2523 │   │   │   encodings = self._call_one(text=text, text_pair=text_pair │
│   2524 │   │   if text_target is not None:                                   │
│   2525 │   │   │   self._switch_to_target_mode()                             │
│   2526 │   │   │   target_encodings = self._call_one(text=text_target, text_ │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/toke │
│ nization_utils_base.py:2609 in _call_one                                     │
│                                                                              │
│   2606 │   │   │   │   │   f" {len(text_pair)}."                             │
│   2607 │   │   │   │   )                                                     │
│   2608 │   │   │   batch_text_or_text_pairs = list(zip(text, text_pair)) if  │
│ ❱ 2609 │   │   │   return self.batch_encode_plus(                            │
│   2610 │   │   │   │   batch_text_or_text_pairs=batch_text_or_text_pairs,    │
│   2611 │   │   │   │   add_special_tokens=add_special_tokens,                │
│   2612 │   │   │   │   padding=padding,                                      │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/toke │
│ nization_utils_base.py:2800 in batch_encode_plus                             │
│                                                                              │
│   2797 │   │   │   **kwargs,                                                 │
│   2798 │   │   )                                                             │
│   2799 │   │                                                                 │
│ ❱ 2800 │   │   return self._batch_encode_plus(                               │
│   2801 │   │   │   batch_text_or_text_pairs=batch_text_or_text_pairs,        │
│   2802 │   │   │   add_special_tokens=add_special_tokens,                    │
│   2803 │   │   │   padding_strategy=padding_strategy,                        │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/transformers/toke │
│ nization_utils_fast.py:462 in _batch_encode_plus                             │
│                                                                              │
│   459 │   │   # To match each overflowing sample with the original sample in │
│   460 │   │   # we add an overflow_to_sample_mapping array (see below)       │
│   461 │   │   sanitized_tokens = {}                                          │
│ ❱ 462 │   │   for key in tokens_and_encodings[0][0].keys():                  │
│   463 │   │   │   stack = [e for item, _ in tokens_and_encodings for e in it │
│   464 │   │   │   sanitized_tokens[key] = stack                              │
│   465 │   │   sanitized_encodings = [e for _, item in tokens_and_encodings f │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
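
For reference, here is the reward function from the traceback (rank_model_fn in trainer_rl.py) with a defensive guard added. This is only a minimal sketch, not the actual fix: it assumes rank_tokenizer and rank_model are the reward-model tokenizer and model already loaded by the script, and it simply returns an empty tensor when a rank receives an empty samples list, which is what the IndexError raised inside the fast tokenizer suggests is happening here.

import torch

@torch.no_grad()
def rank_model_fn(samples, **kwargs):
    # Guard: under multi-GPU evaluation a rank may be handed an empty list of
    # samples, and the fast tokenizer raises IndexError on an empty batch.
    if len(samples) == 0:
        return torch.empty(0)
    inputs = rank_tokenizer(samples, return_tensors="pt", padding=True)
    # The original code deletes this key; pop() also tolerates it being absent.
    inputs.pop("token_type_ids", None)
    return rank_model(**inputs).logits[:, 0].detach().cpu()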

BTW, I was able to run the RL training on a single GPU:

python trainer_rl.py \
--configs defaults_rlhf \
--rank_model $REWARD_MODEL \
--sft_model $SFT_MODEL

I have been stuck on this for a couple of days already... Any advice on how to sort it out would be very helpful.

sanagno commented 1 year ago

Hi @ghtaro, there were some recent changes in the dataset format. Most likely some additional collators and dataset utils are needed. I will try to get back to you by tomorrow at the latest.

sanagno commented 1 year ago

Have a look at the rl-training branch.

ghtaro commented 1 year ago

Hi @sanagno, thank you very much for the quick support. I had a look at the code and it looks fine, but I would like to run it in my own compute environment.

We have two RM trainers: one in model/model_training and the other in model/reward/instructor/. Do I have to use the new one (in model_training), or is it better to stick with the old one for the moment?

sanagno commented 1 year ago

Better to switch to the new one in model_training; we might have trouble loading pre-trained models otherwise.

ghtaro commented 1 year ago

I have done a quick test.

I will try a pythia model for the RM and retry RL training with it.

If you have time, it would be great if you could support:

ghtaro commented 1 year ago

Hi @sanagno, I was able to run the new RM model on the WebGPT dataset (which I added manually).

I am ready to check whether the RL training runs without errors in a multi-GPU setup. Do you have a reasonable setup for multi-GPU RL training that reduces GPU memory usage?

Previously I used the deepspeed launcher below, but I am not sure whether it is a good setup.

deepspeed --include=localhost:0,1,2,3 --master_port 61000 trainer_rl.py \
--configs defaults_rlhf \
--rank_model $REWARD_MODEL \
--sft_model $SFT_MODEL

sanagno commented 1 year ago

deepspeed is what I am using as well; it seems to work fine for the moment!

ghtaro commented 1 year ago

Just to let you know, I found a bug in https://github.com/LAION-AI/Open-Assistant/blob/73eb615efb0740f41b284730b3e8bce8aa53ccba/model/model_training/custom_datasets/qa_datasets.py#L204. If mode is "rl", it crashes.

ghtaro commented 1 year ago

@sanagno Thanks!

My concern was whether running deepspeed the way I wrote above actually uses ZeRO. I found an accelerate launcher command with ZeRO, like below (a quick runtime check sketch follows at the end of this comment).

accelerate launch \
--config_file configs/default_accelerate_config.yaml \
--num_processes 1 \
--main_process_port 61000 \
trainer_rl.py \
--configs defaults_rlhf pythia_rlhf \
--output_dir $OUT_PATH \

I confirmed that the new RL code runs without error for both the deepspeed and accelerate launchers. Next, I will test with 4 GPUs.
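
As a quick runtime check, the sketch below prints the effective ZeRO stage from inside trainer_rl.py once accelerate/DeepSpeed has been initialized. It is only a sketch, not part of the Open-Assistant code; the AcceleratorState and DeepSpeedPlugin attribute names come from the accelerate library and should be treated as assumptions for the installed version.

from accelerate.state import AcceleratorState

# Inspect the (already initialized) accelerate state to see whether a
# DeepSpeed plugin is attached and, if so, which ZeRO stage it configures.
state = AcceleratorState()
plugin = getattr(state, "deepspeed_plugin", None)
if plugin is None:
    print("DeepSpeed plugin not active, so ZeRO is not being used")
else:
    print("ZeRO stage:", getattr(plugin, "zero_stage", "unknown"))

The DeepSpeed startup log also answers the question directly: the 4-GPU log further down prints zero_enabled ... True and zero_optimization_stage ... 2.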

ghtaro commented 1 year ago

Hi,

I failed to run 4-GPU RL training with almost the same settings as the 1-GPU run. It would be great if you have any idea how to sort this out.

[Log with error message]

A few bizarre things:

[14:28:14] WARNING                                                    run.py:663
                    *****************************************                   
                    Setting OMP_NUM_THREADS environment variable for            
                    each process to be 1 in default, to avoid your              
                    system being overloaded, please further tune the            
                    variable for optimal performance in your                    
                    application as needed.                                      
                    *****************************************                   
Number of trainable parameters: 123M
Number of trainable parameters: 123M
Number of trainable parameters: 123M
Number of trainable parameters: 123M
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)

  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 422.64it/s]
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)

  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 422.51it/s]
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)

  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 417.72it/s]
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)

  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 425.82it/s]
[2023-03-28 14:30:07,117] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[RANK 0] Initializing model: /.../saved_model/checkpoint-200
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
wandb: Currently logged in as:.... Use `wandb login --relogin` to force relogin
wandb: wandb version 0.14.0 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.7
wandb: Run data is saved locally in /.../model/model_training/wandb/run-20230328_143024-39gzhrxa
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run trainer_rl/checkpoint-200/4gpus:unknown
wandb: ⭐️ View project at https://wandb.ai/llm2/trlx
wandb: 🚀 View run at https://wandb.ai/llm2/trlx/runs/39gzhrxa
[2023-03-28 14:30:34,863] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.7, git-hash=unknown, git-branch=unknown
[2023-03-28 14:30:35,532] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-03-28 14:30:35,929] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-03-28 14:30:35,929] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-03-28 14:30:35,938] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-03-28 14:30:35,938] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-28 14:30:35,938] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:140:__init__] Reduce bucket size 500,000,000
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:141:__init__] Allgather bucket size 500000000
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:142:__init__] CPU Offload: True
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:143:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu117/utils...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include/TH -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include/THC -isystem /databricks/conda/envs/pytorch/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /databricks/conda/envs/pytorch/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 16.208775758743286 seconds
Loading extension module utils...
Time to load utils op: 16.23153042793274 seconds
Loading extension module utils...
Time to load utils op: 16.231106758117676 seconds
Loading extension module utils...
Time to load utils op: 16.229982376098633 seconds
Rank: 3 partition count [4] and sizes[(255028226, False)] 
Rank: 2 partition count [4] and sizes[(255028226, False)] 
Rank: 0 partition count [4] and sizes[(255028226, False)] 
Rank: 1 partition count [4] and sizes[(255028226, False)] 
[2023-03-28 14:30:57,454] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2023-03-28 14:30:57,455] [INFO] [utils.py:828:see_memory_usage] MA 4.76 GB         Max_MA 4.76 GB         CA 8.22 GB         Max_CA 8 GB 
[2023-03-28 14:30:57,456] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 31.95 GB, percent = 17.1%
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004322528839111328 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00042057037353515625 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004220008850097656 seconds
[2023-03-28 14:31:01,200] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2023-03-28 14:31:01,201] [INFO] [utils.py:828:see_memory_usage] MA 4.76 GB         Max_MA 4.76 GB         CA 8.22 GB         Max_CA 8 GB 
[2023-03-28 14:31:01,201] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 42.62 GB, percent = 22.8%
[2023-03-28 14:31:01,201] [INFO] [stage_1_and_2.py:525:__init__] optimizer state initialized
[2023-03-28 14:31:01,316] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2023-03-28 14:31:01,317] [INFO] [utils.py:828:see_memory_usage] MA 4.76 GB         Max_MA 4.76 GB         CA 8.22 GB         Max_CA 8 GB 
[2023-03-28 14:31:01,317] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 42.62 GB, percent = 22.8%
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-06], mom=[[0.9, 0.95]]
[2023-03-28 14:31:01,320] [INFO] [config.py:1020:print] DeepSpeedEngine configuration:
[2023-03-28 14:31:01,320] [INFO] [config.py:1024:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print]   amp_enabled .................. False
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print]   amp_params ................... False
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print]   bfloat16_enabled ............. False
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print]   checkpoint_parallel_write_pipeline  False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   checkpoint_tag_validation_enabled  True
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   checkpoint_tag_validation_fail  False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fda91d86eb0>
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   communication_data_type ...... None
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   curriculum_enabled ........... False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   curriculum_params ............ False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print]   dataloader_drop_last ......... False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   disable_allgather ............ False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   dump_state ................... False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   dynamic_loss_scale_args ...... None
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   eigenvalue_enabled ........... False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   eigenvalue_gas_boundary_resolution  1
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   eigenvalue_layer_num ......... 0
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print]   eigenvalue_max_iter .......... 100
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   eigenvalue_stability ......... 1e-06
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   eigenvalue_tol ............... 0.01
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   eigenvalue_verbose ........... False
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   elasticity_enabled ........... False
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   fp16_auto_cast ............... None
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   fp16_enabled ................. False
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print]   fp16_master_weights_and_gradients  False
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   global_rank .................. 0
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   grad_accum_dtype ............. None
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   gradient_accumulation_steps .. 1
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   gradient_clipping ............ 0.0
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   gradient_predivide_factor .... 1.0
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   initial_dynamic_scale ........ 4294967296
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   load_universal_checkpoint .... False
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print]   loss_scale ................... 0
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   memory_breakdown ............. False
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7fda91d86d60>
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   optimizer_legacy_fusion ...... False
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   optimizer_name ............... None
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   optimizer_params ............. None
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   pld_enabled .................. False
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print]   pld_params ................... False
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   prescale_gradients ........... False
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   scheduler_name ............... None
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   scheduler_params ............. None
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   sparse_attention ............. None
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   sparse_gradients_enabled ..... False
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   steps_per_print .............. inf
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   train_batch_size ............. 8
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print]   train_micro_batch_size_per_gpu  2
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print]   use_node_local_storage ....... False
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print]   wall_clock_breakdown ......... False
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print]   world_size ................... 4
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print]   zero_allow_untested_optimizer  True
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print]   zero_enabled ................. True
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print]   zero_optimization_stage ...... 2
[2023-03-28 14:31:01,329] [INFO] [config.py:1009:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 2, 
    "gradient_accumulation_steps": 1, 
    "fp16": {
        "enabled": false, 
        "min_loss_scale": 0.5, 
        "fp16_scale_tolerance": 0.25, 
        "opt_level": "O2", 
        "auto_cast": false
    }, 
    "zero_optimization": {
        "stage": 2, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true
    }, 
    "steps_per_print": inf, 
    "zero_allow_untested_optimizer": true
}
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007214546203613281 seconds
[RANK 0] Collecting rollouts

[rollout 0 / 32]:   0%|          | 0/32 [00:00<?, ?it/s]You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  scores = torch.tensor(
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  scores = torch.tensor(
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  scores = torch.tensor(
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  scores = torch.tensor(

[rollout 2 / 32]:   0%|          | 0/32 [00:02<?, ?it/s]
[rollout 2 / 32]:   6%|▋         | 2/32 [00:02<00:38,  1.29s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[rollout 2 / 32]:   6%|▋         | 2/32 [00:03<00:38,  1.29s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[rollout 2 / 32]:   6%|▋         | 2/32 [00:03<00:38,  1.29s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[rollout 4 / 32]:   6%|▋         | 2/32 [00:04<00:38,  1.29s/it]
[rollout 4 / 32]:  12%|█▎        | 4/32 [00:04<00:27,  1.04it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[rollout 4 / 32]:  12%|█▎        | 4/32 [00:04<00:27,  1.04it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

*** WARNING: skipped 60243 bytes of output ***

[generation sweep 1/1 | eval batch 40/125]:  31%|███       | 39/125 [00:02<00:06, 13.68it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 41/125]:  32%|███▏      | 40/125 [00:02<00:06, 13.68it/s]
[generation sweep 1/1 | eval batch 41/125]:  33%|███▎      | 41/125 [00:02<00:07, 11.86it/s]
[generation sweep 1/1 | eval batch 42/125]:  33%|███▎      | 41/125 [00:02<00:07, 11.86it/s]
[generation sweep 1/1 | eval batch 43/125]:  34%|███▎      | 42/125 [00:03<00:06, 11.86it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 44/125]:  34%|███▍      | 43/125 [00:03<00:06, 11.86it/s]
[generation sweep 1/1 | eval batch 44/125]:  35%|███▌      | 44/125 [00:03<00:05, 14.19it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 45/125]:  35%|███▌      | 44/125 [00:03<00:05, 14.19it/s]
[generation sweep 1/1 | eval batch 46/125]:  36%|███▌      | 45/125 [00:03<00:05, 14.19it/s]
[generation sweep 1/1 | eval batch 46/125]:  37%|███▋      | 46/125 [00:03<00:05, 15.35it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 47/125]:  37%|███▋      | 46/125 [00:03<00:05, 15.35it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 48/125]:  38%|███▊      | 47/125 [00:03<00:05, 15.35it/s]
[generation sweep 1/1 | eval batch 48/125]:  38%|███▊      | 48/125 [00:03<00:05, 15.31it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 49/125]:  38%|███▊      | 48/125 [00:03<00:05, 15.31it/s]
[generation sweep 1/1 | eval batch 50/125]:  39%|███▉      | 49/125 [00:03<00:04, 15.31it/s]
[generation sweep 1/1 | eval batch 50/125]:  40%|████      | 50/125 [00:03<00:04, 16.33it/s]
[generation sweep 1/1 | eval batch 51/125]:  40%|████      | 50/125 [00:03<00:04, 16.33it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 52/125]:  41%|████      | 51/125 [00:03<00:04, 16.33it/s]
[generation sweep 1/1 | eval batch 52/125]:  42%|████▏     | 52/125 [00:03<00:04, 16.92it/s]
[generation sweep 1/1 | eval batch 53/125]:  42%|████▏     | 52/125 [00:03<00:04, 16.92it/s]
[generation sweep 1/1 | eval batch 54/125]:  42%|████▏     | 53/125 [00:03<00:04, 16.92it/s]
[generation sweep 1/1 | eval batch 54/125]:  43%|████▎     | 54/125 [00:03<00:05, 13.80it/s]
[generation sweep 1/1 | eval batch 55/125]:  43%|████▎     | 54/125 [00:03<00:05, 13.80it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 56/125]:  44%|████▍     | 55/125 [00:03<00:05, 13.80it/s]
[generation sweep 1/1 | eval batch 56/125]:  45%|████▍     | 56/125 [00:03<00:06, 11.29it/s]
[generation sweep 1/1 | eval batch 57/125]:  45%|████▍     | 56/125 [00:04<00:06, 11.29it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 58/125]:  46%|████▌     | 57/125 [00:04<00:06, 11.29it/s]
[generation sweep 1/1 | eval batch 58/125]:  46%|████▋     | 58/125 [00:04<00:05, 12.44it/s]
[generation sweep 1/1 | eval batch 59/125]:  46%|████▋     | 58/125 [00:04<00:05, 12.44it/s]
[generation sweep 1/1 | eval batch 60/125]:  47%|████▋     | 59/125 [00:04<00:05, 12.44it/s]
[generation sweep 1/1 | eval batch 60/125]:  48%|████▊     | 60/125 [00:04<00:04, 13.76it/s]
[generation sweep 1/1 | eval batch 61/125]:  48%|████▊     | 60/125 [00:04<00:04, 13.76it/s]
[generation sweep 1/1 | eval batch 62/125]:  49%|████▉     | 61/125 [00:04<00:04, 13.76it/s]
[generation sweep 1/1 | eval batch 62/125]:  50%|████▉     | 62/125 [00:04<00:04, 14.29it/s]
[generation sweep 1/1 | eval batch 63/125]:  50%|████▉     | 62/125 [00:04<00:04, 14.29it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 64/125]:  50%|█████     | 63/125 [00:04<00:04, 14.29it/s]
[generation sweep 1/1 | eval batch 64/125]:  51%|█████     | 64/125 [00:04<00:03, 15.27it/s]
[generation sweep 1/1 | eval batch 65/125]:  51%|█████     | 64/125 [00:04<00:03, 15.27it/s]
[generation sweep 1/1 | eval batch 66/125]:  52%|█████▏    | 65/125 [00:04<00:03, 15.27it/s]
[generation sweep 1/1 | eval batch 66/125]:  53%|█████▎    | 66/125 [00:04<00:03, 15.71it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 67/125]:  53%|█████▎    | 66/125 [00:04<00:03, 15.71it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 68/125]:  54%|█████▎    | 67/125 [00:04<00:03, 15.71it/s]
[generation sweep 1/1 | eval batch 68/125]:  54%|█████▍    | 68/125 [00:04<00:04, 14.02it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 69/125]:  54%|█████▍    | 68/125 [00:04<00:04, 14.02it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 70/125]:  55%|█████▌    | 69/125 [00:05<00:03, 14.02it/s]
[generation sweep 1/1 | eval batch 70/125]:  56%|█████▌    | 70/125 [00:05<00:04, 11.12it/s]
[generation sweep 1/1 | eval batch 71/125]:  56%|█████▌    | 70/125 [00:05<00:04, 11.12it/s]
[generation sweep 1/1 | eval batch 72/125]:  57%|█████▋    | 71/125 [00:05<00:04, 11.12it/s]
[generation sweep 1/1 | eval batch 73/125]:  58%|█████▊    | 72/125 [00:05<00:04, 11.12it/s]
[generation sweep 1/1 | eval batch 73/125]:  58%|█████▊    | 73/125 [00:05<00:03, 14.26it/s]
[generation sweep 1/1 | eval batch 74/125]:  58%|█████▊    | 73/125 [00:05<00:03, 14.26it/s]
[generation sweep 1/1 | eval batch 75/125]:  59%|█████▉    | 74/125 [00:05<00:03, 14.26it/s]
[generation sweep 1/1 | eval batch 75/125]:  60%|██████    | 75/125 [00:05<00:03, 14.86it/s]
[generation sweep 1/1 | eval batch 76/125]:  60%|██████    | 75/125 [00:05<00:03, 14.86it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 77/125]:  61%|██████    | 76/125 [00:05<00:03, 14.86it/s]
[generation sweep 1/1 | eval batch 77/125]:  62%|██████▏   | 77/125 [00:05<00:03, 15.31it/s]
[generation sweep 1/1 | eval batch 78/125]:  62%|██████▏   | 77/125 [00:05<00:03, 15.31it/s]
[generation sweep 1/1 | eval batch 79/125]:  62%|██████▏   | 78/125 [00:05<00:03, 15.31it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 80/125]:  63%|██████▎   | 79/125 [00:05<00:03, 15.31it/s]
[generation sweep 1/1 | eval batch 80/125]:  64%|██████▍   | 80/125 [00:05<00:02, 16.88it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 81/125]:  64%|██████▍   | 80/125 [00:05<00:02, 16.88it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 82/125]:  65%|██████▍   | 81/125 [00:05<00:02, 16.88it/s]
[generation sweep 1/1 | eval batch 82/125]:  66%|██████▌   | 82/125 [00:05<00:02, 16.36it/s]
[generation sweep 1/1 | eval batch 83/125]:  66%|██████▌   | 82/125 [00:05<00:02, 16.36it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 84/125]:  66%|██████▋   | 83/125 [00:05<00:02, 16.36it/s]
[generation sweep 1/1 | eval batch 84/125]:  67%|██████▋   | 84/125 [00:05<00:02, 14.56it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 85/125]:  67%|██████▋   | 84/125 [00:05<00:02, 14.56it/s]
[generation sweep 1/1 | eval batch 86/125]:  68%|██████▊   | 85/125 [00:06<00:02, 14.56it/s]
[generation sweep 1/1 | eval batch 86/125]:  69%|██████▉   | 86/125 [00:06<00:03, 11.75it/s]
[generation sweep 1/1 | eval batch 87/125]:  69%|██████▉   | 86/125 [00:06<00:03, 11.75it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 88/125]:  70%|██████▉   | 87/125 [00:06<00:03, 11.75it/s]
[generation sweep 1/1 | eval batch 88/125]:  70%|███████   | 88/125 [00:06<00:02, 12.37it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 89/125]:  70%|███████   | 88/125 [00:06<00:02, 12.37it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 90/125]:  71%|███████   | 89/125 [00:06<00:02, 12.37it/s]
[generation sweep 1/1 | eval batch 90/125]:  72%|███████▏  | 90/125 [00:06<00:02, 13.09it/s]
[generation sweep 1/1 | eval batch 91/125]:  72%|███████▏  | 90/125 [00:06<00:02, 13.09it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 92/125]:  73%|███████▎  | 91/125 [00:06<00:02, 13.09it/s]
[generation sweep 1/1 | eval batch 92/125]:  74%|███████▎  | 92/125 [00:06<00:02, 13.97it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

[generation sweep 1/1 | eval batch 93/125]:  74%|███████▎  | 92/125 [00:06<00:02, 13.97it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

... (progress output trimmed; the same right-padding warning repeats for most of the remaining eval batches) ...

[generation sweep 1/1 | eval batch 125/125]: 100%|██████████| 125/125 [00:08<00:00, 16.07it/s]
[generation sweep 1/1 | eval batch 125/125]: 100%|██████████| 125/125 [00:08<00:00, 14.47it/s]
[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation
Traceback (most recent call last):
  File "trainer_rl.py", line 119, in <module>
    trainer = trlx.train(
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py", line 119, in train
    trainer.learn()
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 455, in learn
    results = self.evaluate()
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 410, in evaluate
    table_title += f" {k}: {significant(x)}"
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/utils/__init__.py", line 35, in significant
    return round(x, ndigits - int(math.floor(math.log10(abs(x)))))
ValueError: cannot convert float NaN to integer
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /.../model/model_training/traine │
│ r_rl.py:119 in <module>                                                      │
│                                                                              │
│   116 │   trlx_config.method.num_rollouts = int(training_conf.num_rollouts)  │
│   117 │   trlx_config.train.epochs = int(training_conf.epochs)               │
│   118 │                                                                      │
│ ❱ 119 │   trainer = trlx.train(                                              │
│   120 │   │   sft_config.model_name,                                         │
│   121 │   │   reward_fn=rank_model_fn,                                       │
│   122 │   │   prompts=prompts,                                               │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py:119  │
│ in train                                                                     │
│                                                                              │
│   116 │   eval_pipeline = get_pipeline(config.train.pipeline)(eval_prompts,  │
│   117 │   trainer.add_eval_pipeline(eval_pipeline)                           │
│   118 │                                                                      │
│ ❱ 119 │   trainer.learn()                                                    │
│   120 │   return trainer                                                     │
│   121                                                                        │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/acce │
│ lerate_base_trainer.py:455 in learn                                          │
│                                                                              │
│   452 │   │   │   │   │   │   state = json.load(f)                           │
│   453 │   │   │   │   │   │   self.iter_count = state["iter_count"]          │
│   454 │   │   else:                                                          │
│ ❱ 455 │   │   │   results = self.evaluate()                                  │
│   456 │   │   │   self.accelerator.log(results, step=self.iter_count)        │
│   457 │   │                                                                  │
│   458 │   │   tbar = logging.tqdm(                                           │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/acce │
│ lerate_base_trainer.py:410 in evaluate                                       │
│                                                                              │
│   407 │   │   │   table_title = f"Evaluation #{self.nth_evaluation}"         │
│   408 │   │   │   for k, x in stats.items():                                 │
│   409 │   │   │   │   if k.startswith("reward") or k.startswith("metrics"):  │
│ ❱ 410 │   │   │   │   │   table_title += f" {k}: {significant(x)}"           │
│   411 │   │   │                                                              │
│   412 │   │   │   rich_table = Table(*columns, title=table_title, show_lines │
│   413 │   │   │   for ix in range(max(min(3, len(rows)), len(gen_sweep_value │
│                                                                              │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/utils/__init │
│ __.py:35 in significant                                                      │
│                                                                              │
│    32 │   if not isinstance(x, Number) or x == 0:                            │
│    33 │   │   return x                                                       │
│    34 │                                                                      │
│ ❱  35 │   return round(x, ndigits - int(math.floor(math.log10(abs(x)))))     │
│    36                                                                        │
│    37                                                                        │
│    38 def set_seed(seed: int):                                               │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: cannot convert float NaN to integer
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 
wandb: Run history:
wandb:         exp_scores/mean ▁
wandb: exp_scores/running_mean ▁
wandb:  exp_scores/running_std ▁
wandb:          exp_scores/std ▁
wandb:            kl_ctl_value ▁
wandb:                time/exp ▁
wandb:       time/exp_generate ▁
wandb:          time/exp_score ▁
wandb: 
wandb: Run summary:
wandb:         exp_scores/mean -0.42778
wandb: exp_scores/running_mean -0.43954
wandb:  exp_scores/running_std 0.0668
wandb:          exp_scores/std 0.05542
wandb:            kl_ctl_value 0.04
wandb:                time/exp 0.60333
wandb:       time/exp_generate 0.35865
wandb:          time/exp_score 0.02291
wandb: 
wandb: Synced trainer_rl/checkpoint-200/4gpus:unknown: https://wandb.ai/llm2/trlx/runs/39gzhrxa
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)

I've done the following:

[accelerate launcher]

accelerate launch \
--config_file configs/default_accelerate_config.yaml \
--num_processes 4 \
--main_process_port 61000 \
trainer_rl.py \
--configs defaults_rlhf pythia_rlhf \
--output_dir $OUT_PATH \
--batch_size 1 \
--eval_size 500

[default_accelerate_config.yaml]

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: configs/ds_config_trlx_gptj_summarize.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

[ds_config_trlx_gptj_summarize.json]

{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": false,
    "min_loss_scale": 0.5,
    "fp16_scale_tolerance": 0.25,
    "opt_level": "O2"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}

[config_rl]

defaults_rlhf:
  datasets:
  batch_size: 1
  chunk_size: 2
  num_rollouts: 32
  epochs: 1
  datasets_extra: []
  cache_dir: .cache
  output_dir: model_rl
  eval_size: 5
  rank_config:
  sft_config:

oasst_export_latin_cyrillic_rlhf:
  datasets:
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        #top_k: 2
        input_file_path: 2023-03-25_oasst_research_ready_synth_labels.jsonl.gz
  sort_by_length: false
  use_custom_sampler: false

pythia_rlhf:
  datasets:
    - webgpt:
        fraction: 0.05
  rank_config:
    is_reward_model: true
    model_name: /.../saved_model_pythia/
    cache_dir: /home/ubuntu/data_cache/
    pooling: last
    residual_dropout: 0.08172424407561013
    use_flash_attention: false
    half: false

  sft_config:
    is_reward_model: false
    model_name: /.../saved_model/checkpoint-200
    cache_dir: /home/ubuntu/data_cache/
    quantization: false
    seq2seqmodel: false
    freeze_layer:
    residual_dropout: 0.1
    use_flash_attention: false
    half: false

  batch_size: 1

debug_rlhf:
  rank_model: pythia_reward_model/checkpoint-50
  sft_model: pythia_sft/checkpoint-10/
  batch_size: 2
  log_dir: test

[ppo_config]

train:
  seq_length: 520
  epochs: 30
  total_steps: 10000
  batch_size: 18
  checkpoint_interval: 2500
  eval_interval: 500
  pipeline: "PromptPipeline"
  trainer: "CustomPPOTrainer"
  tracker: wandb

model:
  model_path:
  num_layers_unfrozen: -1
  model_arch_type: causal

tokenizer:
  tokenizer_path:
  truncation_side: "right"
  padding_side: "left"

optimizer:
  name: "adamw"
  kwargs:
    lr: 1.0e-6
    betas: [0.9, 0.95]
    eps: 1.0e-8
    weight_decay: 1.0e-2

scheduler:
  name: "cosine_annealing"
  kwargs:
    T_max: 100000 # train.total_steps
    eta_min: 1.0e-4

method:
  name: "ppoconfig"
  num_rollouts: 32
  chunk_size: 8
  ppo_epochs: 4
  init_kl_coef: 0.04
  target: 6
  horizon: 10000
  gamma: 1
  lam: 0.95
  cliprange: 0.2
  cliprange_value: 0.2
  vf_coef: 1
  scale_reward: False
  ref_mean: null
  ref_std: null
  cliprange_reward: 10
  gen_kwargs:
    max_new_tokens: 100
    top_k: 0
    top_p: 0.7
    do_sample: True
    temperature: 0.5
sanagno commented 1 year ago
  1. OMP_NUM_THREADS, I think that is fine
  2. The decoder-only warning should be fixed in the last commit
  3. Are you using the trlx version from the requirements? I haven't experimented with the newer versions too much.
ghtaro commented 1 year ago

I am using "trlx @ git+https://github.com/CarperAI/trlx.git@b91da7b03d8e9fa0c0d6dce10a8f2611aca3013f" as in pyproject file. The only difference would be python version. mine is python3.8 and remove 3.10 specific part (type | None business in dataset code). As long as I use only webgpt, I think I am ok...

ghtaro commented 1 year ago

Hi @sanagno,

I managed to run RL training on 4 GPUs without any error messages by making the modifications below. I just wanted to get rid of the "decoder-only ..." warning.

It would be very helpful if you could tell me whether these changes make sense to you.

  1. In https://github.com/LAION-AI/Open-Assistant/blob/73eb615efb0740f41b284730b3e8bce8aa53ccba/model/model_training/utils/utils.py#L194, add padding_side=conf.padding_side to the tokenizer initialization.

  2. Train the SFT model (pythia-1b) and the RM model (pythia-160m) for my test, with "padding_side=left" added to both config files.

  3. Train the RL model with the two models above.

Also, I could not understand why I still see the same decoder-only warning in the log even though I set padding_side to left for all the models.
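For reference, change 1 above amounts to roughly the following. This is a minimal sketch only, assuming the tokenizer is built by a get_tokenizer(conf)-style helper around a Hugging Face AutoTokenizer; the helper name, config fields, and default are illustrative, not the repo's exact code:

from transformers import AutoTokenizer

def get_tokenizer(conf):
    # Forward padding_side from the training config so that decoder-only
    # models are left-padded for generation, which is what the
    # "right-padding was detected" warning asks for.
    return AutoTokenizer.from_pretrained(
        conf.model_name,
        cache_dir=conf.cache_dir,
        padding_side=getattr(conf, "padding_side", "left"),
    )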

[3/30 Edited]

After looking at some examples in trlx (like https://github.com/CarperAI/trlx/blob/e72f7d1a8008c9a994e9fe465aa4a8a7a1fb3232/examples/summarize_rlhf/trlx_gptj_text_summarization.py#L123), I understand that your implementation is in line with them.

I have not fully understood it yet, but I probably made a mistake somewhere.

I was able to run 4-GPU RL training without any code changes to the repo (apart from https://github.com/LAION-AI/Open-Assistant/issues/2140#issuecomment-1486472455).

Here is my setup:

Here is my accelerate launcher:

accelerate launch \
--config_file configs/default_accelerate_config.yaml \
--num_processes 4 \
--main_process_port 61000 \
trainer_rl.py \
--configs defaults_rlhf pythia_rlhf \
--output_dir $OUT_PATH \
--batch_size 1 \
--eval_size 50 \
--wandb-entity <YOURS>

I still got the "decoder-only ... padding_side='left'" warning, so I am going to dig into it a bit more.
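As a sanity check while digging, I am printing the padding side that the saved tokenizer actually ends up with before generation (the checkpoint path below is a placeholder):

from transformers import AutoTokenizer

# Placeholder path; the warning fires whenever model.generate() receives
# right-padded batches, so this should print "left" for a decoder-only model.
tok = AutoTokenizer.from_pretrained("/path/to/sft/checkpoint")
print(tok.padding_side)

# It can also be forced at runtime if the saved tokenizer_config.json
# still says "right":
tok.padding_side = "left"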

Thank you very much for your advice.

ghtaro commented 1 year ago

It was too early to conclude...

I ran the same script with eval_size=500 and it failed with the following messages...

Traceback (most recent call last):
  File "trainer_rl.py", line 119, in <module>
    trainer = trlx.train(
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py", line 119, in train
    trainer.learn()
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 455, in learn
    results = self.evaluate()
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 410, in evaluate
    table_title += f" {k}: {significant(x)}"
  File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/utils/__init__.py", line 35, in significant
    return round(x, ndigits - int(math.floor(math.log10(abs(x)))))
ValueError: cannot convert float NaN to integer
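The ValueError itself is only a symptom: significant() receives a NaN reward statistic and the log10/int chain cannot digest it. Below is a minimal reproduction; the helper is simplified from the traceback, and the suspected cause (an empty or NaN-containing reward batch on some rank once the 500 eval prompts are split across 4 processes) is an assumption on my side, not confirmed.

import math

def significant(x, ndigits=2):
    # Simplified from trlx/utils/__init__.py as shown in the traceback;
    # the real helper also returns early for non-numbers and zero.
    return round(x, ndigits - int(math.floor(math.log10(abs(x)))))

print(significant(0.05542))       # fine: rounds to a few significant digits
print(significant(float("nan")))  # ValueError: cannot convert float NaN to integer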