Hi @ghtaro, there were some recent changes in the dataset format. Some additional collators and dataset utils are most likely needed. I will try to get back to you by tomorrow at the latest.
Have a look at the rl-training branch
Hi @sanagno, thank you very much for the quick support. I had a look at the code and it looks fine, but I would like to run it in my own computational environment.
We have two RM trainers, one in model/model_training and the other in model/reward/instructor/. Do I have to use the new one (in model_training), or is it better to stick to the old one for the moment?
Better to switch to the new one in model_training; otherwise we might have trouble loading pre-trained models.
I have done a quick test.
|
stuff, so I could not modify them manually...I will try pythia model for RM and retry RL training with it.
If you have time, it would be great if you support:
Hi @sanagno, I was able to run the new RM model on the WebGPT dataset (which I added manually).
I am ready to check whether the RL model runs without errors in a multi-GPU setup. Do you have a reasonable setup for multi-GPU RL training that reduces GPU memory usage?
Previously I used the deepspeed launcher below, but I am not sure whether it is a good setup.
deepspeed --include=localhost:0,1,2,3 --master_port 61000 trainer_rl.py \
--configs defaults_rlhf \
--rank_model $REWARD_MODEL \
--sft_model $SFT_MODEL
deepspeed is what I am using as well, seems to work fine for the moment!
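For the memory question above, a rough sketch of how ZeRO reduces per-GPU memory for model states may help. It assumes the usual mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states); the function name and the example sizes are illustrative only. It ignores activations, CPU offload, and full-precision training, so the numbers are an order-of-magnitude guide rather than a prediction for this run.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate bytes of model state held per GPU for a given ZeRO stage.

    Assumption: mixed-precision Adam with 2 + 2 + 12 bytes per parameter.
    Stage 0 means plain data parallelism (everything replicated).
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also partitions the weights
    return params + grads + optim


# Hypothetical example: a 1B-parameter model on 4 GPUs.
for stage in (0, 1, 2, 3):
    gib = model_state_bytes_per_gpu(1_000_000_000, 4, stage) / 2**30
    print(f"ZeRO stage {stage}: ~{gib:.1f} GiB of model state per GPU")
```

Under these assumptions, stage 2 on 4 GPUs keeps roughly a third of the model-state memory that plain data parallelism would need per GPU.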
Just to let you know, I found a bug in https://github.com/LAION-AI/Open-Assistant/blob/73eb615efb0740f41b284730b3e8bce8aa53ccba/model/model_training/custom_datasets/qa_datasets.py#L204: if mode is rl, it crashes.
@sanagno Thanks!
I was wondering whether launching with deepspeed as I wrote above actually uses ZeRO or not; that was my concern. I found that the accelerate launcher can be used with ZeRO, as below.
accelerate launch \
--config_file configs/default_accelerate_config.yaml \
--num_processes 1 \
--main_process_port 61000 \
trainer_rl.py \
--configs defaults_rlhf pythia_rlhf \
--output_dir $OUT_PATH \
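One way to answer the "does it use ZeRO or not" question is to look at the engine configuration that DeepSpeed prints at startup. The sketch below assumes the log line format seen in the 4-GPU log later in this thread (`zero_enabled` and `zero_optimization_stage` lines emitted by `config.py`); the helper name is made up for illustration and other DeepSpeed versions may format these lines differently.

```python
import re

# Matches DeepSpeed config dump lines such as:
# "[...] [INFO] [config.py:1024:print] zero_optimization_stage ...... 2"
_CONFIG_LINE = re.compile(
    r"\]\s+(zero_enabled|zero_optimization_stage)\s+\.+\s+(\S+)\s*$"
)


def zero_settings_from_log(log_text):
    """Collect the ZeRO-related values from a DeepSpeed startup log."""
    settings = {}
    for line in log_text.splitlines():
        match = _CONFIG_LINE.search(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


# Two lines copied from the 4-GPU log in this thread.
sample = """\
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] zero_enabled ................. True
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] zero_optimization_stage ...... 2
"""
print(zero_settings_from_log(sample))
```

If the returned dictionary is empty, either ZeRO was not configured or the log format differs from the one assumed here.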
I confirmed that the new RL code runs without error with both the deepspeed and the accelerate launchers. Next, I will test with 4 GPUs.
Hi,
I failed to run 4-GPU RL training with almost the same settings as the 1-GPU run. It would be great if you have any idea how to sort this out.
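One setting that changes implicitly between the 1-GPU and 4-GPU runs is the global batch size, which DeepSpeed derives as micro batch size per GPU times gradient accumulation steps times world size. A minimal sketch using the values reported in the log below (`train_micro_batch_size_per_gpu` 2, `gradient_accumulation_steps` 1, `world_size` 4); the function name is hypothetical, and the 1-GPU line assumes the same micro batch size was used there.

```python
def effective_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    """Global batch size per optimizer step under data parallelism."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# 4-GPU run: matches the train_batch_size of 8 printed by DeepSpeed.
print(effective_batch_size(2, 1, 4))

# 1-GPU run, assuming the micro batch size was also 2.
print(effective_batch_size(2, 1, 1))
```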
[Log with error message]
A few bizarre things:
I get the warning "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.", even though ppo_config sets padding_side: "left". Why do we have this warning? Should I fix something to avoid the error?
[14:28:14] WARNING run.py:663
*****************************************
Setting OMP_NUM_THREADS environment variable for
each process to be 1 in default, to avoid your
system being overloaded, please further tune the
variable for optimal performance in your
application as needed.
*****************************************
Number of trainable parameters: 123M
Number of trainable parameters: 123M
Number of trainable parameters: 123M
Number of trainable parameters: 123M
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 422.64it/s]
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 422.51it/s]
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 417.72it/s]
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 425.82it/s]
[2023-03-28 14:30:07,117] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[RANK 0] Initializing model: /.../saved_model/checkpoint-200
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
wandb: Currently logged in as:.... Use `wandb login --relogin` to force relogin
wandb: wandb version 0.14.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.7
wandb: Run data is saved locally in /.../model/model_training/wandb/run-20230328_143024-39gzhrxa
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run trainer_rl/checkpoint-200/4gpus:unknown
wandb: ⭐️ View project at https://wandb.ai/llm2/trlx
wandb: 🚀 View run at https://wandb.ai/llm2/trlx/runs/39gzhrxa
[2023-03-28 14:30:34,863] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.7, git-hash=unknown, git-branch=unknown
[2023-03-28 14:30:35,532] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-03-28 14:30:35,929] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-03-28 14:30:35,929] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-03-28 14:30:35,938] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-03-28 14:30:35,938] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-28 14:30:35,938] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:140:__init__] Reduce bucket size 500,000,000
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:141:__init__] Allgather bucket size 500000000
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:142:__init__] CPU Offload: True
[2023-03-28 14:30:35,939] [INFO] [stage_1_and_2.py:143:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu117/utils...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include/TH -isystem /databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/include/THC -isystem /databricks/conda/envs/pytorch/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /databricks/conda/envs/pytorch/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/databricks/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 16.208775758743286 seconds
Loading extension module utils...
Time to load utils op: 16.23153042793274 seconds
Loading extension module utils...
Time to load utils op: 16.231106758117676 seconds
Loading extension module utils...
Time to load utils op: 16.229982376098633 seconds
Rank: 3 partition count [4] and sizes[(255028226, False)]
Rank: 2 partition count [4] and sizes[(255028226, False)]
Rank: 0 partition count [4] and sizes[(255028226, False)]
Rank: 1 partition count [4] and sizes[(255028226, False)]
[2023-03-28 14:30:57,454] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2023-03-28 14:30:57,455] [INFO] [utils.py:828:see_memory_usage] MA 4.76 GB Max_MA 4.76 GB CA 8.22 GB Max_CA 8 GB
[2023-03-28 14:30:57,456] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 31.95 GB, percent = 17.1%
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004322528839111328 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00042057037353515625 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004220008850097656 seconds
[2023-03-28 14:31:01,200] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2023-03-28 14:31:01,201] [INFO] [utils.py:828:see_memory_usage] MA 4.76 GB Max_MA 4.76 GB CA 8.22 GB Max_CA 8 GB
[2023-03-28 14:31:01,201] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 42.62 GB, percent = 22.8%
[2023-03-28 14:31:01,201] [INFO] [stage_1_and_2.py:525:__init__] optimizer state initialized
[2023-03-28 14:31:01,316] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2023-03-28 14:31:01,317] [INFO] [utils.py:828:see_memory_usage] MA 4.76 GB Max_MA 4.76 GB CA 8.22 GB Max_CA 8 GB
[2023-03-28 14:31:01,317] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 42.62 GB, percent = 22.8%
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-03-28 14:31:01,319] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-06], mom=[[0.9, 0.95]]
[2023-03-28 14:31:01,320] [INFO] [config.py:1020:print] DeepSpeedEngine configuration:
[2023-03-28 14:31:01,320] [INFO] [config.py:1024:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print] amp_enabled .................. False
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print] amp_params ................... False
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print] bfloat16_enabled ............. False
[2023-03-28 14:31:01,321] [INFO] [config.py:1024:print] checkpoint_parallel_write_pipeline False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] checkpoint_tag_validation_enabled True
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] checkpoint_tag_validation_fail False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fda91d86eb0>
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] communication_data_type ...... None
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] curriculum_enabled ........... False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] curriculum_params ............ False
[2023-03-28 14:31:01,322] [INFO] [config.py:1024:print] dataloader_drop_last ......... False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] disable_allgather ............ False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] dump_state ................... False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] dynamic_loss_scale_args ...... None
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] eigenvalue_enabled ........... False
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] eigenvalue_gas_boundary_resolution 1
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] eigenvalue_layer_num ......... 0
[2023-03-28 14:31:01,323] [INFO] [config.py:1024:print] eigenvalue_max_iter .......... 100
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] eigenvalue_stability ......... 1e-06
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] eigenvalue_tol ............... 0.01
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] eigenvalue_verbose ........... False
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] elasticity_enabled ........... False
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] fp16_auto_cast ............... None
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] fp16_enabled ................. False
[2023-03-28 14:31:01,324] [INFO] [config.py:1024:print] fp16_master_weights_and_gradients False
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] global_rank .................. 0
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] grad_accum_dtype ............. None
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] gradient_accumulation_steps .. 1
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] gradient_clipping ............ 0.0
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] gradient_predivide_factor .... 1.0
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] initial_dynamic_scale ........ 4294967296
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] load_universal_checkpoint .... False
[2023-03-28 14:31:01,325] [INFO] [config.py:1024:print] loss_scale ................... 0
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] memory_breakdown ............. False
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7fda91d86d60>
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] optimizer_legacy_fusion ...... False
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] optimizer_name ............... None
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] optimizer_params ............. None
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] pld_enabled .................. False
[2023-03-28 14:31:01,326] [INFO] [config.py:1024:print] pld_params ................... False
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] prescale_gradients ........... False
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] scheduler_name ............... None
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] scheduler_params ............. None
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] sparse_attention ............. None
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] sparse_gradients_enabled ..... False
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] steps_per_print .............. inf
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] train_batch_size ............. 8
[2023-03-28 14:31:01,327] [INFO] [config.py:1024:print] train_micro_batch_size_per_gpu 2
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] use_node_local_storage ....... False
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] wall_clock_breakdown ......... False
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] world_size ................... 4
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] zero_allow_untested_optimizer True
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] zero_enabled ................. True
[2023-03-28 14:31:01,328] [INFO] [config.py:1024:print] zero_optimization_stage ...... 2
[2023-03-28 14:31:01,329] [INFO] [config.py:1009:print_user_config] json = {
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 1,
"fp16": {
"enabled": false,
"min_loss_scale": 0.5,
"fp16_scale_tolerance": 0.25,
"opt_level": "O2",
"auto_cast": false
},
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"contiguous_gradients": true
},
"steps_per_print": inf,
"zero_allow_untested_optimizer": true
}
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007214546203613281 seconds
[RANK 0] Collecting rollouts
[rollout 0 / 32]: 0%| | 0/32 [00:00<?, ?it/s]You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
scores = torch.tensor(
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
scores = torch.tensor(
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
scores = torch.tensor(
/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_ppo_trainer.py:307: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
scores = torch.tensor(
[rollout 2 / 32]: 0%| | 0/32 [00:02<?, ?it/s]
[rollout 2 / 32]: 6%|▋ | 2/32 [00:02<00:38, 1.29s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[rollout 2 / 32]: 6%|▋ | 2/32 [00:03<00:38, 1.29s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[rollout 2 / 32]: 6%|▋ | 2/32 [00:03<00:38, 1.29s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[rollout 4 / 32]: 6%|▋ | 2/32 [00:04<00:38, 1.29s/it]
[rollout 4 / 32]: 12%|█▎ | 4/32 [00:04<00:27, 1.04it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[rollout 4 / 32]: 12%|█▎ | 4/32 [00:04<00:27, 1.04it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
*** WARNING: skipped 60243 bytes of output ***
[generation sweep 1/1 | eval batch 40/125]: 31%|███ | 39/125 [00:02<00:06, 13.68it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 41/125]: 32%|███▏ | 40/125 [00:02<00:06, 13.68it/s]
[generation sweep 1/1 | eval batch 41/125]: 33%|███▎ | 41/125 [00:02<00:07, 11.86it/s]
[generation sweep 1/1 | eval batch 42/125]: 33%|███▎ | 41/125 [00:02<00:07, 11.86it/s]
[generation sweep 1/1 | eval batch 43/125]: 34%|███▎ | 42/125 [00:03<00:06, 11.86it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 44/125]: 34%|███▍ | 43/125 [00:03<00:06, 11.86it/s]
[generation sweep 1/1 | eval batch 44/125]: 35%|███▌ | 44/125 [00:03<00:05, 14.19it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 45/125]: 35%|███▌ | 44/125 [00:03<00:05, 14.19it/s]
[generation sweep 1/1 | eval batch 46/125]: 36%|███▌ | 45/125 [00:03<00:05, 14.19it/s]
[generation sweep 1/1 | eval batch 46/125]: 37%|███▋ | 46/125 [00:03<00:05, 15.35it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 47/125]: 37%|███▋ | 46/125 [00:03<00:05, 15.35it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 48/125]: 38%|███▊ | 47/125 [00:03<00:05, 15.35it/s]
[generation sweep 1/1 | eval batch 48/125]: 38%|███▊ | 48/125 [00:03<00:05, 15.31it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 49/125]: 38%|███▊ | 48/125 [00:03<00:05, 15.31it/s]
[generation sweep 1/1 | eval batch 50/125]: 39%|███▉ | 49/125 [00:03<00:04, 15.31it/s]
[generation sweep 1/1 | eval batch 50/125]: 40%|████ | 50/125 [00:03<00:04, 16.33it/s]
[generation sweep 1/1 | eval batch 51/125]: 40%|████ | 50/125 [00:03<00:04, 16.33it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 52/125]: 41%|████ | 51/125 [00:03<00:04, 16.33it/s]
[generation sweep 1/1 | eval batch 52/125]: 42%|████▏ | 52/125 [00:03<00:04, 16.92it/s]
[generation sweep 1/1 | eval batch 53/125]: 42%|████▏ | 52/125 [00:03<00:04, 16.92it/s]
[generation sweep 1/1 | eval batch 54/125]: 42%|████▏ | 53/125 [00:03<00:04, 16.92it/s]
[generation sweep 1/1 | eval batch 54/125]: 43%|████▎ | 54/125 [00:03<00:05, 13.80it/s]
[generation sweep 1/1 | eval batch 55/125]: 43%|████▎ | 54/125 [00:03<00:05, 13.80it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 56/125]: 44%|████▍ | 55/125 [00:03<00:05, 13.80it/s]
[generation sweep 1/1 | eval batch 56/125]: 45%|████▍ | 56/125 [00:03<00:06, 11.29it/s]
[generation sweep 1/1 | eval batch 57/125]: 45%|████▍ | 56/125 [00:04<00:06, 11.29it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
[generation sweep 1/1 | eval batch 125/125]: 100%|██████████| 125/125 [00:08<00:00, 16.07it/s]
[generation sweep 1/1 | eval batch 125/125]: 100%|██████████| 125/125 [00:08<00:00, 14.47it/s]
[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation
Traceback (most recent call last):
File "trainer_rl.py", line 119, in <module>
trainer = trlx.train(
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py", line 119, in train
trainer.learn()
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 455, in learn
results = self.evaluate()
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 410, in evaluate
table_title += f" {k}: {significant(x)}"
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/utils/__init__.py", line 35, in significant
return round(x, ndigits - int(math.floor(math.log10(abs(x)))))
ValueError: cannot convert float NaN to integer
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /.../model/model_training/traine │
│ r_rl.py:119 in <module> │
│ │
│ 116 │ trlx_config.method.num_rollouts = int(training_conf.num_rollouts) │
│ 117 │ trlx_config.train.epochs = int(training_conf.epochs) │
│ 118 │ │
│ ❱ 119 │ trainer = trlx.train( │
│ 120 │ │ sft_config.model_name, │
│ 121 │ │ reward_fn=rank_model_fn, │
│ 122 │ │ prompts=prompts, │
│ │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py:119 │
│ in train │
│ │
│ 116 │ eval_pipeline = get_pipeline(config.train.pipeline)(eval_prompts, │
│ 117 │ trainer.add_eval_pipeline(eval_pipeline) │
│ 118 │ │
│ ❱ 119 │ trainer.learn() │
│ 120 │ return trainer │
│ 121 │
│ │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/acce │
│ lerate_base_trainer.py:455 in learn │
│ │
│ 452 │ │ │ │ │ │ state = json.load(f) │
│ 453 │ │ │ │ │ │ self.iter_count = state["iter_count"] │
│ 454 │ │ else: │
│ ❱ 455 │ │ │ results = self.evaluate() │
│ 456 │ │ │ self.accelerator.log(results, step=self.iter_count) │
│ 457 │ │ │
│ 458 │ │ tbar = logging.tqdm( │
│ │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/acce │
│ lerate_base_trainer.py:410 in evaluate │
│ │
│ 407 │ │ │ table_title = f"Evaluation #{self.nth_evaluation}" │
│ 408 │ │ │ for k, x in stats.items(): │
│ 409 │ │ │ │ if k.startswith("reward") or k.startswith("metrics"): │
│ ❱ 410 │ │ │ │ │ table_title += f" {k}: {significant(x)}" │
│ 411 │ │ │ │
│ 412 │ │ │ rich_table = Table(*columns, title=table_title, show_lines │
│ 413 │ │ │ for ix in range(max(min(3, len(rows)), len(gen_sweep_value │
│ │
│ /databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/utils/__init │
│ __.py:35 in significant │
│ │
│ 32 │ if not isinstance(x, Number) or x == 0: │
│ 33 │ │ return x │
│ 34 │ │
│ ❱ 35 │ return round(x, ndigits - int(math.floor(math.log10(abs(x))))) │
│ 36 │
│ 37 │
│ 38 def set_seed(seed: int): │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: cannot convert float NaN to integer
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:
wandb: Run history:
wandb: exp_scores/mean ▁
wandb: exp_scores/running_mean ▁
wandb: exp_scores/running_std ▁
wandb: exp_scores/std ▁
wandb: kl_ctl_value ▁
wandb: time/exp ▁
wandb: time/exp_generate ▁
wandb: time/exp_score ▁
wandb:
wandb: Run summary:
wandb: exp_scores/mean -0.42778
wandb: exp_scores/running_mean -0.43954
wandb: exp_scores/running_std 0.0668
wandb: exp_scores/std 0.05542
wandb: kl_ctl_value 0.04
wandb: time/exp 0.60333
wandb: time/exp_generate 0.35865
wandb: time/exp_score 0.02291
wandb:
wandb: Synced trainer_rl/checkpoint-200/4gpus:unknown: https://wandb.ai/llm2/trlx/runs/39gzhrxa
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
I've done the following:
[accelerator launcher]
accelerate launch \
--config_file configs/default_accelerate_config.yaml \
--num_processes 4 \
--main_process_port 61000 \
trainer_rl.py \
--configs defaults_rlhf pythia_rlhf \
--output_dir $OUT_PATH \
--batch_size 1 \
--eval_size 500
[default_accelerate_config.yaml]
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: configs/ds_config_trlx_gptj_summarize.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
[ds_config_trlx_gptj_summarize.json]
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": false,
    "min_loss_scale": 0.5,
    "fp16_scale_tolerance": 0.25,
    "opt_level": "O2"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}
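For reference, DeepSpeed derives the number of samples per optimizer step from the micro-batch size, the gradient accumulation steps, and the number of GPUs. A minimal sketch, assuming the two values from the JSON above and the 4 processes passed to accelerate launch; the helper name effective_batch_size is hypothetical, and whether trlx or accelerate overrides these values at runtime is not checked here:

```python
import json

# Only the two keys needed for the calculation, copied from the config above.
DS_CONFIG = """
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 4
}
"""


def effective_batch_size(ds_config: dict, world_size: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return (
        ds_config["train_micro_batch_size_per_gpu"]
        * ds_config["gradient_accumulation_steps"]
        * world_size
    )


config = json.loads(DS_CONFIG)
# world_size=4 is an assumption taken from --num_processes 4
print(effective_batch_size(config, world_size=4))  # 2 * 4 * 4 = 32
```

This only makes the arithmetic explicit; it does not load DeepSpeed itself.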
[config_rl]
defaults_rlhf:
  datasets:
  batch_size: 1
  chunk_size: 2
  num_rollouts: 32
  epochs: 1
  datasets_extra: []
  cache_dir: .cache
  output_dir: model_rl
  eval_size: 5
  rank_config:
  sft_config:
oasst_export_latin_cyrillic_rlhf:
  datasets:
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        #top_k: 2
        input_file_path: 2023-03-25_oasst_research_ready_synth_labels.jsonl.gz
  sort_by_length: false
  use_custom_sampler: false
pythia_rlhf:
  datasets:
    - webgpt:
        fraction: 0.05
  rank_config:
    is_reward_model: true
    model_name: /.../saved_model_pythia/
    cache_dir: /home/ubuntu/data_cache/
    pooling: last
    residual_dropout: 0.08172424407561013
    use_flash_attention: false
    half: false
  sft_config:
    is_reward_model: false
    model_name: /.../saved_model/checkpoint-200
    cache_dir: /home/ubuntu/data_cache/
    quantization: false
    seq2seqmodel: false
    freeze_layer:
    residual_dropout: 0.1
    use_flash_attention: false
    half: false
  batch_size: 1
debug_rlhf:
  rank_model: pythia_reward_model/checkpoint-50
  sft_model: pythia_sft/checkpoint-10/
  batch_size: 2
  log_dir: test
[ppo_config]
train:
  seq_length: 520
  epochs: 30
  total_steps: 10000
  batch_size: 18
  checkpoint_interval: 2500
  eval_interval: 500
  pipeline: "PromptPipeline"
  trainer: "CustomPPOTrainer"
  tracker: wandb
model:
  model_path:
  num_layers_unfrozen: -1
  model_arch_type: causal
tokenizer:
  tokenizer_path:
  truncation_side: "right"
  padding_side: "left"
optimizer:
  name: "adamw"
  kwargs:
    lr: 1.0e-6
    betas: [0.9, 0.95]
    eps: 1.0e-8
    weight_decay: 1.0e-2
scheduler:
  name: "cosine_annealing"
  kwargs:
    T_max: 100000 # train.total_steps
    eta_min: 1.0e-4
method:
  name: "ppoconfig"
  num_rollouts: 32
  chunk_size: 8
  ppo_epochs: 4
  init_kl_coef: 0.04
  target: 6
  horizon: 10000
  gamma: 1
  lam: 0.95
  cliprange: 0.2
  cliprange_value: 0.2
  vf_coef: 1
  scale_reward: False
  ref_mean: null
  ref_std: null
  cliprange_reward: 10
  gen_kwargs:
    max_new_tokens: 100
    top_k: 0
    top_p: 0.7
    do_sample: True
    temperature: 0.5
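One thing that may be worth checking in the scheduler settings above: eta_min (1.0e-4) is larger than the optimizer lr (1.0e-6). A minimal sketch of the closed-form cosine annealing schedule, assuming "cosine_annealing" follows the usual formula eta_t = eta_min + 0.5 * (lr - eta_min) * (1 + cos(pi * t / T_max)); the function name cosine_annealing_lr is hypothetical and this is not the trlx code:

```python
import math


def cosine_annealing_lr(step: int, base_lr: float, eta_min: float, t_max: int) -> float:
    """Closed-form cosine annealing from base_lr (step 0) to eta_min (step t_max)."""
    cos_term = 1.0 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Values copied from the optimizer and scheduler sections of ppo_config above.
BASE_LR = 1.0e-6
ETA_MIN = 1.0e-4
T_MAX = 100000
TOTAL_STEPS = 10000  # train.total_steps

for step in (0, TOTAL_STEPS, T_MAX):
    print(step, cosine_annealing_lr(step, BASE_LR, ETA_MIN, T_MAX))
```

Under that assumption the learning rate rises from lr towards eta_min over the run instead of decaying, so it may be worth confirming that this is intended.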
I am using "trlx @ git+https://github.com/CarperAI/trlx.git@b91da7b03d8e9fa0c0d6dce10a8f2611aca3013f", as specified in the pyproject file. The only difference should be the Python version: mine is Python 3.8, so I removed the 3.10-specific parts (the `type | None` annotations in the dataset code). As long as I use only webgpt, I think I am OK...
Hi @sanagno,
I managed to run RL training on 4 GPUs without error messages after the following modifications. I just wanted to avoid the "decoder-only ..." warning.
It would be very helpful if you could tell me whether these changes make sense to you.
In https://github.com/LAION-AI/Open-Assistant/blob/73eb615efb0740f41b284730b3e8bce8aa53ccba/model/model_training/utils/utils.py#L194, add padding_side=conf.padding_side.
Train an SFT model (pythia-1b) and an RM model (pythia-160m) for my test, with "padding_side=left" added to both config files.
Train the RL model with the above two models.
Also, I do not understand why I still get the same warning (decoder-only) in the log even though I set padding_side to left for all the models.
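To make the warning concrete, here is a minimal sketch in plain Python (no tokenizer involved) of why the padding side matters for a decoder-only model: generation continues from the last position of each row, and with right-padding that position is a pad token for every prompt shorter than the longest one. The pad id 0, the token ids, and the helper pad_batch are all hypothetical; with a Hugging Face tokenizer the corresponding setting is the padding_side attribute.

```python
from typing import List

PAD_ID = 0  # hypothetical pad token id


def pad_batch(seqs: List[List[int]], pad_id: int = PAD_ID, side: str = "left") -> List[List[int]]:
    """Pad every sequence to the length of the longest one on the given side."""
    width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        padded.append(fill + s if side == "left" else s + fill)
    return padded


prompts = [[11, 12, 13], [21]]  # made-up token ids of two prompts

left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")

# The last column is what the model continues from when generating.
print([row[-1] for row in left])   # [13, 21] -> real prompt tokens
print([row[-1] for row in right])  # [13, 0]  -> second row ends in padding
```

If the warning still shows up after setting padding_side to left in the configs, one thing to check is whether every tokenizer instance used for generation actually receives that setting.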
[3/30 Edited]
After looking at some examples in trlx (like https://github.com/CarperAI/trlx/blob/e72f7d1a8008c9a994e9fe465aa4a8a7a1fb3232/examples/summarize_rlhf/trlx_gptj_text_summarization.py#L123), I understand that this is in line with your implementation.
I do not fully understand it yet, but I probably made a mistake.
I was able to run 4GPU RL training without any code change from the repo (apart from https://github.com/LAION-AI/Open-Assistant/issues/2140#issuecomment-1486472455).
Here is my setup:
Here is my accelerate launch command:
accelerate launch \
--config_file configs/default_accelerate_config.yaml \
--num_processes 4 \
--main_process_port 61000 \
trainer_rl.py \
--configs defaults_rlhf pythia_rlhf \
--output_dir $OUT_PATH \
--batch_size 1 \
--eval_size 50 \
--wandb-entity <YOURS>
I still get the "decoder-only ... padding_side='left'" warning, so I am going to dig a bit more.
Thank you very much for your advice.
It was too early to conclude...
I ran the same script with eval_size=500 and it failed with the following messages:
Traceback (most recent call last):
File "trainer_rl.py", line 119, in <module>
trainer = trlx.train(
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trlx.py", line 119, in train
trainer.learn()
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 455, in learn
results = self.evaluate()
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/trainer/accelerate_base_trainer.py", line 410, in evaluate
table_title += f" {k}: {significant(x)}"
File "/databricks/conda/envs/pytorch/lib/python3.8/site-packages/trlx/utils/__init__.py", line 35, in significant
return round(x, ndigits - int(math.floor(math.log10(abs(x)))))
ValueError: cannot convert float NaN to integer
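The traceback shows that one of the reward or metrics statistics is NaN by the time evaluate() formats the table title, and the rounding helper cannot handle that. A minimal sketch that reproduces the failure with the two lines shown in the traceback (trlx/utils/__init__.py, lines 32 to 35) and adds a guard; significant_safe is a hypothetical helper, not part of trlx, and it only hides the symptom, it does not explain why the statistic is NaN:

```python
import math
from numbers import Number


def significant(x, ndigits=2):
    """Rounding logic as shown in the traceback above."""
    if not isinstance(x, Number) or x == 0:
        return x
    return round(x, ndigits - int(math.floor(math.log10(abs(x)))))


def significant_safe(x, ndigits=2):
    """Hypothetical guard: return NaN or inf unchanged instead of raising."""
    if isinstance(x, float) and not math.isfinite(x):
        return x
    return significant(x, ndigits)


try:
    significant(float("nan"))
except ValueError as err:
    # math.log10(nan) is nan, and math.floor(nan) raises
    print(err)  # cannot convert float NaN to integer

print(significant_safe(float("nan")))  # nan
print(significant_safe(-0.42778))
```

With such a guard the evaluation table would print nan instead of crashing, which makes it easier to see which statistic is affected.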
Hi,
I succeeded in running SFT and RM training in a multi-GPU environment.
With the two trained models, I tried to run RL training again in a multi-GPU setup:
and with the following script.
I modified config_rl.yaml as below:
I also modified ppo_config.yaml, just to add the wandb tracker.
Then I got the following error message. It looks like eval_prompts are not generated properly, so the evaluation fails...
BTW, I was able to run the RL training on a single GPU.
I have been stuck for a couple of days already... Any advice on how to sort it out would be very helpful.