NiuTrans / Vision-LLM-Alignment

This repository contains code for SFT, RLHF, and DPO training of vision-based LLMs, including the LLaVA and LLaMA-3.2-Vision models.

When using LLaMA-3.2-Vision for reward model training, I encounter an issue #9

Closed: hhhhzzzzz closed this issue 2 weeks ago

hhhhzzzzz commented 3 weeks ago

frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3f32cae6fc in /opt/conda/envs/vrm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e95 (0x7f40271b5e95 in /opt/conda/envs/vrm/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x744b (0x7f4030bdf44b in /lib64/libpthread.so.0)
frame #9: clone + 0x3f (0x7f40301d352f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3f319a8f86 in /opt/conda/envs/vrm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3f32937a84 in /opt/conda/envs/vrm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f40271b5e95 in /opt/conda/envs/vrm/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7f4030bdf44b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f40301d352f in /lib64/libc.so.6)

wangclnlp commented 3 weeks ago

This seems to be a problem with the Python environment. Could you please check the latest requirements.txt? Some libraries need to be updated when using LLaMA-3.2-Vision. You can also share the versions of these libraries you are using (including the CUDA version), and I'll check them.
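A minimal way to collect the requested versions (a sketch; run it inside the same environment used to launch training):

    # Print the library and CUDA versions relevant to this issue.
    import torch
    import transformers
    import deepspeed

    print("torch        :", torch.__version__)
    print("CUDA (torch) :", torch.version.cuda)
    print("cuDNN        :", torch.backends.cudnn.version())
    print("transformers :", transformers.__version__)
    print("deepspeed    :", deepspeed.__version__)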

hhhhzzzzz commented 3 weeks ago

Hi,

When I use per_device_train_batch_size=1, it causes the bug.

wangclnlp commented 3 weeks ago

Thanks for your feedback; we will fix it as soon as possible.
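Background on the per_device_train_batch_size=1 case: below is a minimal sketch, assuming the standard pairwise formulation rather than this repository's exact code, of the ranking loss a reward model is trained with. With ranked_candidate_num set to 2, every sample contributes one chosen and one rejected response, so a per-device batch size of 1 still produces two sequences per forward pass.

    import torch
    import torch.nn.functional as F

    def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
        # chosen_rewards / rejected_rewards: one scalar score per response.
        # Bradley-Terry style objective: the chosen response should score higher.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # per_device_train_batch_size = 1 still yields one chosen/rejected pair:
    loss = pairwise_reward_loss(torch.tensor([0.8]), torch.tensor([0.2]))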

hhhhzzzzz commented 3 weeks ago

Also, could you provide the roles for Mistral and LLaMA-3.2 (for reward model training)?

Thanks!

wangclnlp commented 2 weeks ago

> Hi,
>
> When I use per_device_train_batch_size=1, it causes the bug.

We have fixed some bugs related to training a reward model with LLaMA-3.2-Vision. Please update your code to the latest version (git pull). Below is the log from our test. Let us know if you have any further questions.

[2024-10-10 20:10:22,503] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:24,049] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-10-10 20:10:24,049] [INFO] [runner.py:585:main] cmd = /localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=12335 --enable_each_rank_log=None training/reward_model_training/rm_training_main.py --max_seq_len 2048 --image_folder /localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images --template llama-3.2-vision --data_path /localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json --eval_data_path /localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json --dataset_names llava_reward --dataset_samples all --dataset_concatenate_samples 1 --max_num_image_per_sample 8 --lm_reward_model_name_or_path none --vision_reward_model_name_or_path none --gradient_checkpointing --vis_proj baseline --gradient_accumulation_steps 2 --zero_stage 3 --learning_rate 1e-6 --num_warmup_steps 0.1 --per_device_train_batch_size 1 --per_device_eval_batch_size 8 --eval_step 200 --deepspeed --output_dir models/test --num_train_epochs 1 --lang_decoder_update --enable_mmca_attention --model_architecture llama-3.2-vision --trained_reward_model none --save_step 9900 --precision bf16 --ranked_candidate_num 2 --from_checkpoint /localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct
[2024-10-10 20:10:25,115] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:26,611] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-10-10 20:10:26,611] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-10-10 20:10:26,611] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-10-10 20:10:26,611] [INFO] [launch.py:164:main] dist_world_size=8
[2024-10-10 20:10:26,611] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-10-10 20:10:26,611] [INFO] [launch.py:256:main] process 2199283 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=0', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,612] [INFO] [launch.py:256:main] process 2199284 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=1', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,612] [INFO] [launch.py:256:main] process 2199285 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=2', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,613] [INFO] [launch.py:256:main] process 2199286 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=3', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,613] [INFO] [launch.py:256:main] process 2199287 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=4', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,614] [INFO] [launch.py:256:main] process 2199288 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=5', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,614] [INFO] [launch.py:256:main] process 2199289 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=6', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,615] [INFO] [launch.py:256:main] process 2199290 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=7', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:28,222] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,242] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,264] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,264] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,265] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,303] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,320] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,382] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-10-10 20:10:29,733] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:29,733] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-10-10 20:10:30,086] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,091] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,093] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,114] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,114] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,199] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,199] [INFO] [comm.py:652:init_distributed] cdb=None

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:07<00:30,  7.67s/it]
Loading checkpoint shards:  20%|██        | 1/5 [00:08<00:35,  8.76s/it]
Loading checkpoint shards:  20%|██        | 1/5 [00:09<00:36,  9.07s/it]
Loading checkpoint shards:  40%|████      | 2/5 [00:10<00:13,  4.64s/it]
Loading checkpoint shards:  40%|████      | 2/5 [00:11<00:15,  5.31s/it]
Loading checkpoint shards:  40%|████      | 2/5 [00:12<00:16,  5.64s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:12<00:07,  3.68s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:14<00:08,  4.06s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:15<00:08,  4.30s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:15<00:03,  3.16s/it]
Loading checkpoint shards:  20%|██        | 1/5 [00:15<01:00, 15.23s/it]
Loading checkpoint shards:  20%|██        | 1/5 [00:15<01:00, 15.21s/it]
Loading checkpoint shards:  20%|██        | 1/5 [00:15<01:01, 15.30s/it]
Loading checkpoint shards:  20%|██        | 1/5 [00:15<01:01, 15.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:15<00:00,  2.15s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:15<00:00,  3.09s/it]

Loading checkpoint shards:  20%|██        | 1/5 [00:15<01:02, 15.57s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:16<00:03,  3.36s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:16<00:00,  2.30s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:16<00:00,  3.39s/it]

Loading checkpoint shards: 100%|██████████| 5/5 [00:17<00:00,  2.39s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:17<00:00,  3.55s/it]

Loading checkpoint shards:  40%|████      | 2/5 [00:19<00:25,  8.56s/it]
Loading checkpoint shards:  40%|████      | 2/5 [00:19<00:25,  8.59s/it]

/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:10:51,062] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8

/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:10:53,379] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8

/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:10:53,617] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8

Loading checkpoint shards:  60%|██████    | 3/5 [00:22<00:12,  6.35s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:23<00:12,  6.44s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:23<00:12,  6.44s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:23<00:12,  6.41s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:23<00:13,  6.51s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:28<00:05,  5.91s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:28<00:05,  5.95s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:28<00:05,  5.99s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:28<00:06,  6.05s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:28<00:06,  6.05s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00,  3.97s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00,  5.73s/it]

Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00,  4.01s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00,  5.79s/it]

Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00,  4.03s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00,  5.81s/it]

Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00,  4.05s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00,  5.79s/it]

Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00,  4.07s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00,  5.81s/it]

ViRewardModel(
  (v_head): Linear(in_features=4096, out_features=1, bias=False)
  (rwtranrsformer): MllamaForConditionalGeneration(
    (vision_model): MllamaVisionModel(
      (patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), padding=valid, bias=False)
      (gated_positional_embedding): MllamaPrecomputedPositionEmbedding(
        (tile_embedding): Embedding(9, 8197120)
      )
      (pre_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
        (embedding): Embedding(9, 5120)
      )
      (post_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
        (embedding): Embedding(9, 5120)
      )
      (layernorm_pre): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (layernorm_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (transformer): MllamaVisionEncoder(
        (layers): ModuleList(
          (0-31): 32 x MllamaVisionEncoderLayer(
            (self_attn): MllamaVisionSdpaAttention(
              (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
              (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
              (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
              (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
            )
            (mlp): MllamaVisionMLP(
              (activation_fn): GELUActivation()
              (fc1): Linear(in_features=1280, out_features=5120, bias=True)
              (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            )
            (input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
            (post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (global_transformer): MllamaVisionEncoder(
        (layers): ModuleList(
          (0-7): 8 x MllamaVisionEncoderLayer(
            (self_attn): MllamaVisionSdpaAttention(
              (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
              (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
              (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
              (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
            )
            (mlp): MllamaVisionMLP(
              (activation_fn): GELUActivation()
              (fc1): Linear(in_features=1280, out_features=5120, bias=True)
              (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            )
            (input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
            (post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
    )
    (language_model): MllamaForCausalLM(
      (model): MllamaTextModel(
        (embed_tokens): Embedding(128264, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-2): 3 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (3): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (4-7): 4 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (8): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (9-12): 4 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (13): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (14-17): 4 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (18): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (19-22): 4 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (23): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (24-27): 4 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (28): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (29-32): 4 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (33): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (34-37): 4 x MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (38): MllamaCrossAttentionDecoderLayer(
            (cross_attn): MllamaTextCrossSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
              (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
          (39): MllamaSelfAttentionDecoderLayer(
            (self_attn): MllamaTextSelfSdpaAttention(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): MllamaTextMLP(
              (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): MllamaTextRMSNorm((4096,), eps=1e-05)
        (rotary_emb): MllamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
    )
    (multi_modal_projector): Linear(in_features=7680, out_features=4096, bias=True)
  )
)
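The dump above shows the reward model as a scalar v_head on top of MllamaForConditionalGeneration. A minimal sketch (not the repository's exact ViRewardModel implementation) of how such a scoring head can be attached to a vision-language backbone:

    import torch
    import torch.nn as nn

    class SimpleVisionRewardModel(nn.Module):
        """Illustrative wrapper: a bias-free Linear(hidden_size, 1) reward head
        on top of a backbone such as MllamaForConditionalGeneration."""

        def __init__(self, backbone, hidden_size: int = 4096):
            super().__init__()
            self.backbone = backbone
            self.v_head = nn.Linear(hidden_size, 1, bias=False)

        def forward(self, **inputs):
            outputs = self.backbone(**inputs, output_hidden_states=True)
            last_hidden = outputs.hidden_states[-1]   # (batch, seq_len, hidden_size)
            # Simplification: score each sequence by its final token; a real
            # implementation would use the last non-padding token instead.
            return self.v_head(last_hidden[:, -1, :]).squeeze(-1)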

/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:11:04,325] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[DATA] Built dataset llava_reward with all 82132 samples.

[DATA] Built dataset llava_reward with all 1000 samples.
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:11:04,539] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-10-10 20:11:04,539] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2024-10-10 20:11:04,539] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8

/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:11:04,658] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8

/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:11:04,724] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8

/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2024-10-10 20:11:04,863] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,539] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-10-10 20:11:48,541] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-10-10 20:11:48,541] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-10-10 20:11:48,569] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-10-10 20:11:48,569] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'transformers.optimization.AdamW'>
[2024-10-10 20:11:48,569] [WARNING] [engine.py:1232:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-10-10 20:11:48,569] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-10-10 20:11:48,569] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-10-10 20:11:48,574] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,575] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,575] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,577] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,578] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,582] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,586] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,883] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2024-10-10 20:11:48,884] [INFO] [utils.py:782:see_memory_usage] MA 19.87 GB         Max_MA 19.87 GB         CA 20.18 GB         Max_CA 20 GB 
[2024-10-10 20:11:48,884] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 110.29 GB, percent = 10.9%
[2024-10-10 20:11:48,888] [INFO] [stage3.py:164:__init__] Reduce bucket size 500000000
[2024-10-10 20:11:48,888] [INFO] [stage3.py:165:__init__] Prefetch bucket size 0
[2024-10-10 20:11:49,130] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-10-10 20:11:49,130] [INFO] [utils.py:782:see_memory_usage] MA 19.87 GB         Max_MA 19.87 GB         CA 20.18 GB         Max_CA 20 GB 
[2024-10-10 20:11:49,131] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 110.31 GB, percent = 10.9%
[2024-10-10 20:11:49,140] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
Parameter Offload: Total persistent parameters: 809251 in 379 params
[2024-10-10 20:11:49,654] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-10-10 20:11:49,655] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB         Max_MA 19.89 GB         CA 20.34 GB         Max_CA 20 GB 
[2024-10-10 20:11:49,655] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 110.32 GB, percent = 10.9%
[2024-10-10 20:11:51,701] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2024-10-10 20:11:51,702] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB         Max_MA 2.52 GB         CA 20.34 GB         Max_CA 20 GB 
[2024-10-10 20:11:51,702] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 110.34 GB, percent = 11.0%
[2024-10-10 20:11:54,341] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 4
[2024-10-10 20:11:54,342] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB         Max_MA 2.52 GB         CA 4.13 GB         Max_CA 20 GB 
[2024-10-10 20:11:54,342] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 114.46 GB, percent = 11.4%
[2024-10-10 20:11:54,543] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2024-10-10 20:11:54,543] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB         Max_MA 2.52 GB         CA 4.13 GB         Max_CA 4 GB 
[2024-10-10 20:11:54,544] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 110.86 GB, percent = 11.0%
[2024-10-10 20:11:54,755] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2024-10-10 20:11:54,755] [INFO] [utils.py:782:see_memory_usage] MA 7.09 GB         Max_MA 8.11 GB         CA 9.73 GB         Max_CA 10 GB 
[2024-10-10 20:11:54,756] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 109.9 GB, percent = 10.9%
[2024-10-10 20:11:54,943] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-10-10 20:11:54,943] [INFO] [utils.py:782:see_memory_usage] MA 7.09 GB         Max_MA 7.09 GB         CA 9.73 GB         Max_CA 10 GB 
[2024-10-10 20:11:54,943] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 109.9 GB, percent = 10.9%
[2024-10-10 20:11:55,149] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-10-10 20:11:55,149] [INFO] [utils.py:782:see_memory_usage] MA 7.09 GB         Max_MA 10.82 GB         CA 13.46 GB         Max_CA 13 GB 
[2024-10-10 20:11:55,149] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 109.9 GB, percent = 10.9%
[2024-10-10 20:11:55,150] [INFO] [stage3.py:517:_setup_for_real_optimizer] optimizer state initialized

  0%|          | 0/16 [00:00<?, ?it/s]
  0%|          | 0/16 [00:00<?, ?it/s]
  0%|          | 0/16 [00:00<?, ?it/s]
  0%|          | 0/16 [00:00<?, ?it/s]
  0%|          | 0/16 [00:00<?, ?it/s]
  0%|          | 0/16 [00:00<?, ?it/s]
  0%|          | 0/16 [00:00<?, ?it/s]
[2024-10-10 20:11:56,406] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-10-10 20:11:56,407] [INFO] [utils.py:782:see_memory_usage] MA 10.3 GB         Max_MA 12.26 GB         CA 15.42 GB         Max_CA 15 GB 
[2024-10-10 20:11:56,407] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 109.22 GB, percent = 10.8%
[2024-10-10 20:11:56,407] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2024-10-10 20:11:56,408] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-10-10 20:11:56,408] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x155401d7b850>
[2024-10-10 20:11:56,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[2024-10-10 20:11:56,420] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print]   amp_enabled .................. False
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print]   amp_params ................... False
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x155401d7bd90>
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   communication_data_type ...... None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   disable_allgather ............ False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   dump_state ................... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   fp16_enabled ................. False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   global_rank .................. 0
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 2
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   graph_harvesting ............. False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   memory_breakdown ............. False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   optimizer_name ............... None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   optimizer_params ............. None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   pld_enabled .................. False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   pld_params ................... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   prescale_gradients ........... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   scheduler_name ............... None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   scheduler_params ............. None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   sparse_attention ............. None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   steps_per_print .............. 10
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   train_batch_size ............. 16
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   weight_quantization_config ... None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print]   world_size ................... 8
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=0 param_persistence_threshold=10000 model_persistence_threshold=9223372036854775807 max_live_parameters=30000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print]   zero_enabled ................. True
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. False
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 3
[2024-10-10 20:11:56,423] [INFO] [config.py:989:print_user_config]   json = {
    "train_batch_size": 16, 
    "train_micro_batch_size_per_gpu": 1, 
    "steps_per_print": 10, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "none"
        }, 
        "offload_optimizer": {
            "device": "none"
        }, 
        "stage3_param_persistence_threshold": 1.000000e+04, 
        "stage3_max_live_parameters": 3.000000e+07, 
        "stage3_prefetch_bucket_size": 0, 
        "memory_efficient_linear": false
    }, 
    "zero_allow_untested_optimizer": true, 
    "zero_force_ds_cpu_optimizer": false, 
    "fp16": {
        "enabled": false, 
        "loss_scale_window": 100
    }, 
    "bf16": {
        "enabled": true
    }, 
    "gradient_clipping": 1.0, 
    "prescale_gradients": false, 
    "wall_clock_breakdown": false, 
    "hybrid_engine": {
        "enabled": false, 
        "max_out_tokens": 512, 
        "inference_tp_size": 1, 
        "release_inference_cache": false, 
        "pin_parameters": true, 
        "tp_gather_partition_size": 8
    }
}
***** Before training *****
***** Evaluation Begin *****

  0%|          | 0/16 [00:00<?, ?it/s]
  6%|▋         | 1/16 [00:07<01:50,  7.38s/it]
 12%|█▎        | 2/16 [00:15<01:45,  7.54s/it]
 19%|█▉        | 3/16 [00:22<01:35,  7.38s/it]
 25%|██▌       | 4/16 [00:29<01:27,  7.27s/it]
 31%|███▏      | 5/16 [00:36<01:19,  7.23s/it]
 38%|███▊      | 6/16 [00:43<01:12,  7.21s/it]
 44%|████▍     | 7/16 [00:50<01:04,  7.18s/it]
 50%|█████     | 8/16 [00:58<00:57,  7.15s/it]
 56%|█████▋    | 9/16 [01:04<00:50,  7.15s/it]
 62%|██████▎   | 10/16 [01:12<00:42,  7.13s/it]
 69%|██████▉   | 11/16 [01:19<00:35,  7.18s/it]
 75%|███████▌  | 12/16 [01:26<00:28,  7.22s/it]
 81%|████████▏ | 13/16 [01:34<00:21,  7.20s/it]
 88%|████████▊ | 14/16 [01:41<00:14,  7.22s/it]
 94%|█████████▍| 15/16 [01:48<00:07,  7.23s/it]
100%|██████████| 16/16 [01:53<00:00,  7.09s/it]
Eval accuracy: 0.464, Avg of Reward Scores: 0.464
***** Running training *****
Beginning of Epoch 1/1, Total Micro Batches 10266

  0%|          | 0/10266 [00:00<?, ?it/s]
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Epoch 1, Step: 1, Loss: 0.6796875, Accuracy: 1.0
  0%|          | 1/10266 [00:08<23:53:54,  8.38s/it]
Epoch 1, Step: 2, Loss: 0.71875, Accuracy: 0.0
  0%|          | 2/10266 [00:17<24:54:51,  8.74s/it]
Epoch 1, Step: 3, Loss: 0.7291666666666666, Accuracy: 0.0
  0%|          | 3/10266 [00:24<23:16:53,  8.17s/it]
Epoch 1, Step: 4, Loss: 0.703125, Accuracy: 0.0
  0%|          | 4/10266 [00:32<22:35:24,  7.92s/it]
Epoch 1, Step: 5, Loss: 0.7, Accuracy: 1.0
  0%|          | 5/10266 [00:39<22:07:07,  7.76s/it]
Epoch 1, Step: 6, Loss: 0.6927083333333334, Accuracy: 0.0
  0%|          | 6/10266 [00:47<21:54:55,  7.69s/it]
Epoch 1, Step: 7, Loss: 0.6785714285714286, Accuracy: 1.0
  0%|          | 7/10266 [00:54<21:45:20,  7.63s/it]
[2024-10-10 20:14:47,871] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2199283
hhhhzzzzz commented 2 weeks ago

Thank you so much for your help!

https://github.com/NiuTrans/Vision-LLM-Alignment/blob/be1756bcecb6d84585ada832566764bdcc9043bb/data/convert_to_llava_reward_dataset.py#L11-L14

Could you please provide the roles for Mistral and Llama-3.2-Vision?

Thanks!

if-noc commented 2 weeks ago

> Could you please provide the roles for Mistral and Llama-3.2-Vision?

We have updated this script!

The roles used are as follows:

Mistral:

Llama-3.2:

Reference: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2
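For a rough idea of what these templates conventionally look like, here is a minimal sketch based on the two references above. It is illustrative only: the variable names (`MISTRAL_ROLES`, `LLAMA_32_ROLES`, `build_prompt`) are made up for this example, and the exact strings used by convert_to_llava_reward_dataset.py may differ.

```python
# Illustrative sketch only -- not the literal strings from
# convert_to_llava_reward_dataset.py. Role markers follow the conventions in
# the FastChat conversation registry (Mistral) and the Llama 3.2
# prompt-format docs linked above.

# Mistral instruction format: the user turn is wrapped in [INST] ... [/INST],
# and the assistant reply follows the closing tag.
MISTRAL_ROLES = {
    "user": "[INST] ",
    "assistant": " [/INST] ",
}

# Llama 3.2 chat format: each turn is delimited by header tokens and the
# user turn is terminated by <|eot_id|> before the assistant header.
LLAMA_32_ROLES = {
    "user": "<|start_header_id|>user<|end_header_id|>\n\n",
    "assistant": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
}


def build_prompt(question: str, roles: dict) -> str:
    """Wrap a single user question with the given role markers (sketch)."""
    return f"{roles['user']}{question}{roles['assistant']}"


# Example:
# build_prompt("<image>\nWhat is in the picture?", LLAMA_32_ROLES)
```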

hhhhzzzzz commented 2 weeks ago

Thank you so much for your help!