Closed hhhhzzzzz closed 2 weeks ago
This seems to be a problem with the Python environment. Could you please check the latest requirements.txt? Some libraries need to be updated when using llama-3.2-vision. You can also share your used version of these libraries (including the Cuda version), and I'll check them.
Hi,
When I use per_device_train_batch_size=1, it will cause the bug.
Thanks for your feedback, we will fix it as soon as possible.
And could you provide roles for mistral and llama3.2 (reward model training)?
Thanks!
Hi,
When I use per_device_train_batch_size=1, it will cause the bug.
We have fixed some bugs related to training a reward model with LLaMA-3.2-vision. Please update your code using the latest version (git pull
). Below is the log from our test. Let us know if you have any further questions.
[2024-10-10 20:10:22,503] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:24,049] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-10-10 20:10:24,049] [INFO] [runner.py:585:main] cmd = /localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=12335 --enable_each_rank_log=None training/reward_model_training/rm_training_main.py --max_seq_len 2048 --image_folder /localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images --template llama-3.2-vision --data_path /localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json --eval_data_path /localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json --dataset_names llava_reward --dataset_samples all --dataset_concatenate_samples 1 --max_num_image_per_sample 8 --lm_reward_model_name_or_path none --vision_reward_model_name_or_path none --gradient_checkpointing --vis_proj baseline --gradient_accumulation_steps 2 --zero_stage 3 --learning_rate 1e-6 --num_warmup_steps 0.1 --per_device_train_batch_size 1 --per_device_eval_batch_size 8 --eval_step 200 --deepspeed --output_dir models/test --num_train_epochs 1 --lang_decoder_update --enable_mmca_attention --model_architecture llama-3.2-vision --trained_reward_model none --save_step 9900 --precision bf16 --ranked_candidate_num 2 --from_checkpoint /localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct
[2024-10-10 20:10:25,115] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:26,611] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-10-10 20:10:26,611] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-10-10 20:10:26,611] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-10-10 20:10:26,611] [INFO] [launch.py:164:main] dist_world_size=8
[2024-10-10 20:10:26,611] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-10-10 20:10:26,611] [INFO] [launch.py:256:main] process 2199283 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=0', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,612] [INFO] [launch.py:256:main] process 2199284 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=1', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,612] [INFO] [launch.py:256:main] process 2199285 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=2', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,613] [INFO] [launch.py:256:main] process 2199286 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=3', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,613] [INFO] [launch.py:256:main] process 2199287 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=4', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,614] [INFO] [launch.py:256:main] process 2199288 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=5', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,614] [INFO] [launch.py:256:main] process 2199289 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=6', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:26,615] [INFO] [launch.py:256:main] process 2199290 spawned with command: ['/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/bin/python', '-u', 'training/reward_model_training/rm_training_main.py', '--local_rank=7', '--max_seq_len', '2048', '--image_folder', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/images', '--template', 'llama-3.2-vision', '--data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_train.json', '--eval_data_path', '/localnvme/application/sc_new/wangchenglong_56/rlhf_llama_vision/data/RLAIF-V-Dataset/rlaif_v_dataset_test.json', '--dataset_names', 'llava_reward', '--dataset_samples', 'all', '--dataset_concatenate_samples', '1', '--max_num_image_per_sample', '8', '--lm_reward_model_name_or_path', 'none', '--vision_reward_model_name_or_path', 'none', '--gradient_checkpointing', '--vis_proj', 'baseline', '--gradient_accumulation_steps', '2', '--zero_stage', '3', '--learning_rate', '1e-6', '--num_warmup_steps', '0.1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '8', '--eval_step', '200', '--deepspeed', '--output_dir', 'models/test', '--num_train_epochs', '1', '--lang_decoder_update', '--enable_mmca_attention', '--model_architecture', 'llama-3.2-vision', '--trained_reward_model', 'none', '--save_step', '9900', '--precision', 'bf16', '--ranked_candidate_num', '2', '--from_checkpoint', '/localnvme/application/sc_new/wangchenglong_56/base_models/llama-3.2-11b-vision-instruct']
[2024-10-10 20:10:28,222] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,242] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,264] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,264] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,265] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,303] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,320] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 20:10:28,382] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-10-10 20:10:29,733] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:29,733] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-10-10 20:10:30,086] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,091] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,093] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,114] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,114] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,199] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 20:10:30,199] [INFO] [comm.py:652:init_distributed] cdb=None
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:07<00:30, 7.67s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:08<00:35, 8.76s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:09<00:36, 9.07s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:10<00:13, 4.64s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:11<00:15, 5.31s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:12<00:16, 5.64s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:12<00:07, 3.68s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:14<00:08, 4.06s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:15<00:08, 4.30s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:15<00:03, 3.16s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:15<01:00, 15.23s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:15<01:00, 15.21s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:15<01:01, 15.30s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:15<01:01, 15.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:15<00:00, 2.15s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:15<00:00, 3.09s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:15<01:02, 15.57s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:16<00:03, 3.36s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:16<00:00, 2.30s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:16<00:00, 3.39s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:17<00:00, 2.39s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:17<00:00, 3.55s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:19<00:25, 8.56s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:19<00:25, 8.59s/it]
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:10:51,062] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:10:53,379] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:10:53,617] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
Loading checkpoint shards: 60%|██████ | 3/5 [00:22<00:12, 6.35s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:23<00:12, 6.44s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:23<00:12, 6.44s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:23<00:12, 6.41s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:23<00:13, 6.51s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:28<00:05, 5.91s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:28<00:05, 5.95s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:28<00:05, 5.99s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:28<00:06, 6.05s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:28<00:06, 6.05s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00, 3.97s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00, 5.73s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00, 4.01s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00, 5.79s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00, 4.03s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00, 5.81s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00, 4.05s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00, 5.79s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00, 4.07s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:29<00:00, 5.81s/it]
ViRewardModel(
(v_head): Linear(in_features=4096, out_features=1, bias=False)
(rwtranrsformer): MllamaForConditionalGeneration(
(vision_model): MllamaVisionModel(
(patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), padding=valid, bias=False)
(gated_positional_embedding): MllamaPrecomputedPositionEmbedding(
(tile_embedding): Embedding(9, 8197120)
)
(pre_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
(embedding): Embedding(9, 5120)
)
(post_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
(embedding): Embedding(9, 5120)
)
(layernorm_pre): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(layernorm_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(transformer): MllamaVisionEncoder(
(layers): ModuleList(
(0-31): 32 x MllamaVisionEncoderLayer(
(self_attn): MllamaVisionSdpaAttention(
(q_proj): Linear(in_features=1280, out_features=1280, bias=False)
(k_proj): Linear(in_features=1280, out_features=1280, bias=False)
(v_proj): Linear(in_features=1280, out_features=1280, bias=False)
(o_proj): Linear(in_features=1280, out_features=1280, bias=False)
)
(mlp): MllamaVisionMLP(
(activation_fn): GELUActivation()
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
(input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
)
)
)
(global_transformer): MllamaVisionEncoder(
(layers): ModuleList(
(0-7): 8 x MllamaVisionEncoderLayer(
(self_attn): MllamaVisionSdpaAttention(
(q_proj): Linear(in_features=1280, out_features=1280, bias=False)
(k_proj): Linear(in_features=1280, out_features=1280, bias=False)
(v_proj): Linear(in_features=1280, out_features=1280, bias=False)
(o_proj): Linear(in_features=1280, out_features=1280, bias=False)
)
(mlp): MllamaVisionMLP(
(activation_fn): GELUActivation()
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
(input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
)
)
)
)
(language_model): MllamaForCausalLM(
(model): MllamaTextModel(
(embed_tokens): Embedding(128264, 4096, padding_idx=128004)
(layers): ModuleList(
(0-2): 3 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(3): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(4-7): 4 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(8): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(9-12): 4 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(13): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(14-17): 4 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(18): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(19-22): 4 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(23): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(24-27): 4 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(28): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(29-32): 4 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(33): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(34-37): 4 x MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(38): MllamaCrossAttentionDecoderLayer(
(cross_attn): MllamaTextCrossSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
(k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
(39): MllamaSelfAttentionDecoderLayer(
(self_attn): MllamaTextSelfSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): MllamaTextMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
)
)
(norm): MllamaTextRMSNorm((4096,), eps=1e-05)
(rotary_emb): MllamaRotaryEmbedding()
)
(lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
(multi_modal_projector): Linear(in_features=7680, out_features=4096, bias=True)
)
)
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:11:04,325] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[DATA] Built dataset llava_reward with all 82132 samples.
[DATA] Built dataset llava_reward with all 1000 samples.
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:11:04,539] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-10-10 20:11:04,539] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2024-10-10 20:11:04,539] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:11:04,658] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:11:04,724] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
[2024-10-10 20:11:04,863] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,539] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-10-10 20:11:48,541] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-10-10 20:11:48,541] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-10-10 20:11:48,569] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-10-10 20:11:48,569] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'transformers.optimization.AdamW'>
[2024-10-10 20:11:48,569] [WARNING] [engine.py:1232:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-10-10 20:11:48,569] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-10-10 20:11:48,569] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-10-10 20:11:48,574] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,575] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,575] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,577] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,578] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,582] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,586] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-10-10 20:11:48,883] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2024-10-10 20:11:48,884] [INFO] [utils.py:782:see_memory_usage] MA 19.87 GB Max_MA 19.87 GB CA 20.18 GB Max_CA 20 GB
[2024-10-10 20:11:48,884] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 110.29 GB, percent = 10.9%
[2024-10-10 20:11:48,888] [INFO] [stage3.py:164:__init__] Reduce bucket size 500000000
[2024-10-10 20:11:48,888] [INFO] [stage3.py:165:__init__] Prefetch bucket size 0
[2024-10-10 20:11:49,130] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-10-10 20:11:49,130] [INFO] [utils.py:782:see_memory_usage] MA 19.87 GB Max_MA 19.87 GB CA 20.18 GB Max_CA 20 GB
[2024-10-10 20:11:49,131] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 110.31 GB, percent = 10.9%
[2024-10-10 20:11:49,140] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
Parameter Offload: Total persistent parameters: 809251 in 379 params
[2024-10-10 20:11:49,654] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-10-10 20:11:49,655] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB Max_MA 19.89 GB CA 20.34 GB Max_CA 20 GB
[2024-10-10 20:11:49,655] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 110.32 GB, percent = 10.9%
[2024-10-10 20:11:51,701] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2024-10-10 20:11:51,702] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB Max_MA 2.52 GB CA 20.34 GB Max_CA 20 GB
[2024-10-10 20:11:51,702] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 110.34 GB, percent = 11.0%
[2024-10-10 20:11:54,341] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 4
[2024-10-10 20:11:54,342] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB Max_MA 2.52 GB CA 4.13 GB Max_CA 20 GB
[2024-10-10 20:11:54,342] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 114.46 GB, percent = 11.4%
[2024-10-10 20:11:54,543] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2024-10-10 20:11:54,543] [INFO] [utils.py:782:see_memory_usage] MA 2.52 GB Max_MA 2.52 GB CA 4.13 GB Max_CA 4 GB
[2024-10-10 20:11:54,544] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 110.86 GB, percent = 11.0%
[2024-10-10 20:11:54,755] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2024-10-10 20:11:54,755] [INFO] [utils.py:782:see_memory_usage] MA 7.09 GB Max_MA 8.11 GB CA 9.73 GB Max_CA 10 GB
[2024-10-10 20:11:54,756] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 109.9 GB, percent = 10.9%
[2024-10-10 20:11:54,943] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-10-10 20:11:54,943] [INFO] [utils.py:782:see_memory_usage] MA 7.09 GB Max_MA 7.09 GB CA 9.73 GB Max_CA 10 GB
[2024-10-10 20:11:54,943] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 109.9 GB, percent = 10.9%
[2024-10-10 20:11:55,149] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-10-10 20:11:55,149] [INFO] [utils.py:782:see_memory_usage] MA 7.09 GB Max_MA 10.82 GB CA 13.46 GB Max_CA 13 GB
[2024-10-10 20:11:55,149] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 109.9 GB, percent = 10.9%
[2024-10-10 20:11:55,150] [INFO] [stage3.py:517:_setup_for_real_optimizer] optimizer state initialized
0%| | 0/16 [00:00<?, ?it/s]
0%| | 0/16 [00:00<?, ?it/s]
0%| | 0/16 [00:00<?, ?it/s]
0%| | 0/16 [00:00<?, ?it/s]
0%| | 0/16 [00:00<?, ?it/s]
0%| | 0/16 [00:00<?, ?it/s]
0%| | 0/16 [00:00<?, ?it/s][2024-10-10 20:11:56,406] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-10-10 20:11:56,407] [INFO] [utils.py:782:see_memory_usage] MA 10.3 GB Max_MA 12.26 GB CA 15.42 GB Max_CA 15 GB
[2024-10-10 20:11:56,407] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 109.22 GB, percent = 10.8%
[2024-10-10 20:11:56,407] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2024-10-10 20:11:56,408] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-10-10 20:11:56,408] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x155401d7b850>
[2024-10-10 20:11:56,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[2024-10-10 20:11:56,420] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print] amp_enabled .................. False
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print] amp_params ................... False
[2024-10-10 20:11:56,420] [INFO] [config.py:1003:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] bfloat16_enabled ............. True
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x155401d7bd90>
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] communication_data_type ...... None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] dataloader_drop_last ......... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] disable_allgather ............ False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] dump_state ................... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] elasticity_enabled ........... False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] fp16_auto_cast ............... None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] fp16_enabled ................. False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] global_rank .................. 0
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] grad_accum_dtype ............. None
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2
[2024-10-10 20:11:56,421] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] graph_harvesting ............. False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] load_universal_checkpoint .... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] loss_scale ................... 1.0
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] memory_breakdown ............. False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] mics_shard_size .............. -1
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] optimizer_name ............... None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] optimizer_params ............. None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] pld_enabled .................. False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] pld_params ................... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] prescale_gradients ........... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] scheduler_name ............... None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] scheduler_params ............. None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] sparse_attention ............. None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] steps_per_print .............. 10
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] train_batch_size ............. 16
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] use_node_local_storage ....... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] weight_quantization_config ... None
[2024-10-10 20:11:56,422] [INFO] [config.py:1003:print] world_size ................... 8
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print] zero_allow_untested_optimizer True
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=0 param_persistence_threshold=10000 model_persistence_threshold=9223372036854775807 max_live_parameters=30000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print] zero_enabled ................. True
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. False
[2024-10-10 20:11:56,423] [INFO] [config.py:1003:print] zero_optimization_stage ...... 3
[2024-10-10 20:11:56,423] [INFO] [config.py:989:print_user_config] json = {
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 10,
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
},
"stage3_param_persistence_threshold": 1.000000e+04,
"stage3_max_live_parameters": 3.000000e+07,
"stage3_prefetch_bucket_size": 0,
"memory_efficient_linear": false
},
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"fp16": {
"enabled": false,
"loss_scale_window": 100
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false,
"hybrid_engine": {
"enabled": false,
"max_out_tokens": 512,
"inference_tp_size": 1,
"release_inference_cache": false,
"pin_parameters": true,
"tp_gather_partition_size": 8
}
}
***** Before training *****
***** Evaluation Begin *****
0%| | 0/16 [00:00<?, ?it/s]
6%|▋ | 1/16 [00:07<01:50, 7.38s/it]
6%|▋ | 1/16 [00:07<01:56, 7.77s/it]
6%|▋ | 1/16 [00:07<01:56, 7.74s/it]
6%|▋ | 1/16 [00:07<01:56, 7.78s/it]
6%|▋ | 1/16 [00:07<01:56, 7.77s/it]
6%|▋ | 1/16 [00:07<01:56, 7.78s/it]
6%|▋ | 1/16 [00:07<01:56, 7.78s/it]
6%|▋ | 1/16 [00:07<01:56, 7.76s/it]
12%|█▎ | 2/16 [00:15<01:45, 7.54s/it]
12%|█▎ | 2/16 [00:14<01:43, 7.38s/it]
12%|█▎ | 2/16 [00:15<01:45, 7.55s/it]
12%|█▎ | 2/16 [00:15<01:45, 7.54s/it]
12%|█▎ | 2/16 [00:15<01:45, 7.55s/it]
12%|█▎ | 2/16 [00:15<01:45, 7.53s/it]
12%|█▎ | 2/16 [00:15<01:45, 7.54s/it]
12%|█▎ | 2/16 [00:15<01:45, 7.55s/it]
19%|█▉ | 3/16 [00:22<01:35, 7.38s/it]
19%|█▉ | 3/16 [00:22<01:35, 7.38s/it]
19%|█▉ | 3/16 [00:22<01:35, 7.38s/it]
19%|█▉ | 3/16 [00:22<01:35, 7.38s/it]
19%|█▉ | 3/16 [00:22<01:35, 7.38s/it]
19%|█▉ | 3/16 [00:21<01:34, 7.30s/it]
19%|█▉ | 3/16 [00:22<01:35, 7.38s/it]
19%|█▉ | 3/16 [00:22<01:36, 7.39s/it]
25%|██▌ | 4/16 [00:29<01:27, 7.27s/it]
25%|██▌ | 4/16 [00:29<01:27, 7.26s/it]
25%|██▌ | 4/16 [00:29<01:27, 7.27s/it]
25%|██▌ | 4/16 [00:29<01:27, 7.27s/it]
25%|██▌ | 4/16 [00:29<01:26, 7.22s/it]
25%|██▌ | 4/16 [00:29<01:27, 7.27s/it]
25%|██▌ | 4/16 [00:29<01:27, 7.27s/it]
25%|██▌ | 4/16 [00:29<01:27, 7.27s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.23s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.26s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.27s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.26s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.26s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.27s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.27s/it]
31%|███▏ | 5/16 [00:36<01:19, 7.27s/it]
38%|███▊ | 6/16 [00:43<01:12, 7.21s/it]
38%|███▊ | 6/16 [00:43<01:12, 7.21s/it]
38%|███▊ | 6/16 [00:43<01:12, 7.21s/it]
38%|███▊ | 6/16 [00:43<01:12, 7.22s/it]
38%|███▊ | 6/16 [00:43<01:12, 7.21s/it]
38%|███▊ | 6/16 [00:43<01:11, 7.19s/it]
38%|███▊ | 6/16 [00:43<01:12, 7.21s/it]
38%|███▊ | 6/16 [00:43<01:12, 7.22s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.18s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.20s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.20s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.20s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.20s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.20s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.20s/it]
44%|████▍ | 7/16 [00:50<01:04, 7.20s/it]
50%|█████ | 8/16 [00:58<00:57, 7.15s/it]
50%|█████ | 8/16 [00:58<00:57, 7.15s/it]
50%|█████ | 8/16 [00:57<00:57, 7.14s/it]
50%|█████ | 8/16 [00:58<00:57, 7.15s/it]
50%|█████ | 8/16 [00:58<00:57, 7.15s/it]
50%|█████ | 8/16 [00:58<00:57, 7.15s/it]
50%|█████ | 8/16 [00:58<00:57, 7.15s/it]
50%|█████ | 8/16 [00:58<00:57, 7.16s/it]
56%|█████▋ | 9/16 [01:04<00:50, 7.15s/it]
56%|█████▋ | 9/16 [01:05<00:50, 7.16s/it]
56%|█████▋ | 9/16 [01:05<00:50, 7.16s/it]
56%|█████▋ | 9/16 [01:05<00:50, 7.16s/it]
56%|█████▋ | 9/16 [01:05<00:50, 7.16s/it]
56%|█████▋ | 9/16 [01:05<00:50, 7.16s/it]
56%|█████▋ | 9/16 [01:05<00:50, 7.16s/it]
56%|█████▋ | 9/16 [01:05<00:50, 7.16s/it]
62%|██████▎ | 10/16 [01:12<00:42, 7.13s/it]
62%|██████▎ | 10/16 [01:12<00:42, 7.13s/it]
62%|██████▎ | 10/16 [01:12<00:42, 7.13s/it]
62%|██████▎ | 10/16 [01:11<00:42, 7.13s/it]
62%|██████▎ | 10/16 [01:12<00:42, 7.13s/it]
62%|██████▎ | 10/16 [01:12<00:42, 7.13s/it]
62%|██████▎ | 10/16 [01:12<00:42, 7.14s/it]
62%|██████▎ | 10/16 [01:12<00:42, 7.14s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
69%|██████▉ | 11/16 [01:19<00:35, 7.18s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
75%|███████▌ | 12/16 [01:26<00:28, 7.22s/it]
81%|████████▏ | 13/16 [01:34<00:21, 7.20s/it]
81%|████████▏ | 13/16 [01:34<00:21, 7.20s/it]
81%|████████▏ | 13/16 [01:34<00:21, 7.20s/it]
81%|████████▏ | 13/16 [01:34<00:21, 7.20s/it]
81%|████████▏ | 13/16 [01:34<00:21, 7.20s/it]
81%|████████▏ | 13/16 [01:33<00:21, 7.20s/it]
81%|████████▏ | 13/16 [01:34<00:21, 7.20s/it]
81%|████████▏ | 13/16 [01:34<00:21, 7.20s/it]
88%|████████▊ | 14/16 [01:41<00:14, 7.22s/it]
88%|████████▊ | 14/16 [01:40<00:14, 7.22s/it]
88%|████████▊ | 14/16 [01:41<00:14, 7.22s/it]
88%|████████▊ | 14/16 [01:41<00:14, 7.22s/it]
88%|████████▊ | 14/16 [01:41<00:14, 7.22s/it]
88%|████████▊ | 14/16 [01:41<00:14, 7.22s/it]
88%|████████▊ | 14/16 [01:41<00:14, 7.22s/it]
88%|████████▊ | 14/16 [01:41<00:14, 7.22s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
94%|█████████▍| 15/16 [01:48<00:07, 7.23s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
100%|██████████| 16/16 [01:53<00:00, 7.09s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
100%|██████████| 16/16 [01:53<00:00, 7.09s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
100%|██████████| 16/16 [01:53<00:00, 7.09s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
100%|██████████| 16/16 [01:53<00:00, 7.09s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
100%|██████████| 16/16 [01:53<00:00, 7.09s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
100%|██████████| 16/16 [01:53<00:00, 7.09s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
0%| | 0/10266 [00:00<?, ?it/s]
100%|██████████| 16/16 [01:53<00:00, 7.07s/it]
100%|██████████| 16/16 [01:53<00:00, 6.54s/it]
100%|██████████| 16/16 [01:53<00:00, 7.09s/it]
0%| | 0/10266 [00:00<?, ?it/s]
0%| | 0/10266 [00:00<?, ?it/s]Eval accuracy: 0.464, Avg of Reward Scores: 0.464
***** Running training *****
Beginning of Epoch 1/1, Total Micro Batches 10266
0%| | 0/10266 [00:00<?, ?it/s]
0%| | 0/10266 [00:00<?, ?it/s]
0%| | 0/10266 [00:00<?, ?it/s]
0%| | 0/10266 [00:00<?, ?it/s]
0%| | 0/10266 [00:00<?, ?it/s]/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/localnvme/application/sc_new/miniconda3/envs/wcl_rlhf_new/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Epoch 1, Step: 1, Loss: 0.6796875, Accuracy: 1.0
0%| | 1/10266 [00:08<23:53:54, 8.38s/it]
0%| | 1/10266 [00:08<23:52:57, 8.38s/it]
0%| | 1/10266 [00:08<23:55:00, 8.39s/it]
0%| | 1/10266 [00:08<23:54:59, 8.39s/it]
0%| | 1/10266 [00:08<23:55:24, 8.39s/it]
0%| | 1/10266 [00:08<23:51:41, 8.37s/it]
0%| | 1/10266 [00:08<23:54:13, 8.38s/it]
0%| | 1/10266 [00:08<23:52:59, 8.38s/it]Epoch 1, Step: 2, Loss: 0.71875, Accuracy: 0.0
0%| | 2/10266 [00:17<24:54:51, 8.74s/it]
0%| | 2/10266 [00:17<24:54:24, 8.74s/it]
0%| | 2/10266 [00:17<24:54:51, 8.74s/it]
0%| | 2/10266 [00:17<24:54:01, 8.73s/it]
0%| | 2/10266 [00:17<24:54:32, 8.74s/it]
0%| | 2/10266 [00:17<24:55:01, 8.74s/it]
0%| | 2/10266 [00:17<24:54:01, 8.73s/it]
0%| | 2/10266 [00:17<24:53:30, 8.73s/it]Epoch 1, Step: 3, Loss: 0.7291666666666666, Accuracy: 0.0
0%| | 3/10266 [00:24<23:16:53, 8.17s/it]
0%| | 3/10266 [00:24<23:16:48, 8.17s/it]
0%| | 3/10266 [00:24<23:16:48, 8.17s/it]
0%| | 3/10266 [00:24<23:16:21, 8.16s/it]
0%| | 3/10266 [00:24<23:16:37, 8.17s/it]
0%| | 3/10266 [00:24<23:16:33, 8.16s/it]
0%| | 3/10266 [00:24<23:16:04, 8.16s/it]
0%| | 3/10266 [00:24<23:16:21, 8.16s/it]Epoch 1, Step: 4, Loss: 0.703125, Accuracy: 0.0
0%| | 4/10266 [00:32<22:35:24, 7.92s/it]
0%| | 4/10266 [00:32<22:35:12, 7.92s/it]
0%| | 4/10266 [00:32<22:34:54, 7.92s/it]
0%| | 4/10266 [00:32<22:35:05, 7.92s/it]
0%| | 4/10266 [00:32<22:35:05, 7.92s/it]
0%| | 4/10266 [00:32<22:35:21, 7.92s/it]
0%| | 4/10266 [00:32<22:35:15, 7.92s/it]
0%| | 4/10266 [00:32<22:35:22, 7.92s/it]Epoch 1, Step: 5, Loss: 0.7, Accuracy: 1.0
0%| | 5/10266 [00:39<22:07:07, 7.76s/it]
0%| | 5/10266 [00:39<22:06:59, 7.76s/it]
0%| | 5/10266 [00:39<22:07:05, 7.76s/it]
0%| | 5/10266 [00:39<22:07:05, 7.76s/it]
0%| | 5/10266 [00:39<22:07:01, 7.76s/it]
0%| | 5/10266 [00:39<22:06:48, 7.76s/it]
0%| | 5/10266 [00:39<22:06:54, 7.76s/it]
0%| | 5/10266 [00:39<22:06:54, 7.76s/it]Epoch 1, Step: 6, Loss: 0.6927083333333334, Accuracy: 0.0
0%| | 6/10266 [00:47<21:54:55, 7.69s/it]
0%| | 6/10266 [00:47<21:54:50, 7.69s/it]
0%| | 6/10266 [00:47<21:54:54, 7.69s/it]
0%| | 6/10266 [00:47<21:54:54, 7.69s/it]
0%| | 6/10266 [00:47<21:54:47, 7.69s/it]
0%| | 6/10266 [00:47<21:54:51, 7.69s/it]
0%| | 6/10266 [00:47<21:54:47, 7.69s/it]
0%| | 6/10266 [00:47<21:54:43, 7.69s/it]Epoch 1, Step: 7, Loss: 0.6785714285714286, Accuracy: 1.0
0%| | 7/10266 [00:54<21:45:20, 7.63s/it]
0%| | 7/10266 [00:54<21:45:20, 7.63s/it]
0%| | 7/10266 [00:54<21:45:18, 7.63s/it]
0%| | 7/10266 [00:54<21:45:18, 7.63s/it]
0%| | 7/10266 [00:54<21:45:12, 7.63s/it]
0%| | 7/10266 [00:54<21:45:16, 7.63s/it]
0%| | 7/10266 [00:54<21:45:21, 7.63s/it]
0%| | 7/10266 [00:54<21:45:16, 7.63s/it][2024-10-10 20:14:47,871] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2199283
Thank you so much for your help!
Could you please provide the roles for mistral and llama3.2-version?
Thanks!
Thank you so much for your help!
Could you please provide the roles for mistral and llama3.2-version?
Thanks!
We have updated this script!
The roles used are as follows:
Mistral:
Llama-3.2:
Reference: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2
Thank you so much for your help!
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3f32cae6fc in /opt/conda/envs/vrm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #7: + 0xd3e95 (0x7f40271b5e95 in /opt/conda/envs/vrm/bin/../lib/libstdc++.so.6)
frame #8: + 0x744b (0x7f4030bdf44b in /lib64/libpthread.so.0)
frame #9: clone + 0x3f (0x7f40301d352f in /lib64/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3f319a8f86 in /opt/conda/envs/vrm/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe5aa84 (0x7f3f32937a84 in /opt/conda/envs/vrm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7f40271b5e95 in /opt/conda/envs/vrm/bin/../lib/libstdc++.so.6)
frame #3: + 0x744b (0x7f4030bdf44b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f40301d352f in /lib64/libc.so.6)