2U1 / Qwen2-VL-Finetune

An open-source implementation for fine-tuning the Qwen2-VL series from Alibaba Cloud.
Apache License 2.0

Shape Mismatch #12

Open mano3-1 opened 3 weeks ago

mano3-1 commented 3 weeks ago

Hi, thanks for such a wonderful repo. I was trying to train the model on a custom dataset using the LoRA script and got the error below:

[2024-10-29 17:59:25,985] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:27,593] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-10-29 17:59:27,593] [INFO] [runner.py:607:main] cmd = /home/mk.thomas/miniconda3/envs/qwen2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/training/train.py --lora_enable True --lora_namespan_exclude ['lm_head', 'embed_tokens'] --lora_rank 64 --lora_alpha 128 --lora_dropout 0.05 --num_lora_modules -1 --deepspeed scripts/zero2.json --model_id Qwen/Qwen2-VL-7B-Instruct --data_path /home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json --image_folder /home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images --freeze_vision_tower False --freeze_llm False --tune_merger True --bf16 True --fp16 False --disable_flash_attn2 False --output_dir output/testing_lora --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --min_pixels 200704 --max_pixels 1003520 --learning_rate 1e-4 --merger_lr 1e-5 --vision_lr 2e-6 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --gradient_checkpointing True --report_to tensorboard --lazy_preprocess True --save_strategy steps --save_steps 200 --save_total_limit 10 --dataloader_num_workers 4
[2024-10-29 17:59:29,092] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:30,720] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-10-29 17:59:30,720] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-10-29 17:59:30,720] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-10-29 17:59:30,720] [INFO] [launch.py:164:main] dist_world_size=8
[2024-10-29 17:59:30,720] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-10-29 17:59:30,721] [INFO] [launch.py:256:main] process 206519 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=0', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:30,721] [INFO] [launch.py:256:main] process 206520 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=1', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:30,722] [INFO] [launch.py:256:main] process 206521 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=2', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:30,722] [INFO] [launch.py:256:main] process 206522 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=3', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:30,723] [INFO] [launch.py:256:main] process 206523 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=4', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:30,723] [INFO] [launch.py:256:main] process 206524 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=5', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:30,724] [INFO] [launch.py:256:main] process 206525 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=6', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:30,725] [INFO] [launch.py:256:main] process 206526 spawned with command: ['/home/mk.thomas/miniconda3/envs/qwen2/bin/python', '-u', 'src/training/train.py', '--local_rank=7', '--lora_enable', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '64', '--lora_alpha', '128', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', 'Qwen/Qwen2-VL-7B-Instruct', '--data_path', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/result/combined_test.json', '--image_folder', '/home/mk.thomas/llmops/data/ml/qwen-7b-VL/v1/data/images', '--freeze_vision_tower', 'False', '--freeze_llm', 'False', '--tune_merger', 'True', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/testing_lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--min_pixels', '200704', '--max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '4']
[2024-10-29 17:59:36,225] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:36,789] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:36,900] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:36,978] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-29 17:59:37,001] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:37,009] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:37,025] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:37,026] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:37,050] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:59:37,522] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-29 17:59:37,684] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-29 17:59:37,783] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-29 17:59:37,805] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-29 17:59:37,808] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-29 17:59:37,817] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-29 17:59:37,817] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-10-29 17:59:37,823] [INFO] [comm.py:652:init_distributed] cdb=None
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards:   0%|                                                                   | 0/5 [00:00<?, ?it/s]You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards:   0%|                                                                   | 0/5 [00:00<?, ?it/s]`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards:   0%|                                                                   | 0/5 [00:00<?, ?it/s]`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  6.28it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  6.37it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  6.82it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  6.62it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  6.27it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  5.71it/s]
Found 196 lora modules: ['model.layers.0.self_attn.q_proj', 'model.layers.0.self_attn.k_proj', 'model.layers.0.self_attn.v_proj', 'model.layers.0.self_attn.o_proj', 'model.layers.0.mlp.gate_proj', 'model.layers.0.mlp.up_proj', 'model.layers.0.mlp.down_proj', 'model.layers.1.self_attn.q_proj', 'model.layers.1.self_attn.k_proj', 'model.layers.1.self_attn.v_proj', 'model.layers.1.self_attn.o_proj', 'model.layers.1.mlp.gate_proj', 'model.layers.1.mlp.up_proj', 'model.layers.1.mlp.down_proj', 'model.layers.2.self_attn.q_proj', 'model.layers.2.self_attn.k_proj', 'model.layers.2.self_attn.v_proj', 'model.layers.2.self_attn.o_proj', 'model.layers.2.mlp.gate_proj', 'model.layers.2.mlp.up_proj', 'model.layers.2.mlp.down_proj', 'model.layers.3.self_attn.q_proj', 'model.layers.3.self_attn.k_proj', 'model.layers.3.self_attn.v_proj', 'model.layers.3.self_attn.o_proj', 'model.layers.3.mlp.gate_proj', 'model.layers.3.mlp.up_proj', 'model.layers.3.mlp.down_proj', 'model.layers.4.self_attn.q_proj', 'model.layers.4.self_attn.k_proj', 'model.layers.4.self_attn.v_proj', 'model.layers.4.self_attn.o_proj', 'model.layers.4.mlp.gate_proj', 'model.layers.4.mlp.up_proj', 'model.layers.4.mlp.down_proj', 'model.layers.5.self_attn.q_proj', 'model.layers.5.self_attn.k_proj', 'model.layers.5.self_attn.v_proj', 'model.layers.5.self_attn.o_proj', 'model.layers.5.mlp.gate_proj', 'model.layers.5.mlp.up_proj', 'model.layers.5.mlp.down_proj', 'model.layers.6.self_attn.q_proj', 'model.layers.6.self_attn.k_proj', 'model.layers.6.self_attn.v_proj', 'model.layers.6.self_attn.o_proj', 'model.layers.6.mlp.gate_proj', 'model.layers.6.mlp.up_proj', 'model.layers.6.mlp.down_proj', 'model.layers.7.self_attn.q_proj', 'model.layers.7.self_attn.k_proj', 'model.layers.7.self_attn.v_proj', 'model.layers.7.self_attn.o_proj', 'model.layers.7.mlp.gate_proj', 'model.layers.7.mlp.up_proj', 'model.layers.7.mlp.down_proj', 'model.layers.8.self_attn.q_proj', 'model.layers.8.self_attn.k_proj', 'model.layers.8.self_attn.v_proj', 'model.layers.8.self_attn.o_proj', 'model.layers.8.mlp.gate_proj', 'model.layers.8.mlp.up_proj', 'model.layers.8.mlp.down_proj', 'model.layers.9.self_attn.q_proj', 'model.layers.9.self_attn.k_proj', 'model.layers.9.self_attn.v_proj', 'model.layers.9.self_attn.o_proj', 'model.layers.9.mlp.gate_proj', 'model.layers.9.mlp.up_proj', 'model.layers.9.mlp.down_proj', 'model.layers.10.self_attn.q_proj', 'model.layers.10.self_attn.k_proj', 'model.layers.10.self_attn.v_proj', 'model.layers.10.self_attn.o_proj', 'model.layers.10.mlp.gate_proj', 'model.layers.10.mlp.up_proj', 'model.layers.10.mlp.down_proj', 'model.layers.11.self_attn.q_proj', 'model.layers.11.self_attn.k_proj', 'model.layers.11.self_attn.v_proj', 'model.layers.11.self_attn.o_proj', 'model.layers.11.mlp.gate_proj', 'model.layers.11.mlp.up_proj', 'model.layers.11.mlp.down_proj', 'model.layers.12.self_attn.q_proj', 'model.layers.12.self_attn.k_proj', 'model.layers.12.self_attn.v_proj', 'model.layers.12.self_attn.o_proj', 'model.layers.12.mlp.gate_proj', 'model.layers.12.mlp.up_proj', 'model.layers.12.mlp.down_proj', 'model.layers.13.self_attn.q_proj', 'model.layers.13.self_attn.k_proj', 'model.layers.13.self_attn.v_proj', 'model.layers.13.self_attn.o_proj', 'model.layers.13.mlp.gate_proj', 'model.layers.13.mlp.up_proj', 'model.layers.13.mlp.down_proj', 'model.layers.14.self_attn.q_proj', 'model.layers.14.self_attn.k_proj', 'model.layers.14.self_attn.v_proj', 'model.layers.14.self_attn.o_proj', 'model.layers.14.mlp.gate_proj', 'model.layers.14.mlp.up_proj', 
'model.layers.14.mlp.down_proj', 'model.layers.15.self_attn.q_proj', 'model.layers.15.self_attn.k_proj', 'model.layers.15.self_attn.v_proj', 'model.layers.15.self_attn.o_proj', 'model.layers.15.mlp.gate_proj', 'model.layers.15.mlp.up_proj', 'model.layers.15.mlp.down_proj', 'model.layers.16.self_attn.q_proj', 'model.layers.16.self_attn.k_proj', 'model.layers.16.self_attn.v_proj', 'model.layers.16.self_attn.o_proj', 'model.layers.16.mlp.gate_proj', 'model.layers.16.mlp.up_proj', 'model.layers.16.mlp.down_proj', 'model.layers.17.self_attn.q_proj', 'model.layers.17.self_attn.k_proj', 'model.layers.17.self_attn.v_proj', 'model.layers.17.self_attn.o_proj', 'model.layers.17.mlp.gate_proj', 'model.layers.17.mlp.up_proj', 'model.layers.17.mlp.down_proj', 'model.layers.18.self_attn.q_proj', 'model.layers.18.self_attn.k_proj', 'model.layers.18.self_attn.v_proj', 'model.layers.18.self_attn.o_proj', 'model.layers.18.mlp.gate_proj', 'model.layers.18.mlp.up_proj', 'model.layers.18.mlp.down_proj', 'model.layers.19.self_attn.q_proj', 'model.layers.19.self_attn.k_proj', 'model.layers.19.self_attn.v_proj', 'model.layers.19.self_attn.o_proj', 'model.layers.19.mlp.gate_proj', 'model.layers.19.mlp.up_proj', 'model.layers.19.mlp.down_proj', 'model.layers.20.self_attn.q_proj', 'model.layers.20.self_attn.k_proj', 'model.layers.20.self_attn.v_proj', 'model.layers.20.self_attn.o_proj', 'model.layers.20.mlp.gate_proj', 'model.layers.20.mlp.up_proj', 'model.layers.20.mlp.down_proj', 'model.layers.21.self_attn.q_proj', 'model.layers.21.self_attn.k_proj', 'model.layers.21.self_attn.v_proj', 'model.layers.21.self_attn.o_proj', 'model.layers.21.mlp.gate_proj', 'model.layers.21.mlp.up_proj', 'model.layers.21.mlp.down_proj', 'model.layers.22.self_attn.q_proj', 'model.layers.22.self_attn.k_proj', 'model.layers.22.self_attn.v_proj', 'model.layers.22.self_attn.o_proj', 'model.layers.22.mlp.gate_proj', 'model.layers.22.mlp.up_proj', 'model.layers.22.mlp.down_proj', 'model.layers.23.self_attn.q_proj', 'model.layers.23.self_attn.k_proj', 'model.layers.23.self_attn.v_proj', 'model.layers.23.self_attn.o_proj', 'model.layers.23.mlp.gate_proj', 'model.layers.23.mlp.up_proj', 'model.layers.23.mlp.down_proj', 'model.layers.24.self_attn.q_proj', 'model.layers.24.self_attn.k_proj', 'model.layers.24.self_attn.v_proj', 'model.layers.24.self_attn.o_proj', 'model.layers.24.mlp.gate_proj', 'model.layers.24.mlp.up_proj', 'model.layers.24.mlp.down_proj', 'model.layers.25.self_attn.q_proj', 'model.layers.25.self_attn.k_proj', 'model.layers.25.self_attn.v_proj', 'model.layers.25.self_attn.o_proj', 'model.layers.25.mlp.gate_proj', 'model.layers.25.mlp.up_proj', 'model.layers.25.mlp.down_proj', 'model.layers.26.self_attn.q_proj', 'model.layers.26.self_attn.k_proj', 'model.layers.26.self_attn.v_proj', 'model.layers.26.self_attn.o_proj', 'model.layers.26.mlp.gate_proj', 'model.layers.26.mlp.up_proj', 'model.layers.26.mlp.down_proj', 'model.layers.27.self_attn.q_proj', 'model.layers.27.self_attn.k_proj', 'model.layers.27.self_attn.v_proj', 'model.layers.27.self_attn.o_proj', 'model.layers.27.mlp.gate_proj', 'model.layers.27.mlp.up_proj', 'model.layers.27.mlp.down_proj']
Adding LoRA to the model...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  5.62it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  6.22it/s]
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[rank6]: Traceback (most recent call last):
[rank6]:   File "/home/mk.thomas/Qwen2-VL-Finetune/src/training/train.py", line 212, in <module>
[rank6]:     train()
[rank6]:   File "/home/mk.thomas/Qwen2-VL-Finetune/src/training/train.py", line 187, in train
[rank6]:     trainer.train()
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 2054, in train
[rank6]:     return inner_training_loop(
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 2390, in _inner_training_loop
[rank6]:     tr_loss_step = self.training_step(model, inputs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 3487, in training_step
[rank6]:     loss = self.compute_loss(model, inputs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 3534, in compute_loss
[rank6]:     outputs = model(**inputs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank6]:     ret_val = func(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
[rank6]:     loss = self.module(*inputs, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/peft/peft_model.py", line 563, in forward
[rank6]:     return self.get_base_model()(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:   File "/home/mk.thomas/miniconda3/envs/qwen2/lib/python3.10/site-packages/liger_kernel/transformers/model/qwen2_vl.py", line 108, in lce_forward
[rank6]:     inputs_embeds[image_mask] = image_embeds
[rank6]: RuntimeError: shape mismatch: value tensor of shape [7360, 3584] cannot be broadcast to indexing result of shape [2439, 3584]

Any help would be appreciated!

Thanks in advance!

2U1 commented 3 weeks ago

Does your dataset have mixed modality? Mixed-modality data is not supported for now; I'm still working out a way to handle it.
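
A quick way to check is to count how many entries in your JSON have an "image" key (assuming the LLaVA-style format this repo expects, where text-only samples simply omit it; the file path below is illustrative):

```python
# Count image vs. text-only samples in a LLaVA-style JSON file.
# Path and key names are illustrative; text-only entries are assumed to omit "image".
import json

with open("combined_test.json") as f:
    data = json.load(f)

n_image = sum(1 for sample in data if "image" in sample)
n_text_only = len(data) - n_image
print(f"image samples: {n_image}, text-only samples: {n_text_only}")
# If both counts are non-zero, the dataset mixes modalities.
```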

mano3-1 commented 3 weeks ago

I've raised PR #14 to handle mixed-modality datasets. Specifically, I exclude all_pixel_values and all_image_grid_thw from the final data_dict and implemented a data sampler to keep batches homogeneous, roughly along the lines of the sketch below.
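
This is only a sketch of the idea, not the exact code in the PR (the class name and has_image are illustrative):

```python
# Sketch of a modality-homogeneous batch sampler: every batch contains either
# only image+text samples or only text-only samples. Illustrative, not the PR code.
import random
from torch.utils.data import Sampler


class ModalityGroupedBatchSampler(Sampler):
    def __init__(self, has_image, batch_size, shuffle=True):
        self.has_image = list(has_image)  # one bool per dataset index
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        image_idx = [i for i, flag in enumerate(self.has_image) if flag]
        text_idx = [i for i, flag in enumerate(self.has_image) if not flag]
        if self.shuffle:
            random.shuffle(image_idx)
            random.shuffle(text_idx)
        batches = []
        for pool in (image_idx, text_idx):
            for start in range(0, len(pool), self.batch_size):
                batches.append(pool[start:start + self.batch_size])
        if self.shuffle:
            random.shuffle(batches)  # interleave modalities across training steps
        yield from batches

    def __len__(self):
        n_img = sum(self.has_image)
        n_txt = len(self.has_image) - n_img
        return -(-n_img // self.batch_size) + -(-n_txt // self.batch_size)
```

It would be passed to the DataLoader via batch_sampler=..., so the collator never sees a mixed batch.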

Could you help identify possible causes for a broadcasting error here?

2U1 commented 3 weeks ago

@mano3-1 It's a bit weird; I can't reproduce the same issue. Also, I'm looking at the PR you've made. Thanks for the PR, I'll review it.

Also, I think your code is using the latest transformers. Can you use the version in the README? The newest version can load the model, but I'm not sure the mrope is loaded properly with it.
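
Not sure which versions you ended up with, so something like this can confirm what's actually installed and let you compare against the README (the pinned versions aren't repeated here):

```python
# Print the installed versions to compare against the ones pinned in the README.
import torch
import transformers
import peft

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("peft:", peft.__version__)
```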

mano3-1 commented 3 weeks ago

Hey, I installed the recommended version of transformers (from your README). I'm still getting the same result (I tried with image-text pairs).

2U1 commented 3 weeks ago

@mano3-1 It's weird; it works fine for me right now. I'll keep debugging it.

2U1 commented 3 weeks ago

@mano3-1 Can you share part of the data you are training on? I think it's some kind of data problem, because I'm testing with my own sample dataset and it doesn't show any errors.

Julian2002 commented 2 weeks ago

I am also running into this problem right now. My dataset does not need to be mixed modality for my workload; how would one format it according to the LLaVA specification without it being mixed modality? Do you have an example of an entry?

2U1 commented 2 weeks ago

@Julian2002 @mano3-1

[
  {
    "id": "traffic_img1442_jpg",
    "image": "AccidentDetection/train/traffic_img1442_jpg.rf.0c12f50f745b52cff83669488676faac.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nCan you describe what\u2019s happening in the image?"
      },
      {
        "from": "gpt",
        "value": "In the image, there appears to be a scene of destruction where a vehicle is partially submerged in what looks like a swimming pool. The vehicle seems to have suffered significant damage, with parts of it distorted or broken. The pool has a blue tiled surface, and the water looks disturbed, indicating the vehicle has either recently entered the pool or some activity has occurred. The setting seems unusual since vehicles are not typically found in swimming pools, suggesting an accident or a staged event for entertainment or testing purposes."
      },
      {
        "from": "human",
        "value": "What objects are identifiable in the image, and where are they located?"
      },
      {
        "from": "gpt",
        "value": "The primary object in the image is a damaged vehicle. It is located at the center of the image, running lengthwise along the pool. The vehicle's rear is towards the bottom of the image, and the front is towards the top part, mostly submerged in water. The pool itself is another identifiable object, with its blue tiled lining clearly visible throughout the image. There is also a reddish-pink border running along the right side of the pool. The words \"DestructionNation\" are overlaid in the upper-left corner of the image."
      },
      {
        "from": "human",
        "value": "Based on the visual cues, what can be inferred about the possible purpose or convalue of this scene?"
      },
      {
        "from": "gpt",
        "value": "From the visual cues, it can be inferred that this scene is likely set up for an event or demonstration involving the destruction of vehicles, possibly for entertainment, research, or testing purposes. The presence of the value \"DestructionNation\" suggests a branding associated with destruction or crash testing, which might indicate that the scene is either part of a video, a show, or an experimental setup focusing on vehicle destruction dynamics. Additionally, the unusual placement of the vehicle in a swimming pool and the apparent significant damage imply that this might be a staged incident rather than an actual accident. This type of scene is frequently seen in scenarios where vehicles are put through extreme conditions to study impacts, gather data, or entertain viewers by displaying dramatic crashes."
      }
    ]
  },
  {
    "id": "SEV_22_139_jpg",
    "image": "accident3.0/train/SEV_22_139_jpg.rf.c1bfaa1cd384988e55a7f37de9e8cbd5.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat could be the possible cause of the congestion in the image, and how might it affect the vehicles on the highway?"
      },
      {
        "from": "gpt",
        "value": "In the image, depicted is a scene on a highway with considerable congestion. A likely cause of this congestion is the overturned vehicle in the middle of the road. The orientation and position of this vehicle suggest an accident or a loss of control which resulted in the vehicle flipping onto the road. To the left, a few cars are cautiously navigating around the incident, potentially slowing down to avoid collision and to assess the situation. In the background, more vehicles are seen approaching the congestion zone, indicating a build-up of traffic. The presence of the overturned vehicle necessitates careful maneuvering, potentially causing additional delays as other vehicles have to slow down significantly or change lanes to safely pass by. This bottleneck can result in a cascading effect, leading to slower overall traffic flow and prolonged congestion on the highway. Verifying the scene, no emergency response teams are visible, which might further exacerbate the delay until the situation is addressed."
      }
    ]
  },
  {
    "id": "argusAcc321_jpg",
    "image": "accident3.0/train/argusAcc321_jpg.rf.5bb271f3699a9db06865c12f6f77b682.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nBased on the visual elements observed, what potential traffic situation is depicted here, and how might it impact the flow of vehicles at this location?"
      },
      {
        "from": "gpt",
        "value": "The image shows a white car and a black car positioned close to each other, possibly indicating a traffic incident, such as a minor collision or contact. Both cars are at an intersection angle, suggesting they might have maneuvered into each other\u2019s paths. The proximity of the vehicles could cause a temporary obstruction, impacting the traffic flow by forcing other drivers to slow down or navigate around the stopped cars. If this is a busy area, it might lead to congestion or cause other vehicles to reroute. Additionally, without visible pedestrians, motorcycles, or signs, the focus remains solely on the cars, indicating nearby traffic measures might need to address vehicle-only issues cohesively. Ensuring such locations have clear signage and perhaps traffic lights can help mitigate similar incidents in the future."
      }
    ]
  },
  {
    "id": "acc1252_jpg",
    "image": "accident3.0/train/acc1252_jpg.rf.71f76bf5073e3012383af6ffd8f752dd.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nBased on the given image, describe the potential causes and consequences of the situation depicted? Include the interactions between the objects and people present."
      },
      {
        "from": "gpt",
        "value": "The image shows an overturned car in the middle of a road, which likely indicates a significant accident. There are a notable number of pedestrians and riders gathered in proximity to the overturned vehicle, suggesting that they may be either witnesses or people offering help. An emergency response vehicle, specifically an ambulance, is also visible nearby, suggesting medical assistance is either in progress or being prepared. The presence of multiple cars near the scene, some appearing to be pulled over, implies that the accident might have caused traffic disruptions. The truck and bus are positioned further back, potentially aware of the incident and thus stopping short. This setup indicates a serious incident with potential injuries, given the medical presence, and highlights the role of bystanders and responders in handling road accidents. The consequences likely include traffic delays, injuries from the crash, and an ongoing response to attend to any victims."
      }
    ]
  }
]

This is the dataset I'm using for testing right now.