PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf

size mismatch #176

Open cs19469 opened 2 months ago

cs19469 commented 2 months ago

The traceback (the same frames are printed by several DeepSpeed ranks and interleave in the log; deduplicated here):

```
  File "/home/work/chengshuaibo/Video-LLaVA/videollava/train/train_mem.py", line 17, in <module>
    train()
  File "/home/work/chengshuaibo/Video-LLaVA/videollava/train/train.py", line 1003, in train
    model.get_model().initialize_vision_modules(
  File "/home/work/chengshuaibo/Video-LLaVA/videollava/model/llava_arch.py", line 122, in initialize_vision_modules
    self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
  File "/home/work/miniforge3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
```

(The paste is cut off at this point, so the actual size-mismatch message listing the offending tensors is missing.)
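The failure means `load_state_dict` found a tensor in `mm_projector.bin` whose shape differs from the matching parameter of the projector being built. To see what shapes the checkpoint actually holds, here is a minimal inspection sketch in plain PyTorch (the path is the one from my script below):

```python
# Sketch: dump the tensor names and shapes stored in the pretrained
# adapter, so they can be compared against the projector the trainer builds.
import torch

weights = torch.load(
    "./checkpoints/videollava-7b-pretrain/mm_projector.bin",
    map_location="cpu",
)
for name, tensor in weights.items():
    print(name, tuple(tensor.shape))
```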

My finetune_lora.sh:

```bash
JSON_FOLDER="JSON"
IMAGE_FOLDER="DATA_ROOT"
VIDEO_FOLDER="DATA_ROOT"
model_name_or_path="./cache_dir/models--LanguageBind--Video-LLaVA-7B/snapshots/aecae02b7dee5c249e096dcb0ce546eb6f811806"
pretrain_mm_mlp_adapter=./checkpoints/videollava-7b-pretrain

cd ~/chengshuaibo/Video-LLaVA

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed videollava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path $model_name_or_path \
    --version v1 \
    --data_path ${JSON_FOLDER}/videochatgpttune.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower ./cache_dir/models--LanguageBind--LanguageBind_Image \
    --video_folder ${VIDEO_FOLDER} \
    --video_tower ./cache_dir/models--LanguageBind--LanguageBind_Video_merge \
    --mm_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter $pretrain_mm_mlp_adapter/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/videollava-7b-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 --tokenizer_model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"
```

Hi, when I try to fine-tune with LoRA, I hit the error above. Is there something wrong with how the mm_projector weights are loaded?
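For context, `--mm_projector_type mlp2x_gelu` builds a two-layer MLP, so the adapter only loads cleanly if it was pretrained with the same projector type and hidden sizes. Below is a standalone sketch of that consistency check; the 1024/4096 dimensions are assumptions for a 7B model, so read the real values from the model's config.json:

```python
# Sketch: rebuild an mlp2x_gelu-style projector and try to load the
# pretrained adapter into it; a shape disagreement reproduces the
# RuntimeError shown in the traceback above.
import torch
import torch.nn as nn

mm_hidden_size = 1024   # assumed vision feature size; check config.json
hidden_size = 4096      # assumed LLM hidden size for a 7B model

projector = nn.Sequential(
    nn.Linear(mm_hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
)

weights = torch.load(
    "./checkpoints/videollava-7b-pretrain/mm_projector.bin",
    map_location="cpu",
)
# Strip the "model.mm_projector." prefix from the saved keys, mirroring
# what get_w() does in llava_arch.py before load_state_dict is called.
stripped = {k.split("mm_projector.")[-1]: v for k, v in weights.items()}
projector.load_state_dict(stripped)
print("adapter matches an mlp2x_gelu projector of these sizes")
```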