LLaVA-VL / LLaVA-NeXT


size mismatch for vision_model.embeddings.patch_embedding.weight: #246

Closed · hshc123 closed this 14 hours ago

hshc123 commented 3 days ago

Hello, author.

When running the inference demo of the model "lmms-lab/LLaVA-Video-7B-Qwen2," an error occurred while loading the vision tower (siglip-so400m-patch14-384):

File "/home/jeeves/LLaVA-NeXT-main/llava/model/multimodal_encoder/clip_encoder.py", line 41, in load_model self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map) File "/home/jeeves/.conda/envs/zyy_llava_next_video/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3677, in from_pretrained ) = cls._load_pretrained_model( File "/home/jeeves/.conda/envs/zyy_llava_next_video/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4155, in _load_pretrained_model raise RuntimeError(f"Error(s) in loading state_dict for {model.class.name}:\n\t{error_msg}") RuntimeError: Error(s) in loading state_dict for CLIPVisionModel: size mismatch for vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([1152, 3, 14, 14]) from checkpoint, the shape in current model is torch.Size([768, 3, 32, 32]). size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([50, 768]). size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1152, 1152]) from checkpoint, the shape in current model is torch.Size([768, 768]).

Caozhou1995 commented 14 hours ago

If you load the checkpoint locally, fix the code as follows: [screenshot attached; a sketch of the likely change is below] @hshc123
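The screenshot is not reproduced here. A hedged sketch of the kind of change it likely shows, assuming the stock llava/model/multimodal_encoder/builder.py sends any existing local path to CLIPVisionTower before it checks for "siglip" in the name (the class names and keyword arguments below follow the repo layout but are assumptions, not the exact code from the screenshot):

```python
# Sketch of build_vision_tower in llava/model/multimodal_encoder/builder.py.
import os

from .clip_encoder import CLIPVisionTower
from .siglip_encoder import SigLipVisionTower


def build_vision_tower(vision_tower_cfg, **kwargs):
    # Resolve the vision tower name or local path from the model config.
    vision_tower = getattr(vision_tower_cfg, "mm_vision_tower", getattr(vision_tower_cfg, "vision_tower", None))

    # Route SigLIP checkpoints *before* the generic local-path branch, so a locally
    # downloaded siglip-so400m-patch14-384 folder is not loaded as a CLIP model.
    if "siglip" in vision_tower.lower():
        return SigLipVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)

    if os.path.exists(vision_tower) or vision_tower.startswith("openai") or vision_tower.startswith("laion"):
        return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)

    raise ValueError(f"Unknown vision tower: {vision_tower}")
```

If the local folder name does not contain "siglip", either rename the folder so the check matches or hard-code the SigLIP branch for that path.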

hshc123 commented 14 hours ago

Thank you! It's working now. @Caozhou1995