PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Pretrained parameter dim size differs from the model dim size #151

Open NEC09818 opened 4 months ago

NEC09818 commented 4 months ago

I ran into a problem after downloading everything from Hugging Face and running finetune.sh, and I have no idea how to fix it. Below is part of the error output; there is a lot more that I didn't paste, because it is the same size-mismatch error repeated for every layer.

Traceback (most recent call last):
  File "/root/autodl-tmp/Video-LLaVA/videollava/train/train_mem.py", line 13, in <module>
    train()
  File "/root/autodl-tmp/Video-LLaVA/videollava/train/train.py", line 1003, in train
    model.get_model().initialize_vision_modules(
  File "/root/autodl-tmp/Video-LLaVA/videollava/model/llava_arch.py", line 66, in initialize_vision_modules
    image_tower = build_image_tower(model_args)
  File "/root/autodl-tmp/Video-LLaVA/videollava/model/multimodal_encoder/builder.py", line 11, in build_image_tower
    return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs)
  File "/root/autodl-tmp/Video-LLaVA/videollava/model/multimodal_encoder/clip_encoder.py", line 18, in __init__
    self.load_model()
  File "/root/autodl-tmp/Video-LLaVA/videollava/model/multimodal_encoder/clip_encoder.py", line 24, in load_model
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
  File "/root/miniconda3/envs/videollava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/videollava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3310, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
    size mismatch for vision_model.embeddings.class_embedding: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([1024, 3, 14, 14]) from checkpoint, the shape in current model is torch.Size([768, 3, 32, 32]).
    size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([257, 1024]) from checkpoint, the shape in current model is torch.Size([50, 768]).
    size mismatch for vision_model.pre_layrnorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.pre_layrnorm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.0.self_attn.out_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.layer_norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.layer_norm1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.layer_norm2.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.0.layer_norm2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.out_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for vision_model.encoder.layers.1.self_attn.out_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.1.layer_norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.1.layer_norm1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for vision_model.encoder.layers.1.mlp.fc1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
    ……
    size mismatch for vision_model.encoder.layers.11.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
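For context on the shapes in the message: the checkpoint weights are 1024-dim with a 14x14 patch embedding and 257 position embeddings, i.e. a ViT-L/14-style CLIP vision tower, while the model being instantiated is 768-dim with 32x32 patches (ViT-B/32-style). So the config that `CLIPVisionModel.from_pretrained(self.vision_tower_name)` resolves to does not describe the same architecture as the downloaded weights. A minimal sanity check, assuming the tower path is a local directory with a CLIP-style config (the path below is a placeholder; substitute the exact `--image_tower` value from your finetune.sh):

```python
# Hedged sketch: compare the config the tower path resolves to against the
# shapes reported in the traceback (hidden_size 1024, patch_size 14).
from transformers import CLIPVisionConfig

tower_path = "/path/to/image_tower"  # placeholder -- use your --image_tower value

cfg = CLIPVisionConfig.from_pretrained(tower_path)
print("hidden_size:", cfg.hidden_size)  # checkpoint weights expect 1024
print("patch_size:", cfg.patch_size)    # checkpoint weights expect 14
print("image_size:", cfg.image_size)    # 224 with patch 14 gives 257 position embeddings

# If this prints 768 / 32, the config being picked up is a ViT-B/32 one (e.g. a stale
# or mismatched config.json / cache entry) and does not belong to the downloaded
# weights, which would reproduce exactly the size mismatches above.
```

If the printed values disagree with the checkpoint shapes, re-downloading the tower so that config.json and the weights come from the same repo, or pointing `--image_tower` at the correct directory, is the usual fix for this kind of state_dict size mismatch.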

XCYYYYY commented 4 months ago

Same problem here. Did you manage to fix this, @NEC09818? Thanks!