PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Size mismatch error when running locally. #152

Closed: ssuncheol closed this issue 6 months ago

ssuncheol commented 6 months ago

I want to run the script below locally, but a size mismatch error occurs while loading the model checkpoint. How can I solve this problem? The script and models are listed below.

Script : https://github.com/PKU-YuanGroup/Video-LLaVA?tab=readme-ov-file#inference-for-image

LanguageBind_Image : https://huggingface.co/LanguageBind/LanguageBind_Image

LanguageBind_Video_merge : https://huggingface.co/LanguageBind/LanguageBind_Video_merge


# Video-LLaVA/videollava/model/multimodal_encoder/builder.py

import os
from .clip_encoder import CLIPVisionTower
from .languagebind import LanguageBindImageTower, LanguageBindVideoTower

def build_image_tower(image_tower_cfg, **kwargs):
    # Read the tower name from 'mm_image_tower', falling back to 'image_tower'.
    image_tower = getattr(image_tower_cfg, 'mm_image_tower', getattr(image_tower_cfg, 'image_tower', None))
    # Always builds a LanguageBindImageTower; CLIPVisionTower is imported but not used here.
    return LanguageBindImageTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)

def build_video_tower(video_tower_cfg, **kwargs):
    # Same lookup for the video tower: 'mm_video_tower', then 'video_tower'.
    video_tower = getattr(video_tower_cfg, 'mm_video_tower', getattr(video_tower_cfg, 'video_tower', None))
    return LanguageBindVideoTower(video_tower, args=video_tower_cfg, cache_dir='./cache_dir', **kwargs)
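
A size mismatch while loading a checkpoint generally means that at least one parameter in the checkpoint has a different shape from the same-named parameter in the model being built. The following is a minimal, framework-independent sketch for locating such parameters, not a fix for this specific issue. It assumes you can obtain two name-to-shape mappings, one from the model and one from the checkpoint. The helper names and the example parameter names are hypothetical and do not come from the Video-LLaVA code.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    expected: Dict[str, Shape], loaded: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, expected_shape, loaded_shape) for every parameter that
    appears in both mappings with a different shape, sorted by name."""
    return [
        (name, tuple(expected[name]), tuple(loaded[name]))
        for name in sorted(expected.keys() & loaded.keys())
        if tuple(expected[name]) != tuple(loaded[name])
    ]


def find_key_differences(
    expected: Dict[str, Shape], loaded: Dict[str, Shape]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected): names only the model has, and names
    only the checkpoint has, each sorted by name."""
    missing = sorted(expected.keys() - loaded.keys())
    unexpected = sorted(loaded.keys() - expected.keys())
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical names and shapes, for illustration only.
    model_shapes = {"proj.weight": (4096, 1024), "proj.bias": (4096,)}
    checkpoint_shapes = {"proj.weight": (4096, 768), "extra.bias": (8,)}

    for name, want, got in find_shape_mismatches(model_shapes, checkpoint_shapes):
        print(f"{name}: model expects {want}, checkpoint has {got}")

    missing, unexpected = find_key_differences(model_shapes, checkpoint_shapes)
    print("missing from checkpoint:", missing)
    print("unexpected in checkpoint:", unexpected)
```

With PyTorch, the model-side mapping can be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}`. The checkpoint-side mapping can be built the same way from the dictionary returned by `torch.load(path, map_location="cpu")`, assuming the file holds a plain state dict. The mismatched names usually show which component (for example, an image or video tower) was built with a configuration that differs from the one the checkpoint was saved with.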

FanshuoZeng commented 4 months ago

I also encountered the same problem. How did you solve it?

Cu2ta1n commented 4 months ago

Same question

Y-J-Zhang commented 2 months ago

Same here. Has anyone solved this problem?