PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Some weights of the model checkpoint at "./Video-LLaVA-7B" were not used when initializing LlavaLlamaForCausalLM: #153

ssuncheol opened this issue 6 months ago (status: Open)

When I run Video-LLaVA-7B to generate text, the following issue occurs. How can I solve this problem? The script and model weights I used are shown below.

Script : https://github.com/PKU-YuanGroup/Video-LLaVA?tab=readme-ov-file#inference-for-video

LanguageBind_Image : https://huggingface.co/LanguageBind/LanguageBind_Image

LanguageBind_Video_merge : https://huggingface.co/LanguageBind/LanguageBind_Video_merge


# Video-LLaVA/videollava/model/multimodal_encoder/builder.py

from .clip_encoder import CLIPVisionTower
from .languagebind import LanguageBindImageTower, LanguageBindVideoTower


def build_image_tower(image_tower_cfg, **kwargs):
    # Prefer the newer `mm_image_tower` config key, falling back to `image_tower`.
    image_tower = getattr(image_tower_cfg, 'mm_image_tower', getattr(image_tower_cfg, 'image_tower', None))
    return LanguageBindImageTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)


def build_video_tower(video_tower_cfg, **kwargs):
    # Prefer the newer `mm_video_tower` config key, falling back to `video_tower`.
    video_tower = getattr(video_tower_cfg, 'mm_video_tower', getattr(video_tower_cfg, 'video_tower', None))
    return LanguageBindVideoTower(video_tower, args=video_tower_cfg, cache_dir='./cache_dir', **kwargs)
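For context, the warning in the title is the standard Hugging Face Transformers message emitted when a checkpoint contains tensors that the model class being initialized has no matching parameters for. A minimal plain-PyTorch sketch (the model and the `vision_tower.*` key name below are illustrative only, not Video-LLaVA's actual code) shows the same mechanism via `load_state_dict(strict=False)`:

```python
# Sketch (assumptions: toy model and key names, NOT Video-LLaVA's loader) of
# how "some weights of the checkpoint were not used" arises: the checkpoint
# holds keys the target model class does not own.
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)


model = TinyModel()

# A checkpoint with one extra tensor, e.g. a vision-tower weight that the
# language-model class does not own (the key name is hypothetical).
state_dict = model.state_dict()
state_dict["vision_tower.patch_embed.weight"] = torch.zeros(4, 4)

# strict=False loads the matching keys and reports the rest; Hugging Face's
# from_pretrained surfaces the same mismatch as the warning in this issue.
result = model.load_state_dict(state_dict, strict=False)
print(result.unexpected_keys)  # -> ['vision_tower.patch_embed.weight']
```

If the listed keys are only the separately loaded tower weights (here, the LanguageBind image/video towers pulled from their own Hugging Face repos), the warning is typically harmless rather than a sign of a broken checkpoint.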