PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Error: RuntimeError: Error(s) in loading state_dict for CLIPVisionModel: size mismatch for vision_model.embeddings.class_embedding: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]). #175

Open zapqqqwe opened 4 months ago

zapqqqwe commented 4 months ago

Why am I getting this error? I downloaded LanguageBind/LanguageBind_Video_merge and LanguageBind/LanguageBind_Image locally and, in config.json, changed mm_video_tower and mm_image_tower to their local paths, but the error still occurs. It looks like CLIP's hidden size is 768 while the configured value is 1024. How can I fix this?
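One way to confirm which checkpoint actually landed on disk is to read the vision hidden size out of each tower's config.json. The sketch below is an assumption-laden diagnostic, not part of Video-LLaVA: the key names ("vision_config", "hidden_size") follow typical CLIP-style configs, and the 768-vs-1024 contrast in the traceback usually means a ViT-Base (768-dim) checkpoint was downloaded where the code expects a ViT-Large (1024-dim) one.

```python
import json

def tower_hidden_size(config_text):
    """Return the vision hidden size recorded in a tower's config.json text.

    Some configs nest vision settings under "vision_config" (assumed key);
    plain CLIP vision configs keep "hidden_size" at the top level.
    """
    cfg = json.loads(config_text)
    vision = cfg.get("vision_config", cfg)
    return vision.get("hidden_size")

# Minimal examples mimicking the two shapes seen in the traceback.
large = '{"vision_config": {"hidden_size": 1024}}'  # what the model expects
base  = '{"hidden_size": 768}'                      # what the checkpoint holds
print(tower_hidden_size(large))  # 1024
print(tower_hidden_size(base))   # 768
```

In practice you would read the text from `<local_tower_dir>/config.json` for each tower; if either prints 768, re-download the Large variant of that tower rather than editing the shape in config.json.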

zapqqqwe commented 4 months ago

The LanguageBind/LanguageBind_Video_merge and LanguageBind/LanguageBind_Image from Hugging Face.

Liu98C commented 4 months ago

> The LanguageBind/LanguageBind_Video_merge and LanguageBind/LanguageBind_Image from Hugging Face.

Have you solved this?

Liu98C commented 4 months ago

> The LanguageBind/LanguageBind_Video_merge and LanguageBind/LanguageBind_Image from Hugging Face.

https://github.com/PKU-YuanGroup/Video-LLaVA/issues/57#issuecomment-1880367313