dvlab-research / MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

images_aux encode error #4

Closed: duisuangsheng closed this issue 4 months ago

duisuangsheng commented 5 months ago

When using cli.py for inference, I encountered the following error. How should I solve it?

File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/mini_gemini_arch.py", line 255, in encode_images if images_aux is not None: File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, kwargs) File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward output = old_forward(*args, *kwargs) File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 58, in forward image_features = self.image_forward(images) File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 50, in image_forward image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True) File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(args, kwargs) File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 917, in forward return self.vision_model( File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, *kwargs) File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 841, in forward hidden_states = self.embeddings(pixel_values) File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(args, **kwargs) File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 187, in forward embeddings = embeddings + self.position_embedding(self.position_ids) RuntimeError: The size of tensor a (2305) must match the size of tensor b (50) at non-singleton dimension 1

yanwei-li commented 5 months ago

Hi, thanks for your interest in our work. Could you please provide the specific command you used to run this model? We did not encounter this error when using cli.py.

duisuangsheng commented 5 months ago

> Hi, thanks for your interest in our work. Could you please provide the specific command you used to run this model? We did not encounter this error when using cli.py.

python3 -m minigemini.serve.cli --model-path ./Mini-Gemini-8x7B-HD --image-file ./flow.jpeg

I downloaded the CLIP models through the links provided on the homepage. The links are as follows; please help me check whether they are correct. Thank you very much for the reply.

https://huggingface.co/openai/clip-vit-large-patch14-336
https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

yanwei-li commented 5 months ago

It seems the error is caused by the CLIP module. Please make sure your transformers version is >= 4.36.2 and that the downloaded models are organized as described in the README.
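For anyone hitting the same error, a minimal sanity check along those lines; the local path below is an assumption and should point at wherever the checkpoint actually lives on disk:

```python
# Minimal sketch: verify the transformers version and that the loaded
# vision tower really is the 336px / patch-14 CLIP checkpoint.
import transformers
from transformers import CLIPVisionModel

print(transformers.__version__)  # the fix above expects >= 4.36.2

# Assumed local path; replace with your own download location.
tower = CLIPVisionModel.from_pretrained("./clip-vit-large-patch14-336")
cfg = tower.config
print(cfg.image_size, cfg.patch_size)               # expect 336 and 14
print((cfg.image_size // cfg.patch_size) ** 2 + 1)  # expect 577 position tokens
```

If the config reports 224/32 or the position count comes out as 50, the checkpoint on disk is not the intended 336px model and likely needs to be re-downloaded or re-pathed.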