dvlab-research / LLaMA-VID

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Apache License 2.0

AssertionError: Size mismatch! image_features: 1, prompts: 8 #95

Open szbcasia opened 6 months ago

szbcasia commented 6 months ago

Hello, I ran into the following error during the second stage of training:

  File "/home/AI_project/LLaMA-VID/llamavid/model/language_model/llava_llama_vid.py", line 80, in forward
    input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images, prompts=prompts)
  File "/home/AI_project/LLaMA-VID/llamavid/model/llamavid_arch.py", line 532, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images, prompts, long_video=long_video)
  File "/home/AI_project/LLaMA-VID/llamavid/model/llamavid_arch.py", line 341, in encode_images
    image_features = self.vlm_attention(image_features,
  File "/home/AI_project/LLaMA-VID/llamavid/model/llamavid_arch.py", line 350, in vlm_attention
    assert len(image_features) == len(
AssertionError: Size mismatch! image_features: 1, prompts: 8

Is there a known solution?
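
To make the mismatch concrete, here is a minimal sketch of the kind of check that fails in the traceback. The full `assert` condition is cut off in the traceback, so this assumes it compares `len(image_features)` with `len(prompts)`, which is what the error message suggests. `check_batch_alignment` is a hypothetical helper, not a function from the LLaMA-VID code base, and plain Python lists stand in for the real tensors.

```python
def check_batch_alignment(image_features, prompts):
    """Hypothetical stand-in for the failing check (not LLaMA-VID code).

    Assumes the check requires one image_features entry per prompt.
    """
    n_feats, n_prompts = len(image_features), len(prompts)
    if n_feats != n_prompts:
        raise AssertionError(
            f"Size mismatch! image_features: {n_feats}, prompts: {n_prompts}"
        )
    return n_feats


# Illustrative data only: a single outer entry on the image side,
# but eight prompts, as in the reported error.
features = [["frame_feature"] * 8]
prompts = [f"question {i}" for i in range(8)]

try:
    check_batch_alignment(features, prompts)
except AssertionError as err:
    message = str(err)
    print(message)
```

Under that assumption, the error means the image side of the batch has a single outer entry while the text side has eight prompts, so printing `len(images)` and `len(prompts)` just before `encode_images` is called may help narrow down where the two diverge.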