DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

finetune-billa7b-zh inference error: shape '[-1, 136]' is invalid for input of size 137 #161

Open len2618187 opened 5 months ago

len2618187 commented 5 months ago

Hi, thank you very much for your great work. I encountered a problem while using the finetune-billa7b-zh model for inference. The configuration is as follows:

model:
    arch: video_llama
    model_type: pretrain_vicuna
    freeze_vit: True
    freeze_qformer: True
    max_txt_len: 512
    end_sym: "###"
    low_resource: False
    frozen_llama_proj: False
    q_former_model: "pretrain_model/q_former_model/blip2_pretrained_flant5xxl.pth"
    vit_model: "pretrain_model/vit_model/eva_vit_g.pth"
    llama_model: "pretrain_model/BiLLa-7B-SFT"  # https://huggingface.co/Neutralzz/BiLLa-7B-SFT
    ckpt: "pretrain_model/video_llama_zh/finetune-billa7b-zh.pth"
    equip_audio_branch: False
    fusion_head_layers: 2
    max_frame_pos: 32
    fusion_header_type: "seqTransf"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"
run:
  task: video_text_pretrain
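
Before running, I sanity-checked the YAML nesting on my side, since a mis-indented key under model: would silently change the config. This is just a quick standalone check with OmegaConf, not the repo's own config loader, and the file name is simply whatever I saved the config above as:

from omegaconf import OmegaConf

# Standalone check of the config above (hypothetical file name).
# Not how Video-LLaMA loads configs; it only confirms the YAML nesting.
cfg = OmegaConf.load("eval_configs/video_llama_eval_billa.yaml")
assert cfg.model.arch == "video_llama"
print(OmegaConf.to_yaml(cfg.model))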

Then I got this error:

File "Video-LLaMA/video_llama/models/modeling_llama.py", line 517, in forward
    position_ids = position_ids.view(-1, seq_length).long()
RuntimeError: shape '[-1, 136]' is invalid for input of size 137
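
As far as I can tell, the failure itself is an off-by-one: position_ids holds one more element (137) than the seq_length (136) that modeling_llama.py derives from the input embeddings, so the view cannot reshape it. Here is a minimal sketch that reproduces the same RuntimeError outside the model (my own illustration, not the actual Video-LLaMA code path):

import torch

seq_length = 136                                           # length derived from the input embeddings
position_ids = torch.arange(seq_length + 1).unsqueeze(0)   # 137 elements, one too many
position_ids.view(-1, seq_length)                          # RuntimeError: shape '[-1, 136]' is invalid for input of size 137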

Can you tell me where I went wrong with my configuration? Thanks again!