DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

def upload_video() in video_llama\conversation\conversation_video.py seems wrong? #80

Closed: yjwang346 closed this issue 1 year ago

yjwang346 commented 1 year ago

When running the demo locally, I noticed that upload_video() in video_llama\conversation\conversation_video.py first calls img_list.append(image_emb) and then img_list.append(audio_emb). However, the prompt is built like this:

conv.append_message(conv.roles[0], "Close your eyes, open your ears and you imagine only based on the sound that: <ImageHere>. \
                Close your ears, open your eyes and you see that <Video><ImageHere></Video>.  \
                Now answer my question based on what you have just seen and heard.")

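This means the audio prompt (the first <ImageHere>) is matched with img_list[0], which holds the image embedding, and the video prompt (the second <ImageHere>) is matched with img_list[1], which holds the audio embedding. Am I mistaken?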
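
To make the suspected mismatch concrete, here is a minimal sketch. It assumes the placeholders are filled strictly in order, i.e. the n-th <ImageHere> receives img_list[n]. The helper fill_placeholders and the string stand-ins for the embeddings are hypothetical and only for illustration; this is not the repository's actual code, and the prompt is shortened.

```python
PLACEHOLDER = "<ImageHere>"

def fill_placeholders(prompt, emb_list):
    """Hypothetical helper: interleave prompt segments with embeddings, in list order."""
    segments = prompt.split(PLACEHOLDER)
    # Each placeholder needs exactly one embedding.
    assert len(segments) == len(emb_list) + 1, "placeholder/embedding count mismatch"
    mixed = []
    for seg, emb in zip(segments, emb_list):
        mixed.extend([seg, emb])
    mixed.append(segments[-1])
    return "".join(mixed)

# Shortened version of the prompt: audio placeholder first, video placeholder second.
prompt = ("Imagine only based on the sound that: <ImageHere>. "
          "You see that <Video><ImageHere></Video>.")

# Order described in this issue: image embedding appended first, audio second.
reported = fill_placeholders(prompt, ["[image_emb]", "[audio_emb]"])
# Order that would match the prompt: audio first, image second.
swapped = fill_placeholders(prompt, ["[audio_emb]", "[image_emb]"])

print(reported)
print(swapped)
```

With the reported order, the image embedding ends up in the "sound" slot and the audio embedding inside the <Video> tags. Under the assumption above, either swapping the two append calls or swapping the two sentences in the prompt would line them up.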

hangzhang-nlp commented 1 year ago

Thanks a lot, we have fixed it.