When running the demo locally, upload_video() in video_llama\conversation\conversation_video.py first calls img_list.append(image_emb) and then img_list.append(audio_emb).
However, the prompt appended afterwards is:
conv.append_message(conv.roles[0], "Close your eyes, open your ears and you imagine only based on the sound that: <ImageHere>. \
Close your ears, open your eyes and you see that <Video><ImageHere></Video>. \
Now answer my question based on what you have just seen and heard.")
This means the audio prompt is matched with img_list[0] and the video prompt with img_list[1], which is the reverse of the order in which the embeddings were appended.
Am I misunderstanding something?
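To make the concern concrete, here is a minimal sketch of how I think the placeholders get filled. It assumes each <ImageHere> is replaced left to right by the next entry of img_list (by splitting the prompt on the placeholder); I have not confirmed that the repository does exactly this. The function name interleave is hypothetical, and plain strings stand in for the real embedding tensors.

```python
def interleave(prompt, img_list, placeholder="<ImageHere>"):
    """Fill each placeholder, left to right, with the next img_list entry.

    Assumption: this mirrors how the prompt and embeddings are combined;
    it is an illustration, not the repository's code.
    """
    segments = prompt.split(placeholder)
    if len(segments) != len(img_list) + 1:
        raise ValueError("number of placeholders must equal len(img_list)")
    mixed = []
    for text, emb in zip(segments[:-1], img_list):
        mixed.append(text)
        mixed.append(emb)
    mixed.append(segments[-1])
    return mixed


prompt = (
    "Close your eyes, open your ears and you imagine only based on the sound that: <ImageHere>. "
    "Close your ears, open your eyes and you see that <Video><ImageHere></Video>. "
    "Now answer my question based on what you have just seen and heard."
)

# Same append order as in upload_video(): image first, audio second.
img_list = ["image_emb", "audio_emb"]

mixed = interleave(prompt, img_list)
for part in mixed:
    print(repr(part))
```

Under that assumption, the "sound" sentence is followed by image_emb and the <Video> tag wraps audio_emb, which is the mismatch I am asking about.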