def upload_video() in video_llama\conversation\conversation_video.py seems wrong?

When running the demo locally, I noticed that upload_video() in video_llama\conversation\conversation_video.py first calls img_list.append(image_emb) and then img_list.append(audio_emb). However, the prompt is built as follows:

conv.append_message(conv.roles[0], "Close your eyes, open your ears and you imagine only based on the sound that: <ImageHere>. \
                Close your ears, open your eyes and you see that <Video><ImageHere></Video>.  \
                Now answer my question based on what you have just seen and heard.")

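As far as I can tell, this means the audio prompt's <ImageHere> is matched with img_list[0] (the image embedding), and the video prompt's <ImageHere> is matched with img_list[1] (the audio embedding). Am I misunderstanding something?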
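To make the concern concrete, here is a minimal sketch of the pairing I have in mind. It assumes that placeholders are filled in order of appearance, i.e. the prompt is split on <ImageHere> and the i-th gap receives img_list[i]. The helper pair_placeholders is hypothetical and written only for illustration, and plain strings stand in for the real embedding tensors.

```python
PLACEHOLDER = "<ImageHere>"


def pair_placeholders(prompt, emb_list):
    """Pair each placeholder with the embedding that would fill it.

    Hypothetical helper: assumes the i-th <ImageHere> in the prompt
    receives emb_list[i], in order of appearance.
    """
    segments = prompt.split(PLACEHOLDER)
    if len(segments) != len(emb_list) + 1:
        raise ValueError("number of placeholders and embeddings differ")
    # The text just before each placeholder shows which modality it refers to.
    return [(seg.strip(), emb) for seg, emb in zip(segments[:-1], emb_list)]


prompt = (
    "Close your eyes, open your ears and you imagine only based on the sound that: <ImageHere>. "
    "Close your ears, open your eyes and you see that <Video><ImageHere></Video>. "
    "Now answer my question based on what you have just seen and heard."
)

# Same append order as described above: image first, then audio.
# Strings are stand-ins for the actual embedding tensors.
img_list = ["image_emb", "audio_emb"]

pairs = pair_placeholders(prompt, img_list)
for preceding_text, emb in pairs:
    print(repr(preceding_text[-30:]), "->", emb)
```

If that assumption holds, the sound sentence ends up with image_emb and the <Video> sentence with audio_emb. Appending audio_emb before image_emb, or swapping the two sentences in the prompt, would presumably line them up.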

DAMO-NLP-SG / Video-LLaMA, issue #80