DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

A demo without gradio #140


liboliba commented 9 months ago

Hello, thanks for the Gradio example. Is there an example of reading in a video file and then doing Q&A on the command line, without Gradio? My GPUs are on an offline machine, so I have no use for Gradio, and the demo is also a bit confusing for people who are unfamiliar with it or don't want to use it.

Thank you.

llx-08 commented 7 months ago

Hi, you can extract Gradio's inference operations and call them manually, as in the code below.
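The loop below assumes that `args`, `chat`, `default_conversation`, and `conv_llava_llama_2` are already initialized the same way the Gradio demo initializes them. Here is a minimal setup sketch, roughly following this repo's demo_video.py (the argument names and config layout are taken from that script; adjust the config path and `--model_type` for your checkpoint):

import argparse

import decord
from video_llama.common.config import Config
from video_llama.common.registry import registry
from video_llama.conversation.conversation_video import (
    Chat, default_conversation, conv_llava_llama_2,
)

# demo_video.py sets decord's bridge to torch before any video is loaded.
decord.bridge.set_bridge('torch')

# Same CLI arguments as the Gradio demo.
parser = argparse.ArgumentParser(description="Video-LLaMA CLI demo")
parser.add_argument("--cfg-path", required=True, help="path to the eval config file")
parser.add_argument("--gpu-id", type=int, default=0, help="GPU to load the model on")
parser.add_argument("--model_type", type=str, default='vicuna', help="LLM backbone: vicuna or llama_v2")
parser.add_argument("--options", nargs="+", help="override config options")
args = parser.parse_args()

# Build the model and visual processor from the config, then wrap them in Chat.
cfg = Config(args)
model_config = cfg.model_cfg
model_config.device_8bit = args.gpu_id
model_cls = registry.get_model_class(model_config.arch)
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
model.eval()
vis_processor_cfg = cfg.datasets_cfg.webvid.vis_processor.train
vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
chat = Chat(model, vis_processor, device='cuda:{}'.format(args.gpu_id))

With that in place, the interactive Q&A loop: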

# Pick the conversation template that matches the LLM backbone.
if args.model_type == 'vicuna':
    chat_state = default_conversation.copy()
else:
    chat_state = conv_llava_llama_2.copy()

video_path = "your_path"  # path to the video file you want to ask about
chat_state.system = ""
img_list = []

# Encode the video once; its embeddings are appended to img_list.
llm_message = chat.upload_video(video_path, chat_state, img_list)

while True:
    user_message = input("User> ")
    if user_message.strip().lower() in ("exit", "quit"):
        break

    # Append the user turn to the conversation state.
    chat.ask(user_message, chat_state)

    num_beams = 2
    temperature = 1.0

    # Generate the assistant turn; answer() returns (text, output_tokens),
    # so [0] is the decoded reply.
    llm_message = chat.answer(conv=chat_state,
                              img_list=img_list,
                              num_beams=num_beams,
                              temperature=temperature,
                              max_new_tokens=300,
                              max_length=2000)[0]
    # print(chat_state.get_prompt())  # uncomment to inspect the full prompt
    print(llm_message)
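If you save this as a standalone script (say demo_cli.py, a name chosen here just for illustration), you can launch it with the same arguments as the Gradio demo, using one of the repo's eval configs, e.g.:

python demo_cli.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml --model_type vicuna --gpu-id 0

Since everything goes through chat.upload_video / chat.ask / chat.answer, no Gradio import is needed at all.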