LLaVA-VL / LLaVA-NeXT


Only output [1, 2] tokens for 'lmms-lab/LLaVA-NeXT-Video-7B-DPO' video demo inference #52

Open LeonLIU08 opened 3 weeks ago

LeonLIU08 commented 3 weeks ago

The value of output_ids is tensor([[1, 2]], device='cuda:0'). The rest of the demo script's output is:

Question: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes ASSISTANT:

Response:

ZhangYuanhan-AI commented 3 weeks ago

Could you please tell me the command you used?

LeonLIU08 commented 3 weeks ago

The command: bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-7B-DPO vicuna_v1 32 2 True xxx.mp4

By the way, I found that using pool_stride=4 solves this: with stride=2 the input token length is 4673, which exceeds the LLM's max_length (4096).
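
For anyone hitting the same thing, here is a rough sketch of the arithmetic behind that token count. It is not taken from the repository: the function name is hypothetical, and it assumes each frame yields a 24x24 grid of visual tokens before pooling (what a 336-pixel, patch-14 vision encoder would give) and that pooling divides each side of the grid by the stride. The 65 text tokens are not stated anywhere in this thread; they are simply what remains of the reported 4673 under those assumptions with 32 frames.

```python
import math

MAX_LENGTH = 4096  # LLM context limit mentioned in this thread


def estimate_input_tokens(num_frames, pool_stride, grid_side=24, text_tokens=65):
    """Estimate the total input length for a video prompt.

    grid_side=24 and text_tokens=65 are assumptions, not values read from
    the model config; adjust them for your encoder and prompt.
    """
    # Spatial pooling shrinks each side of the per-frame token grid.
    pooled_side = math.ceil(grid_side / pool_stride)
    visual_tokens = num_frames * pooled_side * pooled_side
    return visual_tokens + text_tokens


for stride in (2, 4):
    total = estimate_input_tokens(num_frames=32, pool_stride=stride)
    print(f"pool_stride={stride}: {total} tokens, fits={total <= MAX_LENGTH}")
```

Under these assumptions, stride=2 lands on the 4673 tokens reported above, while stride=4 stays well below 4096. Reducing the number of frames should shrink the visual token count in the same way.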