OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
https://vchat.opengvlab.com/
MIT License

VideoChat2 Demo returning garbage output #81

Closed AmitRozner closed 9 months ago

AmitRozner commented 9 months ago

Hi, I tried to install VideoChat2 locally and run your demo via demo.py and demo.ipynb. The models load and inference runs, but I get garbage as the answer (for the demo images/videos as well). For example:

['Assistant', 'The M/Difference and disadvantages to the most common/\nI/D-G, they/D/N, without/ A, but we can/ A[D/D/D/Mut/\nA/ D/ M/D/ [D/D/GD/G\n[DR/[/DI/\nFur/Fur/G/D/G,D/G/D/GD/T/G,D/G/Fore,D/G/D/G,  /D, /D/G/G/D,G/GD/HDD[D/D/G[/D/G/D/GD/G/,/G/D/D/G/D/D/GH[D/D/G[D/G/H/G/D/D/G/G/D/G/G/D/D/GD/H,/G/D/GH/D/GH/D/G/G/D/G/G/D[G/D/G[B/G/[D/G/G/D/G/D/G/D/G/D/D[G/H/G/D/G/D/G/G/G/[/D/GH/GD/G/GH/G/GH/D[D/GH/D[D/D/G/G[H/D/G/D/G/GH/D/G/GH/[GH/GD/G/D/G/D/G/GH/GH[/GH/GH/D/GH/G/D/H/G/D/G/GH/G/D/G/GH/G/D/GH[B/H/GH/G/H/GH/G/D/GH/G/GH/G/D/H/G/H[/GH/G/G/GH/D/H/GH/G/G/G/GH/D/GH/G/G/H/G/GH/H/H/G/D/GH/H[D/GH[D/G/D/G/H/G/G/G/D/G/GH/G/GH/GH/D/H/GH/GH/G/G/G/D/GH/D/GH/G/G/G/G/GH/D/G/G/GH/G/H/G/G/D/G/GH/GH/G/G/G/G/G/GH/D/GH/D/G/G/H/D/GH/G/G/H/G/G/G/G/G/H/G/GH/D/H/G/H/G/D/G/D/G/D/G/H/GH/G/GH/H/D/G/G/G/G/H/GH/GH/G/G/H/G/GH/G/H/G/G/H/G/G/H/H/G/H/D/H/G/H/G/G/G/G/H/H/G/GH/G/G/H/H/H/H/G/G/H/G/H/G/G/H/G/H/G/H/G/GH/G/G/H/H/H/GH/H/H/G/H/G/H/H/G/G/G/GH/G/H/H/G/G/H/H/G/H/H/G/H/H/G/H/G/H/G/H/G/H/G/G/H/G/H/G/G/H/G/H/G/H/G/H/G/HGH/G/H/G/G/H/H/H/G/H/H/H/G/H/GH/H/G/G/H/G/H/G/G/H/H/G/G/H/H/G/H/G/H/G/H/H/G/H/G/H/G/H/G/']], 'sep': '###'}

I tried both video and image inputs, but the same thing happens. I am using the following models:

    "model_cls": "VideoChat2_it",
    "vit_blip_model_path": "./umt_l16_qformer.pth",
    "llama_model_path": "./vicuna-7b-v0",
    "videochat2_model_path": "./videochat2_7b_stage2.pth",

And in demo.py: state_dict = torch.load("./videochat2_7b_stage3.pth", "cpu")

I am using Ubuntu 20.04, Python 3.9, CUDA 11.8. Any clue why this could happen?

Andy1621 commented 9 months ago

It's mainly because you are not using the correct vicuna-v0 weights. Please follow the proper steps to prepare them. Similar issues can be found in https://github.com/Vision-CAIR/MiniGPT-4/issues/12
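For anyone hitting the same symptom: the MiniGPT-4 thread linked above resolves it by merging the published Vicuna v0 delta weights onto the original LLaMA-7B weights with FastChat. A minimal sketch of that merge step (the paths are placeholders, and it assumes FastChat is installed and you already have the original LLaMA-7B checkpoint in Hugging Face format):

```shell
# Install FastChat, which ships the delta-merging tool.
pip install fschat

# Merge the published v0 delta onto base LLaMA-7B to produce vicuna-7b-v0.
# --base-model-path:   your original LLaMA-7B checkpoint (placeholder path)
# --target-model-path: output directory for the merged weights
# --delta-path:        the v0 delta published on the Hugging Face Hub
python -m fastchat.model.apply_delta \
    --base-model-path /path/to/llama-7b-hf \
    --target-model-path ./vicuna-7b-v0 \
    --delta-path lmsys/vicuna-7b-delta-v0
```

The resulting ./vicuna-7b-v0 directory is what llama_model_path should point to. Using a v1.x Vicuna (or raw LLaMA) with a model trained against v0 typically produces exactly this kind of token-salad output, because the tokenizer/weight mismatch corrupts every generated token.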

AmitRozner commented 9 months ago

Thanks, this solved the issue.