OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
https://vchat.opengvlab.com/
MIT License

How does it work? #20

Closed: ArtemBernatskyy closed this issue 7 months ago

ArtemBernatskyy commented 1 year ago

Thanks for the awesome work!

I have tried to understand how it works from the code, but it's hard for me :(

How does it describe what's happening in the images? What tech do you use? Thanks!

Andy1621 commented 1 year ago

Thanks for your question! For video_chat, you can think of the problem as producing a detailed video description. We use models for image captioning, dense captioning with boxes, video classification, and so on. That information is then fed into an LLM like ChatGPT so it can understand the video. More importantly, we also design the prompt so that the LLM can understand time.
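
To illustrate the "prompt that understands time" part, here is a minimal sketch (not the repo's actual code): per-second perception outputs are written as plain text lines with timestamps before being handed to the LLM. All variable names, captions, and wording below are illustrative assumptions.

```python
# Hypothetical per-second image captions (e.g. from tag2text).
captions = {
    1: "a man in a red jacket walks into a kitchen",
    2: "the man opens the fridge and takes out a bottle",
}
# Hypothetical video-level action labels (e.g. from uniformerv2).
actions = ["opening a fridge"]

# Prefix each caption with its timestamp so the LLM can reason about order and timing.
lines = [f"Second {t}: {cap}" for t, cap in sorted(captions.items())]
init_prompt = (
    "The video is described second by second below:\n"
    + "\n".join(lines)
    + "\nDetected actions: " + ", ".join(actions)
    + "\nAnswer questions about the video using only this information."
)
print(init_prompt)
```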

ArtemBernatskyy commented 1 year ago

I see, can you describe it in a little more detail, please? Especially the workflow for describing images/videos. Thanks!

cocoshe commented 1 year ago

I think it works in two stages. In the first stage, the explicit information is prepared in natural-language format instead of latent codes or embeddings:

  1. Load the video, i.e., extract its frames.
  2. Run an image-level caption model (tag2text), a dense caption model (grit), and the intern action model (uniformerv2), and use a summarizing model (a T5 model) on the tag2text output. These models are meant to recognize "what is in the video" and "what the person in the video is doing". Now we have some information about the video in natural-language format.

In the second stage, the outputs of the first-stage models are used to write an init prompt for ChatGPT, telling the chatbot what information it has and what it needs to do. Your questions are then appended after the init prompt, and everything is sent to the OpenAI API (see the sketch below).
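
For illustration, here is a rough Python skeleton of that two-stage flow. The helper functions (`load_frames`, `tag2text_caption`, `grit_dense_caption`, `uniformerv2_action`, `t5_summarize`) are hypothetical placeholders for the real model wrappers, and the prompt wording is made up; only the overall shape of the pipeline follows the description above.

```python
import openai  # assumes the pre-1.0 openai client that was current at the time


def answer_about_video(video_path: str, question: str) -> str:
    # Stage 1: turn the video into explicit natural-language information.
    frames = load_frames(video_path, fps=1)            # hypothetical: one frame per second
    captions = [tag2text_caption(f) for f in frames]   # "what is in the frame"
    dense = [grit_dense_caption(f) for f in frames]    # region captions with boxes
    action = uniformerv2_action(frames)                # video-level action label
    summary = t5_summarize(" ".join(captions))         # condensed overall description

    # Stage 2: pack everything into an init prompt and query the LLM.
    per_second = "\n".join(
        f"Second {i + 1}: {c} Dense caption: {d}"
        for i, (c, d) in enumerate(zip(captions, dense))
    )
    init_prompt = (
        f"Video summary: {summary}\nDetected action: {action}\n{per_second}\n"
        "Answer the user's question based only on the information above."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": init_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"]
```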

However, just curious: what's the difference between tag2text and grit? They both seem to do dense captioning. Or did I get something wrong?

yinanhe commented 1 year ago

Thanks for your great reply! We use tag2text mainly to get a time description, while grit provides richer character descriptions and bounding boxes. The LLM can use the bounding boxes and the clothing information of the characters to determine whether it is the same person, and to follow their trajectory!
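
To make that concrete, here is a purely illustrative example (made-up text, not real grit output) of why boxes plus clothing descriptions help: the same "man in a red jacket" shows up at shifting box coordinates, so the LLM can infer a single identity moving across the frame.

```python
# Illustrative only: hypothetical per-second dense captions with bounding boxes.
# The same clothing description at shifting coordinates suggests one person
# moving from left to right across the frame.
dense_captions = {
    1: "a man in a red jacket [50, 120, 140, 380]",
    2: "a man in a red jacket [210, 118, 300, 382]",
    3: "a man in a red jacket [380, 121, 470, 379]",
}
for second, caption in dense_captions.items():
    print(f"Second {second}: {caption}")
```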

cocoshe commented 1 year ago

> Thanks for your great reply! We use tag2text mainly to get a time description, while grit provides richer character descriptions and bounding boxes. The LLM can use the bounding boxes and the clothing information of the characters to determine whether it is the same person, and to follow their trajectory!

Still wondering... The dense caption information (the grit output) is added here: https://github.com/OpenGVLab/Ask-Anything/blob/d3478208dc972b32e300e9ca955dde1f4afc3013/video_chat/app.py#L72 and the image caption information (the tag2text output) here: https://github.com/OpenGVLab/Ask-Anything/blob/d3478208dc972b32e300e9ca955dde1f4afc3013/video_chat/app.py#L86. I think the "time description" might be what is added by the `f"Second {i+1}"` prefix. I still can't figure out the difference between tag2text and grit; they seem to serve the same function?

Here is the tag2text output:

[tag2text screenshot]

Here is the grit output:

[grit screenshot]

They seem similar; did I get something wrong?

yinanhe commented 1 year ago

You are showing the 2023/04/20 updated tag2text with SAM; we use the original version.

cocoshe commented 1 year ago

> You are showing the 2023/04/20 updated tag2text with SAM; we use the original version.

This is from the tag2text paper, but I still can't figure out why it gives a "time description". Maybe it is related to "tag guidance"?

[figure from the tag2text paper]

yinanhe commented 1 year ago

Hi, sorry for the late reply. The time description is our manual prompt.

yinanhe commented 7 months ago

Since there have been no updates for a long time, this issue has been temporarily closed. If you still have any problems, please feel free to reopen it.