Thanks for your question!
For video_chat, you can simplify the problem as generating a detailed video description. We use models for image captioning, dense captioning with boxes, video classification, and so on. That information is then fed into an LLM such as ChatGPT to understand the video. More importantly, we also design the prompt so that the LLM can understand time.
I see, can you describe it in a little more detail please? Especially the workflow for describing images/videos. Thanks!
I think it works in two stages. First stage: prepare explicit information in natural-language format instead of latent codes or embeddings, using a caption/tag model (tag2text), a dense caption model (grit), an action model (uniformerv2), and a summarizing model (a T5 model) for the tag2text output. These models are meant to recognize "what is in the video" and "what the man in the video is doing". Now we have some information about the video in natural-language format.
Second stage: use the output of the first-stage models to write an init prompt for ChatGPT, telling the chatbot what it has and what it needs to do, then add your questions after the init prompt and feed them all to the OpenAI API.
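For illustration only, a minimal sketch of that two-stage flow might look like the following. The model outputs are hard-coded placeholder strings and the prompt wording is made up (this is not the actual video_chat code); it assumes the pre-1.0 `openai` Python client with an API key already configured:

```python
import openai

# --- Stage 1: explicit, natural-language information from the perception models ---
# In video_chat this comes from tag2text, grit, uniformerv2 and a T5 summarizer;
# here the outputs are just hard-coded strings for illustration.
image_captions = ["a man in a red jacket rides a bicycle down a street"]
dense_captions = ["man in red jacket [12, 30, 180, 300]; bicycle [40, 150, 220, 310]"]
action_label = "riding a bike"

# --- Stage 2: an init prompt that tells the chatbot what it has and what to do ---
init_prompt = (
    "You are a chatbot answering questions about a video.\n"
    "You are given frame captions, dense captions with bounding boxes, "
    "and a predicted action label. Answer the user's question from them.\n\n"
    f"Frame captions: {image_captions}\n"
    f"Dense captions: {dense_captions}\n"
    f"Action: {action_label}\n"
)

# The user's question goes after the init prompt, and everything is sent to the OpenAI API.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": init_prompt},
        {"role": "user", "content": "What is the man doing?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```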
However, just curious, what's the difference between tag2text and grit? They both seem to do dense captioning work, or am I getting something wrong?
Thanks for your great reply! We use tag2text mainly to get a time description, while grit provides us with richer character descriptions and bounding boxes. The LLM can use the bounding boxes and the clothing information of the characters to determine whether it is the same person, and to follow his trajectory!
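As a rough illustration (the per-frame detections below are made up, and this is not grit's or video_chat's actual output format), the boxes and descriptions end up as plain text in the prompt, so the LLM can match the repeated clothing description across frames and read the trajectory off the moving box coordinates:

```python
# Made-up per-frame dense captions (description + bounding box); not grit's real
# output format, just the kind of text the LLM ends up seeing.
frames = {
    1: [("man in a red jacket", (12, 30, 180, 300))],
    2: [("man in a red jacket", (40, 35, 205, 305))],
    3: [("man in a red jacket", (80, 40, 240, 310))],
}

# Flattened into prompt text: the repeated clothing description suggests it is the
# same person, and the shifting box coordinates trace his trajectory.
lines = []
for sec, objects in frames.items():
    for desc, (x1, y1, x2, y2) in objects:
        lines.append(f"Frame {sec}: {desc} at box ({x1}, {y1}, {x2}, {y2})")
print("\n".join(lines))
```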
Still wondering...
The dense caption information is here (which is the grit output):
https://github.com/OpenGVLab/Ask-Anything/blob/d3478208dc972b32e300e9ca955dde1f4afc3013/video_chat/app.py#L72
and the image caption information is here (which is the tag2text output):
https://github.com/OpenGVLab/Ask-Anything/blob/d3478208dc972b32e300e9ca955dde1f4afc3013/video_chat/app.py#L86
I think maybe the "time description" is added here: f"Second {i+1}
I can't figure out the difference between tag2text and grit, they seem to serve the same function?
Here is the tag2text:
Here is the grit:
They seem similar, or did I get anything wrong?
You are showing the 2023/04/20 updated tag2text with SAM; we use the original version.
This is from the tag2text paper, but I still can't figure out why there is a "time description". Maybe it is related to "tag guidance"?
Hi, sorry for the late reply. The time description is our manual prompt.
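In other words (a simplified sketch with made-up captions, not the exact app.py code), the f"Second {i+1}" prefix found above is that manual prompt: each sampled frame's caption is tagged with its second before everything is concatenated, so the LLM can reason about when things happen:

```python
# Made-up per-second frame captions (e.g., one sampled frame per second of video).
captions = [
    "a man gets on a bicycle",
    "the man rides the bicycle down the street",
    "the man stops at a traffic light",
]

# The "time description" is just a hand-written prefix like f"Second {i+1}", so the
# LLM knows when each caption happens and can reason about order and duration.
time_described = [f"Second {i + 1}: {cap}" for i, cap in enumerate(captions)]
print("\n".join(time_described))
# Second 1: a man gets on a bicycle
# Second 2: the man rides the bicycle down the street
# Second 3: the man stops at a traffic light
```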
Due to the lack of updates for a long time, your issue has been temporarily closed. If you still have any problems, please feel free to reopen this issue.
Thanks for the awesome job!
I have tried to understand how it works from the code, but it's hard for me :(
How does it describe what's happening in the images? What tech do you use? Thanks!