OmkarThawakar / composed-video-retrieval

Composed Video Retrieval
Apache License 2.0

Question about description generation #1

Open jpWang opened 3 months ago

jpWang commented 3 months ago

Hi, thanks for this great work! I want to ask a question: the paper says that the video descriptions are generated by an image captioning model and that hallucination is inevitable. What does "visual inputs within BLIP latent space" mean? I am confused about how hallucinations are filtered out.

[screenshot of the relevant passage from the paper]

Looking forward to your reply, thank you~

jpWang commented 3 months ago

@OmkarThawakar

OmkarThawakar commented 3 months ago

Hi @jpWang,

Thanks for your interest in our work.

To check for hallucinations, we used BLIP and measured the similarity between the visual embedding (the middle frame passed through the BLIP vision encoder and visual projection layer) and the text embedding (the generated description passed through the BLIP text encoder and textual projection layer). We set the hallucination threshold to 0.2 after empirically checking a few examples. If the similarity was above 0.2, we considered the description good; descriptions with similarity below 0.2 were recomputed or manually corrected.
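For concreteness, a minimal sketch of this check using the Hugging Face `transformers` BLIP image-text retrieval model is shown below; the checkpoint, function name, and frame-loading step are assumptions, since only the BLIP projected embeddings, the cosine similarity, and the 0.2 threshold are specified here.

```python
# A minimal sketch of the hallucination check described above, using the
# Hugging Face `transformers` BLIP image-text retrieval model. The checkpoint
# ("Salesforce/blip-itm-base-coco"), the helper function, and the frame-loading
# step are assumptions; the reply only specifies BLIP projected embeddings,
# cosine similarity, and the 0.2 threshold.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
model.eval()

HALLUCINATION_THRESHOLD = 0.2  # empirically chosen, per the reply above

def description_is_reliable(middle_frame: Image.Image, description: str) -> bool:
    """Check a generated description against the video's middle frame."""
    inputs = processor(images=middle_frame, text=description, return_tensors="pt")
    with torch.no_grad():
        # With use_itm_head=False, BLIP returns the cosine similarity between
        # the projected vision embedding and the projected text embedding
        # (the ITC score), matching the check described in the reply.
        similarity = model(**inputs, use_itm_head=False)[0].item()
    return similarity > HALLUCINATION_THRESHOLD

# Usage: descriptions that fail the check would be recomputed or
# manually corrected.
# frame = Image.open("middle_frame.jpg")
# if not description_is_reliable(frame, generated_description):
#     ...  # regenerate / fix the description
```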

Thanks,

jpWang commented 3 months ago

Thanks for your reply! I am curious why BLIP was chosen as the filtering criterion. For example, why not consider using the CLIP model, or the BLIP-2 model?