We know that MiniGPT-v2 consists of a visual backbone (ViT), a linear projection layer, and a large language model (LLaMA2), where the visual backbone uses the ViT from eva_clip_g. However, the vis_processor and text_processor use blip2_image_train and blip_caption, respectively.
So my question is: what is the relationship between eva_clip_g on one side, and blip2_image_train and blip_caption on the other?
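For reference, here is my rough understanding of what these two processor names do, written as a Python sketch. It is modeled on the LAVIS-style processors that MiniGPT-4/v2 builds on; the exact parameters (crop scale, max words) are my assumptions, not copied from the repo, and the `_like` names are hypothetical.

```python
import re

from torchvision import transforms

# Assumption: "blip2_image_train" is an image *preprocessor*, not a model.
# It turns a raw PIL image into a normalized tensor, which the eva_clip_g
# ViT backbone then encodes into visual features.
image_size = 448  # MiniGPT-v2 configs train at 448x448
blip2_image_train_like = transforms.Compose([
    transforms.RandomResizedCrop(
        image_size,
        scale=(0.5, 1.0),  # assumed crop range
        interpolation=transforms.InterpolationMode.BICUBIC,
    ),
    transforms.ToTensor(),
    transforms.Normalize(  # CLIP mean/std, matching the CLIP-trained backbone
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])

# Assumption: "blip_caption" is a text *preprocessor* that cleans and
# truncates the caption string before it reaches the LLaMA2 tokenizer.
def blip_caption_like(caption: str, max_words: int = 50) -> str:
    caption = re.sub(r'([.!"()*#:;~])', " ", caption.lower()).strip()
    return " ".join(caption.split()[:max_words])
```

If that reading is right, eva_clip_g is the vision model itself, while the two processors are just the input pipelines that prepare images and text for the backbone and the LLM. Is my understanding correct?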