We know that MiniGPT-v2 consists of a visual backbone (ViT), a linear projection layer, and a large language model (LLaMA2), where the visual backbone uses the ViT from eva_clip_g. However, the vis_processor and text_processor use blip2_image_train and blip_caption, respectively.
So my question is: what is the relationship between eva_clip_g on one side, and blip2_image_train and blip_caption on the other?
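For reference, here is my rough understanding of what these two processor names do, written as a Python sketch. It is modeled on the LAVIS-style processors that MiniGPT-4/v2 builds on; the exact parameters (crop scale, max words) are my assumptions, not copied from the repo, and the `_like` names are hypothetical.

```python
import re

from torchvision import transforms

# Assumption: "blip2_image_train" is an image *preprocessor*, not a model.
# It turns a raw PIL image into a normalized tensor, which the eva_clip_g
# ViT backbone then encodes into visual features.
image_size = 448  # MiniGPT-v2 configs train at 448x448
blip2_image_train_like = transforms.Compose([
    transforms.RandomResizedCrop(
        image_size,
        scale=(0.5, 1.0),  # assumed crop range
        interpolation=transforms.InterpolationMode.BICUBIC,
    ),
    transforms.ToTensor(),
    transforms.Normalize(  # CLIP mean/std, matching the CLIP-trained backbone
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])

# Assumption: "blip_caption" is a text *preprocessor* that cleans and
# truncates the caption string before it reaches the LLaMA2 tokenizer.
def blip_caption_like(caption: str, max_words: int = 50) -> str:
    caption = re.sub(r'([.!"()*#:;~])', " ", caption.lower()).strip()
    return " ".join(caption.split()[:max_words])
```

If that reading is right, eva_clip_g is the vision model itself, while the two processors are just the input pipelines that prepare images and text for the backbone and the LLM. Is my understanding correct?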