Hello, thanks for your interesting work.
In Section 4.2 of your paper, you mention captions generated by Cap3D either in its captioning setup or in its VQA setup. I am wondering which BLIP-2 model you used to obtain the captions in the captioning setup (no VQA).
Did you use the BLIP-2 model fine-tuned for captioning (`caption_coco_flant5xl`) or the original xxl model (`pretrain_flant5xxl`) without an input prompt?
Looking at this file, it seems you used the original model `pretrain_flant5xxl` for both setups.
Thanks in advance,
Andrea