Hello! I honestly have not experimented much with BLIP models and used the one lambdalabs used in their work. It corresponds to BLIP w/ ViT-B and CapFilt-L. Using BLIP w/ ViT-L fine-tuned on COCO captioning might give different results; that is what they use in their demo on HuggingFace Spaces: https://huggingface.co/spaces/Salesforce/BLIP
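In case it's useful, here is a minimal sketch of how you could compare the two variants via the `transformers` BLIP checkpoints. The checkpoint names (which, as far as I know, roughly correspond to the ViT-B and ViT-L captioning models) and the local image path are assumptions on my part, not necessarily what lambdalabs used:

```python
# Minimal sketch: compare captions from the base (ViT-B) and large (ViT-L)
# BLIP captioning checkpoints on HuggingFace. Checkpoint names and the image
# path are assumptions, not necessarily what lambdalabs used.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

image = Image.open("pokemon.png").convert("RGB")  # hypothetical local image

for checkpoint in ("Salesforce/blip-image-captioning-base",
                   "Salesforce/blip-image-captioning-large"):
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_length=40)  # default (deterministic) decoding
    print(checkpoint, "->", processor.decode(out[0], skip_special_tokens=True))
```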
I see. Thank you, I think it comes down to experimenting with which BLIP model to choose.
Btw, I haven't found any info on which BLIP model lambdalabs used exactly. I tried all of the image-captioning models and got the following results on the pokemon dataset.
None of them gives exactly "a red and white ball with an angry look on its face".
Indeed, they just say to use https://github.com/salesforce/BLIP. Did you try running the generation with the same model several times? You should not get the same caption each time if you use nucleus sampling; maybe this is what they did. You could ask them in their repository.
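To illustrate the sampling point, here is a rough sketch using the `transformers` BLIP API: with `do_sample=True` and `top_p` set, repeated calls can return different captions for the same image. The checkpoint name and image URL below are just placeholders:

```python
# Rough sketch: nucleus (top-p) sampling with BLIP, so repeated runs can
# yield different captions for the same image. Checkpoint and URL are placeholders.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

url = "https://example.com/pokemon.png"  # hypothetical image URL
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = processor(images=image, return_tensors="pt")

for _ in range(3):
    out = model.generate(**inputs, do_sample=True, top_p=0.9, max_length=40)
    print(processor.decode(out[0], skip_special_tokens=True))
```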
Hello,
Have you experimented with different BLIP models? What was the best choice in your Magic/OnePiece case?