YaYaB / finetune-diffusion

MIT License
109 stars · 9 forks

What's the best BLIP checkpoint #2

Closed zetyquickly closed 1 year ago

zetyquickly commented 1 year ago

Hello,

Have you experimented with different BLIP models? What was the best choice in your Magic/OnePiece case?

YaYaB commented 1 year ago

Hello! I honestly did not experiment much with BLIP models and used the one lambdalabs used in their work. It corresponds to BLIP w/ ViT-B and CapFilt-L. Using BLIP w/ ViT-L on COCO captioning might give different results. This is what they use in their demo on Hugging Face Spaces: https://huggingface.co/spaces/Salesforce/BLIP
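
For what it's worth, here is a minimal sketch of how the two variants could be compared, assuming the Hugging Face Transformers checkpoints `Salesforce/blip-image-captioning-base` (ViT-B backbone) and `Salesforce/blip-image-captioning-large` (ViT-L backbone) roughly correspond to the variants mentioned above; the image path is a placeholder.

```python
# Sketch: caption the same image with two BLIP variants and compare.
# The checkpoint names below are assumptions about which Hub models
# correspond to the ViT-B (CapFilt-L) and ViT-L captioning variants.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

image = Image.open("some_image.png").convert("RGB")  # placeholder path

for ckpt in ["Salesforce/blip-image-captioning-base",    # ViT-B backbone
             "Salesforce/blip-image-captioning-large"]:  # ViT-L backbone
    processor = BlipProcessor.from_pretrained(ckpt)
    model = BlipForConditionalGeneration.from_pretrained(ckpt)
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    print(ckpt, "->", processor.decode(out[0], skip_special_tokens=True))
```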

zetyquickly commented 1 year ago

I see. Thank you. I think which BLIP model to choose comes down to experimentation.

Btw, I haven't found info on exactly which BLIP model lambdalabs used. I tried all the image-captioning models and got the following results on the Pokémon dataset.

[Screenshot, 2022-10-27: captions produced by the tested image-captioning models on a Pokémon image]

None of them gives exactly "a red and white ball with an angry look on its face".
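
Roughly the kind of loop used for the comparison; the checkpoint names and the `lambdalabs/pokemon-blip-captions` dataset id here are illustrative assumptions, not necessarily what was run for the screenshot:

```python
# Sketch: run a few image-captioning checkpoints on one Pokémon image
# and print each caption next to the dataset's original BLIP caption.
from datasets import load_dataset
from transformers import pipeline

ds = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
sample = ds[0]                       # a PIL image plus its BLIP caption
print("dataset caption:", sample["text"])

for ckpt in ["Salesforce/blip-image-captioning-base",
             "Salesforce/blip-image-captioning-large",
             "nlpconnect/vit-gpt2-image-captioning"]:
    captioner = pipeline("image-to-text", model=ckpt)
    print(ckpt, "->", captioner(sample["image"])[0]["generated_text"])
```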

YaYaB commented 1 year ago

Indeed, they just say to use https://github.com/salesforce/BLIP. Did you try running the generation with the same model several times? You should not get the same caption if you use nucleus sampling. Maybe this is what they did? You could ask them in their repository.
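
For example, a minimal sketch of repeated generation with nucleus sampling, using the Hugging Face BLIP checkpoint as a stand-in for whatever lambdalabs actually ran (checkpoint name and image path are placeholders):

```python
# Sketch: generate several captions for the same image with nucleus sampling.
# With do_sample=True and top_p < 1.0 the captions should differ between runs,
# which is the behaviour discussed above.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-base"  # assumed checkpoint
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("pokemon.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

for i in range(5):
    out = model.generate(**inputs, do_sample=True, top_p=0.9,
                         max_new_tokens=30)
    print(i, processor.decode(out[0], skip_special_tokens=True))
```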