cambridgeltl / visual-spatial-reasoning

[TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.
Apache License 2.0

I have a CLIP implementation #2

Open Sohojoe opened 1 year ago

Sohojoe commented 1 year ago

Hi there,

thank you for the dataset.

I've implemented a CLIP benchmark of the dataset -> CLIP_visual-spatial-reasoning

I found I was able to go from 50% to ~55% accuracy in a true zero-shot setting (i.e. no retraining at all) purely through prompt engineering. I'm implementing retraining now and will keep updating this issue with the results.
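Roughly, the zero-shot scoring works like this (a simplified sketch, assuming the Hugging Face `openai/clip-vit-base-patch32` checkpoint and a caption / negated-caption prompt pair; the exact prompts and data loading are in the linked repo):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def predict(image_path: str, caption: str) -> int:
    """Return 1 if CLIP scores the caption above its negation for this image."""
    image = Image.open(image_path).convert("RGB")
    # Illustrative prompt pair only; the actual prompt templates may differ.
    prompts = [
        f"a photo where {caption.lower()}",
        f"a photo where it is not true that {caption.lower()}",
    ]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # image-text similarity per prompt
    return int(logits.argmax().item() == 0)
```

Accuracy is then just the fraction of VSR examples where this prediction matches the gold true/false label.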

hardyqr commented 1 year ago

Thanks a lot for this! I was also thinking about CLIP baselines — so happy to see that it’s already being done so nicely :)

Please do keep us posted.