Closed by mathildecaron31 1 year ago
Hi @mathildecaron31, thank you for your interest in EVA and your valuable question.
We believe that using image-text paired data is now a kind of self-supervised learning, because more and more images / videos naturally come with text / captions (these days it is arguably easier to find an image with text than one without). Image-text pairs are abundant on the Internet, and the data collection process is scalable. So it is fair to say that images are now born with text.
Notice that here and here are results from EVA-CLIP, so we not only "distill from OpenAI CLIP encoders" but also directly use LAION-400M image-text data for training.
Then it might be nice to add this information to the table so that it is clear to the reader that you are using text supervision, given that, unlike EVA, none of the other methods in the table use any text supervision.
Thanks a lot for your answer and congrats again on the work :)
Hi,
Congrats on this interesting work.
I am wondering whether the comparisons with previous works here and here are really fair, given that you are effectively distilling from CLIP encoders (which use text supervision).