Closed by mathildecaron31 1 year ago
Hi @mathildecaron31, thank you for your interest in EVA and your valuable question.
We believe that using image-text paired data is now a kind of self-supervised learning, because more and more images / videos naturally come with text / captions (these days it is arguably easier to find an image with text than one without). Image-text pairs are abundant on the Internet, and the data collection process is scalable. So it is fair to say that images are now born with text.
Notice that here and here are results from EVA-CLIP, so we not only "distill from OpenAI CLIP encoders" but also directly use LAION-400M image-text data for training.
Then it might be nice to add this information to the table so that it is clear to the reader that you are using text supervision, given that, unlike EVA, none of the other methods in the table use any text supervision.
Thanks a lot for your answer and congrats again on the work :)
Hi,
Congrats on this interesting work.
I am wondering whether the comparisons with previous works here and here are really fair, given that you are effectively distilling from CLIP encoders (which use text supervision).