NoviScl / Design2Code


Why do you prefer to choose CLIP embedding to calculate high-level similarity instead of others? Are there any considerations? #19

Closed · Crankyxx closed this 7 months ago

StevenyzZhang commented 7 months ago

I guess the CLIP feature extractor can be considered a general-purpose feature extractor, since it is trained on a large collection of internet images.

Other considerations are:

  1. Vanilla ViT trained on ImageNet data. We suspect its training data is not as general as CLIP's.
  2. Document encoders trained on text-intensive image data. Since we mask all the text before calculating the high-level similarity, these would not be a good fit.

In general, we didn't see any specialized visual encoder as a particularly good fit, so we went with a general and popular one.
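For concreteness, here is a minimal sketch of what comparing two screenshots via CLIP image embeddings can look like. The checkpoint name, the helper function, and the omission of the text-masking step are illustrative assumptions for this sketch, not the repo's actual evaluation code:

```python
# Minimal sketch: high-level similarity between two screenshots as the
# cosine similarity of their CLIP image embeddings. Assumes the Hugging
# Face transformers library and the "openai/clip-vit-base-patch32"
# checkpoint; text masking (done in the actual pipeline) is omitted here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path_a: str, image_path_b: str) -> float:
    """Cosine similarity between CLIP image embeddings of two screenshots."""
    images = [Image.open(p).convert("RGB") for p in (image_path_a, image_path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)     # shape: (2, embed_dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize
    return (feats[0] @ feats[1]).item()

# Hypothetical usage:
# score = clip_similarity("reference.png", "generated.png")
```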

If you have any suggestions, please let us know.