We consider the CLIP feature extractor a general-purpose feature extractor, as it is trained on a large amount of internet images.
Other considerations are:
A vanilla ViT trained on ImageNet. Its training data is likely not as general as CLIP's.
A document encoder trained on text-intensive image data. Since we mask all text before computing the high-level similarity, this is likely not a good fit.
In general, we do not see any specialized visual encoder as a particularly good fit, so we opt for a general and popular one.
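The high-level similarity described above can be sketched as a cosine similarity between feature vectors extracted by the chosen encoder. The sketch below shows only the similarity step; the CLIP feature extraction itself (model choice, preprocessing, and the text masking applied beforehand) is assumed and omitted:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors.

    In our setting, `a` and `b` would be image embeddings produced by
    a general-purpose extractor such as CLIP, computed on images with
    all text regions masked out (that step is not shown here).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Because the similarity is computed on masked images, any encoder that depends heavily on rendered text (such as a document encoder) would contribute little signal at this step.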
If you have any suggestions, please let us know.