NoviScl / Design2Code


Why do you prefer to choose CLIP embedding to calculate high-level similarity instead of others? Are there any considerations? #19

Closed · Crankyxx closed this 7 months ago

StevenyzZhang commented 7 months ago

I guess the CLIP feature extractor can be considered a general-purpose feature extractor, since it is trained on a large collection of internet images.

Other considerations are:

  1. Vanilla ViT trained on ImageNet data. We suspect its training data is not as general as CLIP's.
  2. Document encoders trained on text-intensive image data. Since we mask all the text before calculating the high-level similarity, these would not be a good fit.

In general, we didn't see any specialized visual encoder as a particularly good fit, so we went with a general and popular one.
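For concreteness, here is a minimal sketch of what comparing two screenshots via CLIP image embeddings can look like. The checkpoint name, the helper function, and the omission of the text-masking step are illustrative assumptions for this sketch, not the repo's actual evaluation code:

```python
# Minimal sketch: high-level similarity between two screenshots as the
# cosine similarity of their CLIP image embeddings. Assumes the Hugging
# Face transformers library and the "openai/clip-vit-base-patch32"
# checkpoint; text masking (done in the actual pipeline) is omitted here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path_a: str, image_path_b: str) -> float:
    """Cosine similarity between CLIP image embeddings of two screenshots."""
    images = [Image.open(p).convert("RGB") for p in (image_path_a, image_path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)     # shape: (2, embed_dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize
    return (feats[0] @ feats[1]).item()

# Hypothetical usage:
# score = clip_similarity("reference.png", "generated.png")
```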

If you have any suggestions, please let us know.