The input of CLIP is still 336

NVlabs / EAGLE

EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

https://arxiv.org/pdf/2408.15998

Apache License 2.0

Open zhlhlhlhl opened 1 month ago

zhlhlhlhl commented 1 month ago

I checked the shapes of the input x and the output features of the CLIP ViT. The input resolution still appears to be 336, not 448. Is something wrong?
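For reference, here is a rough way to cross-check the resolution from the feature shape alone. This is only a sketch: it assumes a square input, a patch size of 14, and an optional class token, none of which is confirmed in this thread. The helper names `expected_tokens` and `infer_resolution` are made up for illustration and are not part of the EAGLE code. Under these assumptions, a 336 input would give 577 tokens and a 448 input would give 1025.

```python
import math


def expected_tokens(image_size: int, patch_size: int = 14, cls_token: bool = True) -> int:
    """Number of tokens a ViT produces for a square image (assumed patch size 14)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one edge
    return side * side + (1 if cls_token else 0)


def infer_resolution(num_tokens: int, patch_size: int = 14, cls_token: bool = True) -> int:
    """Recover the square input resolution implied by a token count."""
    patches = num_tokens - (1 if cls_token else 0)
    side = math.isqrt(patches)
    if side * side != patches:
        raise ValueError("token count does not correspond to a square patch grid")
    return side * patch_size


if __name__ == "__main__":
    # Compare the two resolutions discussed in this issue.
    for size in (336, 448):
        print(size, expected_tokens(size))
```

If the sequence length of the output features maps back to 336 with `infer_resolution`, the image was most likely resized to 336 before reaching the encoder, which would match what I am seeing.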