NVlabs / EAGLE

EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
https://arxiv.org/pdf/2408.15998
Apache License 2.0
541 stars, 45 forks

The input of CLIP is still 336 #22

Open zhlhlhlhl opened 1 month ago

zhlhlhlhl commented 1 month ago

I checked the shapes of the input x and the output features of the CLIP ViT, and the input resolution still appears to be 336, not 448. Is something wrong?
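One way to make this check concrete is to compare the number of output tokens against what each resolution would produce. The sketch below assumes a patch size of 14 (as in CLIP ViT-L/14) and a single class token, neither of which is stated in this issue. The helper names `vit_token_count` and `infer_image_size` are hypothetical and not part of the EAGLE code base.

```python
import math


def vit_token_count(image_size: int, patch_size: int = 14, cls_token: bool = True) -> int:
    """Tokens a ViT emits for a square image (patch_size=14 is an assumption)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if cls_token else 0)


def infer_image_size(num_tokens: int, patch_size: int = 14, cls_token: bool = True) -> int:
    """Recover the square input side length from an observed token count."""
    num_patches = num_tokens - (1 if cls_token else 0)
    side = math.isqrt(num_patches)
    if side * side != num_patches:
        raise ValueError("token count does not correspond to a square patch grid")
    return side * patch_size


if __name__ == "__main__":
    # Print the expected token counts for the two resolutions in question.
    for size in (336, 448):
        print(size, vit_token_count(size))
```

Under these assumptions, a 336 input gives a 24 x 24 patch grid and a 448 input gives a 32 x 32 grid, so the second dimension of the output feature tensor would indicate which resolution the encoder actually received.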
