baaivision / EVE

[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
MIT License

What is the motivation behind Patch Aligning Layer (PAL)? #9

Closed: ThisisBillhe closed this issue 3 months ago

ThisisBillhe commented 4 months ago

Thanks for your great work! However, I do not fully understand the function of PAL. According to the paper, PAL is connected to the output of the LLM and is forced to align with CLIP features. Why do the output features of the LLM need to be aligned with CLIP features, and how does this help EVE?

Paranioar commented 3 months ago

(i) We introduce PAL to improve training efficiency, especially at our moderate data scale, where it helps the model learn visual perception from scratch. (ii) In our experiments, its role actually diminishes at larger data scales. This may be because large amounts of high-quality, hyper-detailed captions already greatly enhance the understanding of visual information.
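
For readers who want a concrete picture of what "aligning LLM outputs with CLIP features" could look like, here is a minimal sketch. It is not the EVE implementation. It assumes an MSE-style objective on per-patch features and a single linear projection, and the names `linear`, `patch_alignment_loss`, `total_loss`, and `align_weight` are hypothetical. The paper's actual layer design and loss may differ.

```python
# Hypothetical sketch of a patch-level feature alignment loss (not EVE's code).
# Assumption: an auxiliary layer projects LLM patch outputs into the teacher
# (e.g. CLIP) feature space, and an MSE term pulls the two together.

def linear(x, weight, bias):
    """Project one patch feature vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def patch_alignment_loss(llm_patches, teacher_patches, weight, bias):
    """Mean squared error between projected LLM patch features and teacher features."""
    total, count = 0.0, 0
    for hidden, target in zip(llm_patches, teacher_patches):
        projected = linear(hidden, weight, bias)
        for p, t in zip(projected, target):
            total += (p - t) ** 2
            count += 1
    return total / count


def total_loss(caption_loss, align_loss, align_weight=1.0):
    """Combine the text objective with the auxiliary alignment term."""
    return caption_loss + align_weight * align_loss


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-d projection
    zero_bias = [0.0, 0.0]
    llm_out = [[1.0, 2.0]]                # one patch from the LLM
    teacher = [[0.0, 0.0]]                # matching teacher patch feature
    align = patch_alignment_loss(llm_out, teacher, identity, zero_bias)
    print(align, total_loss(1.0, align, align_weight=0.5))
```

Under this reading, the alignment term acts as an extra supervision signal for visual tokens on top of the caption loss, which fits the point above that it matters most when training data is limited.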