What is the motivation behind Patch Aligning Layer (PAL)?

baaivision / EVE

[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models

MIT License


Thanks for your great work! However, I do not fully understand the function of PAL. According to the paper, PAL is attached to the output of the LLM and is trained to align with CLIP features. Why do the LLM's output features need to be aligned with CLIP features, and how does this help EVE?

(i) We introduce PAL to improve training efficiency, especially at our moderate data scale, where it helps the model learn visual perception from scratch. (ii) In our experiments, its role diminishes at larger data scales. This may be because large amounts of high-quality, highly detailed captions already greatly enhance the understanding of visual information.
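
To make the idea of patch-level alignment concrete, here is a minimal sketch in plain Python. It is not the EVE implementation: the mean-squared-error objective, the single linear projection, and the names `project`, `patch_alignment_loss`, `total_loss`, and `align_weight` are all assumptions chosen for illustration. It only shows how an auxiliary term can pull the LLM's per-patch outputs toward features from a pretrained vision encoder such as CLIP.

```python
def project(features, weight):
    """Linear projection. features: (num_patches x d_in), weight: (d_in x d_out)."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for row in features
    ]


def patch_alignment_loss(llm_patch_states, teacher_patch_feats, weight):
    """Mean squared error between projected LLM patch states and teacher features.

    Assumption: both inputs describe the same patches in the same order.
    """
    projected = project(llm_patch_states, weight)
    total, count = 0.0, 0
    for p_row, t_row in zip(projected, teacher_patch_feats):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def total_loss(text_loss, align_loss, align_weight=1.0):
    """Hypothetical combination: text loss plus a weighted alignment term."""
    return text_loss + align_weight * align_loss


# Toy example: two patches, 2-dim hidden states, identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
llm_states = [[1.0, 2.0], [3.0, 4.0]]
teacher = [[1.0, 2.0], [3.0, 2.0]]

align = patch_alignment_loss(llm_states, teacher, identity)
loss = total_loss(text_loss=0.5, align_loss=align, align_weight=0.1)
```

In this sketch, setting `align_weight` to zero removes the auxiliary term, which is one way to picture the observation in (ii) that the alignment signal matters less once caption data alone provides enough supervision.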


What is the motivation behind Patch Aligning Layer (PAL)? #9