CircleRadon / TokenPacker

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".

About multi-level features #7

Closed daixiangzi closed 1 month ago

daixiangzi commented 1 month ago

From the ablation in the paper, using multi-level features does not seem to improve performance much. [image]

LiWentomng commented 1 month ago

@daixiangzi In this work, we only tested a limited number of multi-level feature groups, because ViT-based vision encoders have many layers. There may be more suitable layer groups, or better ways of combining them, that give better performance. Besides, prior research has also shown that using multiple layers can improve the performance of MLLMs.
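
As a rough illustration of what a "layer group" and a "combination manner" could mean here, the sketch below picks a few encoder layers and concatenates their per-token features along the channel dimension. This is only an assumption for illustration: the function name `fuse_multi_level`, the layer indices, and the choice of plain concatenation are hypothetical and not taken from the TokenPacker code or paper.

```python
# Illustrative sketch only: names, layer indices, and the use of
# concatenation are hypothetical, not taken from the TokenPacker code base.

def fuse_multi_level(hidden_states, layer_ids):
    """Concatenate per-token features from the selected encoder layers.

    hidden_states: list over layers; each entry is a list of tokens,
                   and each token is a list of floats (its feature vector).
    layer_ids:     indices of the layers that form the "layer group".
    """
    selected = [hidden_states[i] for i in layer_ids]
    num_tokens = len(selected[0])
    fused = []
    for t in range(num_tokens):
        token = []
        for layer in selected:
            token.extend(layer[t])  # channel-wise concatenation
        fused.append(token)
    return fused


# Toy stand-in for a ViT output: 4 layers, 2 tokens, 3 channels per token.
# Every value in layer k is simply float(k) so the result is easy to inspect.
toy_states = [
    [[float(layer)] * 3 for _ in range(2)] for layer in range(4)
]

# One possible layer group; other groups are tried by changing layer_ids.
fused = fuse_multi_level(toy_states, layer_ids=[1, 3])
print(len(fused), len(fused[0]))
```

Trying a different layer group only requires changing `layer_ids`, which is why the space of possible groups grows quickly with the number of encoder layers.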