daixiangzi closed 1 month ago
@daixiangzi In this work, we evaluated only a limited number of multi-level feature groups, because ViT-based vision encoders have many layers. There may be more suitable layer groups, or better ways of combining them, that give better performance. In addition, prior work has shown that using features from multiple layers can improve the performance of MLLMs.
Judging from the ablation in the paper, using multi-layer features does not seem to improve performance much.