CircleRadon / TokenPacker

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".

Comparison with avg pooling #6

Closed (johncaged closed this issue 1 month ago)

johncaged commented 2 months ago

Thanks for your great work! I'm curious how TokenPacker compares with average pooling. In my experience, pooling converges faster and performs better than other vision-language connector structures (such as the resampler). It would be great if you could include pooling in your comparison. Thanks again!

LiWentomng commented 2 months ago

@johncaged Hello, thanks for your question. In Table 5 of our paper, the "Baseline" is average pooling, and the further ablations are performed on top of it. We will add the detailed performance of average pooling as a separate method.
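
For readers unfamiliar with the term, average pooling as a vision-language connector can be sketched roughly as follows. This is a minimal, dependency-free illustration and not code from the TokenPacker repository. The function name `avg_pool_tokens`, the square row-major token grid, and the non-overlapping windows are all assumptions made for the example, and the exact baseline configuration used in the paper may differ.

```python
def avg_pool_tokens(tokens, grid_size, window):
    """Average-pool a row-major square grid of visual tokens.

    tokens:    list of grid_size * grid_size feature vectors (lists of floats)
    grid_size: side length of the token grid (assumed square)
    window:    side length of each non-overlapping pooling window
    Returns (grid_size // window) ** 2 pooled vectors in row-major order.
    """
    if len(tokens) != grid_size * grid_size:
        raise ValueError("expected grid_size * grid_size tokens")
    if grid_size % window != 0:
        raise ValueError("window must divide grid_size")

    dim = len(tokens[0])
    out_size = grid_size // window
    pooled = []
    for out_row in range(out_size):
        for out_col in range(out_size):
            # Sum every token vector that falls inside this window.
            acc = [0.0] * dim
            for dr in range(window):
                for dc in range(window):
                    row = out_row * window + dr
                    col = out_col * window + dc
                    vec = tokens[row * grid_size + col]
                    for i in range(dim):
                        acc[i] += vec[i]
            # Divide by the window area to get the mean vector.
            pooled.append([a / (window * window) for a in acc])
    return pooled


# Toy example: a 4x4 grid of 2-dim tokens, pooled with 2x2 windows,
# shrinks from 16 tokens to 4.
grid = [[float(i), float(2 * i)] for i in range(16)]
pooled = avg_pool_tokens(grid, grid_size=4, window=2)
```

In a real model the same operation would normally be done with a tensor library's built-in pooling op on the vision encoder output, followed by a projection into the LLM embedding space. The pure-Python loop here only makes the arithmetic explicit.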