johncaged closed this issue 1 month ago
@johncaged
Hello, thanks for your question. In Table 5 of our paper, the "Baseline" is average pooling, which serves as the starting point for the further ablations. We will add the detailed performance of average pooling as a separate method.
Thanks for your great work! I'm quite curious about the performance comparison between TokenPacker and average pooling, because in my experience, pooling converges faster and achieves better performance than other vision-language connector structures (such as the resampler). It would be great if you could also take pooling into account. Thanks again!
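For readers unfamiliar with the term, here is a minimal sketch of what average pooling over visual tokens might look like as a connector. It is not the TokenPacker code or the paper's baseline implementation; the function name `average_pool_tokens`, the square-grid layout, and the window size are all assumptions made for illustration. In a real connector the pooled tokens would usually then be projected into the language model's embedding space, which is omitted here.

```python
def average_pool_tokens(tokens, grid_size, window):
    """Downsample a square grid of visual tokens by non-overlapping average pooling.

    tokens: list of grid_size * grid_size feature vectors, in row-major order
            (hypothetical layout, assumed for this sketch).
    Returns (grid_size // window) ** 2 pooled vectors, also row-major.
    """
    if len(tokens) != grid_size * grid_size:
        raise ValueError("expected grid_size * grid_size tokens")
    if grid_size % window != 0:
        raise ValueError("window must divide grid_size")

    dim = len(tokens[0])
    out_size = grid_size // window
    pooled = []
    for out_r in range(out_size):
        for out_c in range(out_size):
            # Sum every token that falls inside this window x window patch.
            acc = [0.0] * dim
            for dr in range(window):
                for dc in range(window):
                    r = out_r * window + dr
                    c = out_c * window + dc
                    vec = tokens[r * grid_size + c]
                    for k in range(dim):
                        acc[k] += vec[k]
            # Divide by the patch size to get the mean token.
            pooled.append([a / (window * window) for a in acc])
    return pooled


# Toy example: a 4x4 grid of 2-dim tokens pooled with a 2x2 window gives 4 tokens.
grid = [[float(i), float(2 * i)] for i in range(16)]
pooled = average_pool_tokens(grid, grid_size=4, window=2)
print(len(pooled), pooled[0])
```

With a tensor library the same operation is a single call (for example `torch.nn.functional.avg_pool2d` on a reshaped feature map); the loop version above is only meant to make the token-count reduction explicit.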