bfshi / scaling_on_scales

When do we not need larger vision models?
MIT License
277 stars 9 forks

Questions about paper #5

Open JihwanEom opened 4 months ago

JihwanEom commented 4 months ago

Hello,

I've read your paper and have a few questions that I wanted to discuss:

  1. In Figure 1 of the paper, pooling is done through token averaging. Have you tried any other pooling methods? I understand that the approach averages the patch tokens within each split (for instance, averaging [0~3, 0, 0] if the patch token at the very left edge of the first picture is [0, 0, 0]). However, there seems to be no ablation study on why this design was chosen. While averaging appears to be a reasonable approach, I wonder whether any experiments have been conducted on this.

  2. In Figure 7(a), when comparing the ViT-L and ViT-B-S^2 models, it's noted that ViT-L exhibits slightly better generalization. How would you interpret this from a shape-texture bias perspective? (curious about your thoughts!!) For instance, a television predicted as a chestplate might look very similar in texture to an actual chestplate, and a flute predicted as a triceratops could also be perceived as similar in texture. It seems that the inclusion of multi-scale features might enhance more local (closer-to-texture) features, and I'd like your insight on this.

  3. I keep coming back to the concept of token average pooling :). In the case of the picture in Figure 1, it seems possible to adjust the number of tokens using a resampler on the 32x32x768 feature map instead. I'm curious whether any experiments have been conducted on this.
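For concreteness, here is a minimal sketch of the 2x2 token averaging I'm referring to, using the shapes from Figure 1 (numpy stand-in with random features; the merge is channel-wise concatenation, which is my reading of the figure):

```python
import numpy as np

def avg_pool_tokens(feat, k=2):
    """Average each non-overlapping k x k block of patch tokens."""
    h, w, c = feat.shape
    return feat.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

# Features from the 2x-scale sub-images form a 32x32 grid of 768-d tokens;
# pooling brings it to 16x16 so it matches the original-scale feature map.
large = np.random.rand(32, 32, 768)  # 2x-scale features
base = np.random.rand(16, 16, 768)   # original-scale features
merged = np.concatenate([base, avg_pool_tokens(large)], axis=-1)
print(merged.shape)  # (16, 16, 1536)
```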

Thank you for sharing these interesting results!

Best, Jihwan

bfshi commented 4 months ago

Thanks for the questions Jihwan!

  1. Yes, we've tried a 2x2 conv instead of 2x2 average pooling. It turns out there's no significant difference in performance.

  2. Yeah, that's a good point. My intuition is that more image scales bring larger model capacity and thus better generalization (given sufficient data). We didn't test the shape-texture bias of the model, and it's not clear what the relation is between shape-texture bias and generalization.

  3. Yeah, a resampler can extract relevant information from lots of tokens without introducing too much computational cost. It can also avoid the information loss caused by average-pooling the 32x32 feature map to 16x16. We only tried this on a LLaVA-like model and didn't try the resampler. I think it could help, and you're welcome to try it out! I'd be interested to know the results :)
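For anyone who wants to try it, a minimal sketch of such a resampler (single-head cross-attention with learnable queries, Perceiver-style; the query count of 256 and the omission of projection layers are my simplifications, just to show how the token count changes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resample_tokens(tokens, queries):
    """Each learnable query attends over all input tokens, so the
    32*32 = 1024 tokens shrink to len(queries) output tokens instead
    of being averaged in a fixed 2x2 pattern."""
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)
    return attn @ tokens

tokens = np.random.rand(32 * 32, 768)       # flattened 32x32x768 feature map
queries = np.random.randn(256, 768) * 0.02  # hypothetical learned queries
out = resample_tokens(tokens, queries)
print(out.shape)  # (256, 768): 1024 tokens resampled down to 256
```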

Best, Baifeng