bfshi / scaling_on_scales

When do we not need larger vision models?
MIT License

Why the res is 1008 in paper? #8

Open OpenJarvisAI opened 2 months ago

OpenJarvisAI commented 2 months ago

I know the equivalent would be something like 3x336, but may I ask why it's 3?

Actually you send a total of 14 images into the ViT ((1 + 2x2 + 3x3) = 14), while you compare against LLaVA's single 1x3x336x336 input. Your input is effectively 1x3x14x336x336, i.e. 14 times the pixel footprint.

This comparison is unbalanced.

I think it should be compared against LLaVA-336 with the input interpolated to 786, so that the two sit at the same table...
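For concreteness, the crop arithmetic in the comment above can be sketched as follows (a minimal illustration; the scale values 336/672/1008 are taken from this thread, not read out of the repository's code):

```python
# Count how many 336x336 crops a multi-scale pipeline produces when the
# image is resized to each scale and tiled into base-size crops.
base = 336
scales = [336, 672, 1008]  # assumed scales, per the thread's "res is 1008"

# at scale s the image tiles into an (s//base) x (s//base) grid of crops
crops_per_scale = [(s // base) ** 2 for s in scales]  # [1, 4, 9]
total_crops = sum(crops_per_scale)                    # 14

print(crops_per_scale, total_crops)  # [1, 4, 9] 14
```

This is where the 1 + 2x2 + 3x3 = 14 count in the comment comes from: one crop at 336, a 2x2 grid at 672, and a 3x3 grid at 1008.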

bfshi commented 2 months ago

Hi @OpenJarvisAI,

Thanks for the comment. Yeah, in the paper we compare S2 against directly extracting features from the larger image without splitting (Table 12), and it turns out the latter is much less efficient and performs worse than S2.
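For readers following along, the split-and-merge idea being discussed can be sketched roughly like this. This is a simplified illustration, not the repository's actual API; `vit` stands in for any crop-level feature extractor, the scales are the thread's assumed 336/672/1008, and the per-scale merge here is a plain average for brevity:

```python
import torch
import torch.nn.functional as F

def multiscale_features(image, vit, base=336, scales=(336, 672, 1008)):
    """Sketch of multi-scale extraction with splitting: resize the image to
    each scale, tile it into base-size crops, run the extractor on every
    crop, then merge crop features per scale and concatenate across scales.
    (Hypothetical helper for illustration, not the repo's implementation.)"""
    feats = []
    for s in scales:
        x = F.interpolate(image, size=(s, s), mode="bicubic",
                          align_corners=False)
        n = s // base  # n x n grid of crops at this scale
        # tile into crops: (B, C, n, n, base, base) -> (B*n*n, C, base, base)
        crops = x.unfold(2, base, base).unfold(3, base, base)
        crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(
            -1, image.shape[1], base, base)
        f = vit(crops)  # (B*n*n, D) pooled feature per crop in this sketch
        # merge crop features at this scale (simple average here)
        feats.append(f.reshape(-1, n * n, f.shape[-1]).mean(dim=1))
    # concatenate per-scale features along the channel dimension
    return torch.cat(feats, dim=-1)

# toy usage with a dummy extractor that mean-pools each crop to (N, C)
image = torch.randn(1, 3, 336, 336)
vit = lambda x: x.mean(dim=(2, 3))
out = multiscale_features(image, vit)
print(out.shape)  # torch.Size([1, 9]) -- 3 scales x 3 channels each
```

The no-split baseline in Table 12 would instead feed the full interpolated image straight to the extractor, which is what makes it more expensive at large resolutions.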