I know that; the equivalent would be something like 3x336, but may I ask why it's 3?
Actually you have sent a total of 14 images into the ViT (1 + 2x2 + 3x3 = 14), while you compare against LLaVA's single 1x3x336x336 input. Your input should really be counted as 1x3x14x336x336, an almost 5 times larger footprint.
This is totally imbalanced.
I think it should be compared against LLaVA 336 with the input interpolated to 786, so that the two sit at the same table for comparison...
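To make the crop-count arithmetic above concrete, here is a rough sketch, assuming (as described in this thread) that each scale s yields an s x s grid of 336x336 tiles fed to the ViT; the scale list and pixel-ratio comparison are illustrative, not the repo's actual code:

```python
# Assumed S2-style tiling: scales 336, 672, 1008, each split into
# an s x s grid of 336x336 crops before going through the ViT.
BASE = 336
scales = [1, 2, 3]  # multiples of the base resolution

num_crops = sum(s * s for s in scales)   # 1 + 4 + 9 = 14 images into the ViT
pixels_s2 = num_crops * 3 * BASE * BASE  # total RGB values across all crops
pixels_llava = 3 * BASE * BASE           # single 1x3x336x336 LLaVA input

print(num_crops)                   # 14
print(pixels_s2 / pixels_llava)    # 14.0 (raw pixel ratio vs. one 336 image)
```

By this count the multi-scale input processes 14x the pixels of a single 336x336 image, which is the footprint imbalance the comment is pointing at.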
Thanks for the comment. Yeah in the paper we compare s2 versus directly extracting features from larger image without splitting (Table 12), and it turns out it's much more inefficient and has worse performance than s2.