Open rwightman opened 1 month ago
Hi Ross, For the L model, the window spec results in ~30% more tokens per window, so attention can become costlier, but it does yield a (small) accuracy improvement; if a user is choosing the L model, they are presumably more interested in accuracy, so its window sizes were chosen to be a bit larger. In practice, depending on the hardware and kernels used, the differences in speed and memory may end up being quite small.
Hi @chayryali, I think what @rwightman was getting at is that for a 1024px input, the 4 stages of the image encoder will have feature maps with side lengths of 256, 128, 64, 32 and the last two window sizes of the non-large models don't tile into this cleanly. This makes extra padding steps necessary for the non-large models, so (slight) changes to the window size could lead to improvements.
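To make the padding point concrete, here's a small sketch (not the SAM2 source; `padding_per_stage` is a hypothetical helper) that computes each stage's feature-map side length for a 1024px input (patch size 4, 2x downsample per stage) and how much padding a given window spec forces:

```python
# Sketch: per-stage padding needed so each feature map tiles into
# its window. Assumes a patch size of 4 and a 2x downsample per
# stage, giving side lengths 256, 128, 64, 32 for a 1024px input.
import math

def padding_per_stage(image_size, window_sizes, patch_size=4):
    feats = [image_size // patch_size // (2 ** i) for i in range(len(window_sizes))]
    out = []
    for feat, win in zip(feats, window_sizes):
        padded = math.ceil(feat / win) * win  # round up to a multiple of win
        out.append((feat, win, padded - feat))
    return out

# Base+ spec: stages 3 & 4 (windows 14 and 7) don't tile into 64 and 32
for feat, win, pad in padding_per_stage(1024, (8, 4, 14, 7)):
    print(f"feature {feat:>3} | window {win:>2} | pad {pad}")
```

With the base+ spec this reports padding of 6 and 3 on the last two stages, while the L-style spec `(8, 4, 16, 8)` needs no padding at all.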
For example, here are some timings for different window sizes (using the v2-base+ model):
| Window Sizes | Encoder Time (ms) |
|---|---|
| 8, 4, 14, 7 | 21.4 (original) |
| 8, 4, 16, 8 | 20.3 |
| 8, 8, 8, 8 | 19.6 |
| 16, 16, 16, 16 | 19.5 (interesting!) |
| 12, 12, 12, 12 | 21.8 |
| 2, 2, 2, 2 | 37.9 |
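Timings like these can be gathered with a plain wall-clock harness; a minimal sketch (the `encoder` below is any callable, so a stand-in workload is used here — swap in the SAM2 image encoder with its window spec patched to reproduce real numbers):

```python
import time

def time_encoder(encoder, inputs, warmup=3, iters=20):
    """Average wall-clock time (ms) per call.

    For real GPU numbers, wrap the call with torch.cuda.synchronize()
    so asynchronous kernels are actually counted in the timing.
    """
    for _ in range(warmup):      # let caches / lazy init settle
        encoder(inputs)
    t0 = time.perf_counter()
    for _ in range(iters):
        encoder(inputs)
    return (time.perf_counter() - t0) / iters * 1000.0

# stand-in workload, just to show usage:
print(f"{time_encoder(sum, range(10_000)):.3f} ms")
```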
Incidentally, SAMv1 also used non-tiling window sizes and can even benefit from modifying them (see v1 PR #594). From my testing, though, the v2 models are very sensitive to the stage 2 & 3 sizes (based on a visualization script, for anyone interested), so additional training would likely be needed to support updated sizes.
@chayryali what @heyoeyo said was exactly what I was getting at.
The default image_size appears to be 1024 for all models, but the window specs for T, S & B+ appear to be optimal for image sizes that are multiples of 224 (i.e. 896 or 1120 would be better); only the L model will work with multiples of 256 without extra padding overhead for the windowing...
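The divisibility argument can be sketched in a few lines (`min_clean_image_size` is a hypothetical helper, not from the repo): stage *i* has stride `patch_size * 2**i`, so its feature map tiles into window `w` iff the image size is a multiple of `w * stride`, and the smallest clean image size is the LCM of those products:

```python
# Smallest image size for which every stage's feature map tiles
# cleanly into its window (patch size 4, 2x downsample per stage).
import math

def min_clean_image_size(window_sizes, patch_size=4):
    strides = [patch_size * 2 ** i for i in range(len(window_sizes))]
    return math.lcm(*(w * s for w, s in zip(window_sizes, strides)))

print(min_clean_image_size((8, 4, 14, 7)))  # B+ spec -> 224
print(min_clean_image_size((8, 4, 16, 8)))  # L spec  -> 256
```

This is where the "multiples of 224" vs "multiples of 256" split between the B+ and L specs comes from.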
Curious if there was a reason for this?