facebookresearch / segment-anything-2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

image_size vs window_spec #225

Open rwightman opened 1 month ago

rwightman commented 1 month ago

The default image_size appears to be 1024 for all models, but the window specs for the T, S, and B+ models look optimal for image sizes that are multiples of 224 (i.e. 896 or 1120 would tile more cleanly); only the L model's spec works with multiples of 256 without extra padding overhead for the windowing...

Curious if there was a reason for this?

chayryali commented 3 weeks ago

Hi Ross, for the L model the window spec results in ~30% more tokens per window, so attention could become costlier, but it does yield a (small) improvement in accuracy. If a user is choosing the L model, they are presumably more interested in accuracy, so the window sizes are chosen to be a bit larger. In practice, depending on the hardware and kernels used, the differences in speed and memory may end up being quite small.
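
As a rough sanity check on the ~30% figure, here is the tokens-per-window arithmetic; the (8, 4, 16, 8) spec attributed to the L model here is an assumption based on the repo's configs:

```python
# Tokens per window in stage 3: assumed L-model window size 16 vs. the default 14.
tokens_l = 16 * 16        # 256 tokens per window
tokens_default = 14 * 14  # 196 tokens per window
print(tokens_l / tokens_default)  # ~1.31, i.e. roughly 30% more tokens per window
```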

heyoeyo commented 2 weeks ago

Hi @chayryali, I think what @rwightman was getting at is that for a 1024px input, the 4 stages of the image encoder will have feature maps with side lengths of 256, 128, 64, 32 and the last two window sizes of the non-large models don't tile into this cleanly. This makes extra padding steps necessary for the non-large models, so (slight) changes to the window size could lead to improvements.
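
For concreteness, here is a minimal sketch of that tiling arithmetic. It assumes the usual Hiera layout (patch-embed stride 4, then 2x downsampling per stage, so per-stage strides of 4/8/16/32) and ignores the global-attention blocks, which aren't windowed:

```python
import math

def padding_overhead(image_size, window_spec, stage_strides=(4, 8, 16, 32)):
    """Print, per stage, the feature-map side length and the extra tokens added
    by padding that side up to a multiple of the stage's window size."""
    for stride, win in zip(stage_strides, window_spec):
        side = image_size // stride
        padded = math.ceil(side / win) * win
        extra = padded * padded / (side * side) - 1.0
        print(f"feat {side:>3} | window {win:>2} | padded to {padded:>3} | +{extra:.1%} tokens")

padding_overhead(1024, (8, 4, 14, 7))  # default spec: stages 3 & 4 pad (~+20% tokens each)
padding_overhead(896,  (8, 4, 14, 7))  # multiple of 224: every stage tiles cleanly
```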

For example, here are some timings for different window sizes (using the v2-base+ model):

| Window sizes | Encoder time (ms) |
| --- | --- |
| 8, 4, 14, 7 | 21.4 (original) |
| 8, 4, 16, 8 | 20.3 |
| 8, 8, 8, 8 | 19.6 |
| 16, 16, 16, 16 | 19.5 (interesting!) |
| 12, 12, 12, 12 | 21.8 |
| 2, 2, 2, 2 | 37.9 |

Incidentally, SAMv1 also used non-tiling window sizes and can even benefit from modifying them (see v1 PR #594). Though from my testing, the v2 models are very sensitive to the stage 2 & 3 window sizes (based on a visualization script, for anyone interested), so it seems additional training would be needed to support updated sizes.
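
For anyone who wants to reproduce timings like the ones above, here is a rough sketch of the measurement. The names (build_sam2, the config/checkpoint paths, the image_encoder attribute) reflect the repo layout as I understand it and may need adjusting; changing the window sizes themselves would mean editing window_spec in the model config rather than anything shown here:

```python
import time
import torch
from sam2.build_sam import build_sam2  # entry point as of the current repo layout

# Hypothetical local paths; point these at your own config and checkpoint.
model = build_sam2("sam2_hiera_b+.yaml", "./checkpoints/sam2_hiera_base_plus.pt", device="cuda")
encoder = model.image_encoder
x = torch.randn(1, 3, 1024, 1024, device="cuda")

with torch.inference_mode():
    for _ in range(5):  # warm-up so kernel selection/caching doesn't skew the timing
        encoder(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    n = 20
    for _ in range(n):
        encoder(x)
    torch.cuda.synchronize()
    print(f"encoder time: {(time.perf_counter() - t0) / n * 1000:.1f} ms per image")
```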

rwightman commented 2 weeks ago

@chayryali what @heyoeyo said was exactly what I was getting at.