facebookresearch / segment-anything-2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Change image resolution #138

Open 25benjaminli opened 1 month ago

25benjaminli commented 1 month ago

Similar to SAM 1, SAM 2 was trained on 1024x1024 images. I'm wondering whether it's possible to adapt SAM 2 to 512x512 images without resizing those images (since that takes a lot of time). This was possible in the old SAM 1 by resizing the positional embeddings (a rough sketch of that is included after the code below), but I'm not sure how to go about it in this new repository. After changing image_size in the config, the only thing that seemed to cause an error was this block:

feats = [
    feat.permute(1, 2, 0).view(1, -1, *feat_size)
    for feat, feat_size in zip(vision_feats[::-1], self._bb_feat_sizes[::-1])
][::-1]
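
For reference, the SAM 1 trick mentioned above looked roughly like this. This is only a minimal sketch that assumes the v1 ViT encoder exposes its absolute positional embedding as pos_embed with shape (1, H, W, C); the relative-position tables used in the attention blocks may also need handling depending on the version.

import torch
import torch.nn.functional as F

def resize_sam1_pos_embed(sam, new_image_size, patch_size=16):
    # Sketch only: interpolate SAM 1's absolute positional embedding so the
    # ViT encoder accepts a different input resolution.
    # Assumes sam.image_encoder.pos_embed has shape (1, H, W, C).
    pos_embed = sam.image_encoder.pos_embed            # (1, 64, 64, C) for 1024 inputs
    new_hw = new_image_size // patch_size              # e.g. 512 // 16 = 32
    resized = F.interpolate(
        pos_embed.permute(0, 3, 1, 2),                 # (1, C, H, W) for interpolate
        size=(new_hw, new_hw),
        mode="bilinear",
        align_corners=False,
    ).permute(0, 2, 3, 1)                              # back to (1, H, W, C)
    sam.image_encoder.pos_embed = torch.nn.Parameter(resized)
    sam.image_encoder.img_size = new_image_size
    return sam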

I know SAM 1 uses a typical ViT, but SAM 2 uses Hiera (which I'm not very familiar with). I changed the feature sizes in SAM2ImagePredictor from:

self._bb_feat_sizes = [
    (256, 256),
    (128, 128),
    (64, 64),
]

to:

self._bb_feat_sizes = [
    (128, 128),
    (64, 64),
    (32, 32),
]

and no errors were raised, but I'm not sure whether the segmentation is actually working properly. Would appreciate feedback on this approach or any alternative methods!
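
For anyone wanting to reproduce, a minimal sketch of the test setup might look something like this. The config name is a placeholder for a copy of the large-model yaml with image_size set to 512, and the image/checkpoint paths are placeholders too:

import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config/checkpoint: a copy of the large-model config with image_size: 512
model = build_sam2("sam2_hiera_l_512.yaml", "checkpoints/sam2_hiera_large.pt")
predictor = SAM2ImagePredictor(model)
predictor._bb_feat_sizes = [(128, 128), (64, 64), (32, 32)]  # 512 // 4, // 8, // 16

image = np.array(Image.open("test.jpg").convert("RGB"))  # any test image
with torch.inference_mode():
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[image.shape[1] // 2, image.shape[0] // 2]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

# Masks come back at the original image resolution, so their shape plus a quick
# visual comparison against the 1024 output is an easy sanity check.
print(masks.shape, scores)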

brisyramshere commented 1 month ago

Did you find a solution?

heyoeyo commented 1 month ago

> Did you find a solution?

The change mentioned above to the feature sizes list, along with a corresponding change to the image_size setting (inside the model .yaml configs), should work. More generally, I think you can auto-calculate the feature size list in SAM2ImagePredictor with something like:

# Spatial dim for backbone feature maps
hires_size = self.model.image_size // 4
self._bb_feat_sizes = [[hires_size // (2**k)]*2 for k in range(3)]

This way you only need to change the image_size setting inside the model .yaml files (though only certain image sizes will work without other changes, multiples of 128 should be safe).
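
As a quick check, evaluating that expression for a couple of sizes gives:

def bb_feat_sizes(image_size):
    # Same formula as above: the backbone features come out at strides 4, 8 and 16
    hires_size = image_size // 4
    return [[hires_size // (2**k)] * 2 for k in range(3)]

print(bb_feat_sizes(1024))  # [[256, 256], [128, 128], [64, 64]] -> the current defaults
print(bb_feat_sizes(512))   # [[128, 128], [64, 64], [32, 32]]  -> matches the values listed earlier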

> ...or any alternative methods

If you just want to quickly test out how the models perform at different sizes, I have a script (run_image.py) that can do this interactively using either SAM v1 or v2 models. You can adjust the processing size to 512 by adding a flag -b 512 when running the script.

brisyramshere commented 1 month ago

Thank you very much. But since the pre-trained model was trained at 1024 resolution, will performance decline to some extent after changing to 768 or 512?

heyoeyo commented 1 month ago

> will the performance decline to a certain extent

Yes, the masks will look blockier at lower resolutions and eventually masking fails entirely (or at least requires more prompts to get a good result). The reverse is also (somewhat) true: higher resolutions (above 1024) give cleaner mask edges, though at some point the masking starts to fail.

Here's a comparison between 256, 512, 1024 and 2048 resolutions using SAMv2-large with a single foreground prompt point (same for all examples):

[Image: sam2_res_comparison]

The 4 masks output by the model are shown on the right side of each picture. The 1024 & 512 resolutions are not so different, but you can see (especially in the masks) that the results are more unstable at 256 or 2048.

Edit: For fun, here's the result at 4096x4096, where it still seems to work, though it takes >5GB of VRAM. It gets near pixel-perfect edges:

[Image: seg_4096]

25benjaminli commented 1 month ago

@heyoeyo Thanks for the info. Your repository is impressive. It does make sense that the segmentations would get blockier as the images get smaller: if I remember correctly, the decoder outputs a mask four times smaller in width & height, so upscaling it can end up looking blocky.
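
To illustrate that (just a toy demonstration, not SAM code): shrinking a clean mask down to a coarse grid and then upscaling it again shows where the blockiness comes from.

import torch
import torch.nn.functional as F

# A smooth circular mask at 1024x1024 "ground truth" resolution
yy, xx = torch.meshgrid(torch.arange(1024), torch.arange(1024), indexing="ij")
gt_mask = ((xx - 512) ** 2 + (yy - 512) ** 2 < 300**2).float()[None, None]

# Pretend the decoder only produced a coarse version (e.g. 1/16 of the input size)
low_res = F.interpolate(gt_mask, size=(64, 64), mode="area")

# Upscaling back to full size: bilinear interpolation is smoother than
# nearest-neighbour, but detail finer than the coarse grid is still lost
up_bilinear = F.interpolate(low_res, size=(1024, 1024), mode="bilinear", align_corners=False) > 0.5
up_nearest = F.interpolate(low_res, size=(1024, 1024), mode="nearest") > 0.5

# Fraction of pixels that disagree with the original mask (mostly along the edge)
print((up_bilinear.float() - gt_mask).abs().mean().item())
print((up_nearest.float() - gt_mask).abs().mean().item())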

This is a tangentially related question, but have you found segment anything 2 to create better predictions (at least purely based on observation) compared to segment anything 1?

heyoeyo commented 1 month ago

> have you found segment anything 2 to create better predictions

The results can be very similar when allowing for switching between mask outputs (there are major differences in terms of which outputs hold 'whole object' vs. 'sub-component' results between v1 and v2). Here's a comparison of v1 large (left) vs. v2 large (right) using the mask outputs that target sub-components:

[Image: v1_vs_v2_large]

It's worth noting the resource usage for this example:

Version    Model size (MB)   Speed (ms)   VRAM (MB)
V1 large   1200              200          2300
V2 large   900               150          750

Since the resource usage of v2 is so much lower than v1, it's arguably better in general. I would also say v2 base is significantly better than v1 base (better speed, VRAM & mask quality).

The only noticeable advantage of v1 is that it gives more intuitive results when using box prompts, at least compared to the v2 large & small models (the base and tiny variants give results similar to v1).