wfz666 opened this issue 2 months ago
The main limitations on input sizing for vision transformers come from the patch embedding step and the positional encodings applied to the image tokens. Both SAMv1 & v2 use learned position encodings that exist only for a single input size, which is probably why the size is hard-coded.
That being said, it's common to up/down-scale the position encodings to support different input sizes. The v2 model already does this, but the image segmentation code needs a slight modification to properly support it (see issue #138), while the video segmentation code supports it without modification (see issue #257). The v1 model can also be modified to support different input sizes, and in fact it seems more robust to changes in input resolution than the v2 model.
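For reference, the usual approach is just a 2D resize of the learned embedding grid. Here's a minimal sketch of the idea (not code from either repo; the (1, H, W, C) layout and the function name are assumptions, matching the ViT-style position embeddings SAM uses):

```python
# Minimal sketch of resizing a learned position embedding to a new grid size.
# Assumes the embedding is stored channels-last as (1, H, W, C), like SAMv1's image encoder.
import torch
import torch.nn.functional as F

def resize_position_embedding(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Bilinearly resample a (1, H, W, C) position embedding to a new (H, W) grid."""
    # F.interpolate expects channels-first (1, C, H, W), so permute, resize, permute back
    pos_bchw = pos_embed.permute(0, 3, 1, 2)
    pos_bchw = F.interpolate(pos_bchw, size=new_hw, mode="bilinear", align_corners=False)
    return pos_bchw.permute(0, 2, 3, 1)

# e.g. going from the 1024px grid (64x64 patches) down to a 512px grid (32x32 patches):
# new_pos = resize_position_embedding(old_pos, (32, 32))
```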
I made an override change to the image_size in the build_sam2_video_predictor() method and set it to 512 from the default 1024. The performance of the pre-trained model after this is much, much worse than when it was run at 1024 pixels.
I kind of expected that the pretrained weights would still perform reasonably in comparison. Can you point out if I'm doing something incorrectly? @heyoeyo
I'm using the huggingface_hub checkpoints, which is why I had to edit the method directly.
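For reference, the change amounts to something roughly like this (just a sketch, not my exact edit; the hydra_overrides_extra argument and the ++model.image_size key are assumptions about how the config override would normally be passed, and the config/checkpoint paths are placeholders):

```python
# Rough sketch of overriding the model's image_size when building the video predictor.
# NOTE: hydra_overrides_extra and the "++model.image_size" key are assumptions here.
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    config_file="sam2_hiera_l.yaml",
    ckpt_path="checkpoints/sam2_hiera_large.pt",
    hydra_overrides_extra=["++model.image_size=512"],  # default config value is 1024
)
```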
@divineSix If it runs without errors, it's most likely that the changes you made are correctly implemented. Though if you're getting results that just seem completely unlike the 1024 results, then maybe something about the changes isn't working.
It seems normal for the v2 results to degrade quite badly (especially compared to v1) at altered resolutions. Some of this is due to the IoU prediction often picking relatively bad masks, if you're using that to auto-select the mask. The other common issue I've seen is blocky artifacts which seem to be related to the windowing used by the model's image encoder (especially when using the large model).
Here's an example comparison of the masking results of the large model (top) vs. the tiny model (bottom) at 512px with a single foreground prompt (the mouse position):
You can see blocky edges in the large model mask, whereas the tiny model seems fine. The large model does have a decent mask (second from last, seen on the right), but the IoU score is always too low for it to be picked automatically.
It may be that for your use case, one of the other mask outputs is ok, in which case you might be able to hard-code the mask index (really depends on your use case). Alternatively, it's worth trying a different model size, especially avoiding the large model, which seems to have the most issues at non-1024 resolutions. Box prompts also seem to work better at low resolutions, if that's an option for you.
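If it helps, here's a rough sketch of what I mean by hard-coding the mask index (using the sam2 image predictor; the model ID, image path, box coordinates and chosen index are all placeholders, so worth checking what suits your data):

```python
# Sketch of hard-coding the mask index instead of auto-selecting by predicted IoU score.
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-tiny")

image = np.array(Image.open("frame.jpg").convert("RGB"))  # placeholder image
predictor.set_image(image)

box_xyxy = np.array([100, 100, 200, 200])  # placeholder (x1, y1, x2, y2) box prompt
masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=True)

# Auto-selection would be something like: masks[np.argmax(scores)]
# If the IoU scores are unreliable at lower resolutions, pick a fixed index instead:
chosen_mask = masks[1]  # placeholder index; check which of the returned masks fits your case
```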
@heyoeyo , thanks for the detailed response. It was very helpful. I will try experimenting with smaller models and observe. I am using box prompts for my tests. Out of curiosity, how were you able to visualize the masks for the image (like in the image you shared)? Doing so would be very helpful for me to figure out which settings work for my use case.
EDIT: With regards to my changes, I can confirm the following. The code runs without issue, and I observe speed gains that align with my expectations. On my 3090, working with an FHD video, I get ~30 ms per frame at 512px vs. ~90 ms per frame at 1024px (about 3x). Regarding the quality of the output, I am working on a small clip of soccer, where my box prompt at the start is the ball.
how were you able to visualize the masks for the image (like in the image you shared)?
I was using the 'run image' script found here (running with the script argument -b 512): https://github.com/heyoeyo/muggled_sam?tab=readme-ov-file#run-image
(there's also a visualization script that allows the encoding size to be changed while observing the resulting mask)
swaps from the ball to the player's shoe and eventually starts tracking the player entirely
One thing that might help is to reduce the model's reliance on the 'recent frame' memory encodings. These seem to help keep track of changes to the object's appearance (compared to when you first prompt it), but can have the effect of 'absorbing' other things into the tracking (e.g. picking up the player's foot, then their leg, then the whole player, etc.).
One way to do this would be to provide more prompts (basically weighting the tracking more heavily towards prompted frames). Otherwise, reducing the number of recent frame encodings could help. As far as I can tell, the only way to change this is to modify the existing code in the sam_base.py file:
```python
# Modify sam_base.py to use fewer 'recent frame' encodings when tracking
num_recent_frames_to_use = 2  # the original implementation uses 6 frames

# Originally: for t_pos in range(1, self.num_maskmem):
for t_pos in range(1, 1 + num_recent_frames_to_use):
```
That being said, if the ball is really small in the video, then it may be hard to keep the tracking working at lower resolutions, since it might end up too small to reliably 'see'.
There is a question that has been bothering me for a long time: why does the Segment Anything (v1) code fix the input image size at 1024x1024, when the model is transformer-based and should in principle support image inputs of any size?