facebookresearch / segment-anything-2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

VRAM requirements/benchmarks? #118

Open Zarxrax opened 1 month ago

Zarxrax commented 1 month ago

Is there any data available on how much VRAM is required to run the model at various resolutions? Does the requirement increase depending on the length of the video? I did not find any information about this in the paper.

heyoeyo commented 1 month ago

I haven't gotten to the video stuff yet, but here are some values for peak VRAM usage (from torch.cuda.max_memory_allocated()) when processing a single image for 3 of the models:

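Tiny model (small model is very similar)

| Resolution | Peak VRAM (MB) |
|------------|----------------|
| 576x1024   | 240            |
| 1024x1024  | 458            |
| 2048x2048  | 1158           |

Base-plus model

| Resolution | Peak VRAM (MB) |
|------------|----------------|
| 576x1024   | 328            |
| 1024x1024  | 551            |
| 2048x2048  | 1275           |

Large model

| Resolution | Peak VRAM (MB) |
|------------|----------------|
| 576x1024   | 624            |
| 1024x1024  | 855            |
| 2048x2048  | 1626           |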

This is all running with float16 and without any of the fancy attention optimizations.
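For anyone who wants to reproduce numbers like these, here is a minimal sketch of how such a measurement could be taken. It is not the exact script used for the tables above: `peak_vram_mb` and `dummy_workload` are hypothetical names, the workload is a stand-in for a real model call, and dividing by 1024**2 to get "MB" is an assumption about the unit.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def peak_vram_mb(run_inference):
    """Return the peak CUDA memory allocated by PyTorch (in MB) while
    run_inference() executes, or None if CUDA is not available.

    Hypothetical helper; assumes 1 MB = 1024**2 bytes.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()  # discard peaks from earlier work
    run_inference()
    torch.cuda.synchronize()  # make sure all queued kernels have finished
    return torch.cuda.max_memory_allocated() / (1024 ** 2)


def dummy_workload():
    # Stand-in for a real model call (assumption): allocate a float16
    # "image" on the GPU and do a little arithmetic with it.
    x = torch.rand(1, 3, 1024, 1024, device="cuda", dtype=torch.float16)
    (x * 2).sum().item()


print(peak_vram_mb(dummy_workload))
```

To measure an actual model, replace `dummy_workload` with a function that runs one forward pass. Note that `torch.cuda.max_memory_allocated()` only counts memory allocated through PyTorch's allocator, so tools like `nvidia-smi` will typically report a higher figure.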

I haven't checked out the video processing yet, but I'd expect that the image encoder makes up the bulk of the VRAM usage, so I'd be surprised if the memory requirements for video were much higher (unless video processing accumulates data over time; I'm not sure!).

Zarxrax commented 1 month ago

Thanks, that doesn't look bad at all for single image processing.

I do think video will probably use significantly more VRAM (that's what happens with other video segmentation models), though it would be awesome if they have made progress on that front.

heyoeyo commented 1 month ago

Yes, I think you're right. Glancing at the video code, it looks like the entire frame sequence, along with per-frame image encodings, is cached, among other things.

It looks like it is possible to offload some of the memory usage to regular RAM, but this doesn't seem to apply to the image encoding, and regardless, memory still builds up as more frames are processed, which may be a problem for long videos!
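To get a rough feel for how the cached frame sequence alone grows with video length, here is a back-of-the-envelope sketch. The frame size (1024x1024), channel count, and 4-byte float32 storage are assumptions for illustration, not values confirmed in this thread, and `cached_frames_mib` is a hypothetical helper. The per-frame image encodings are not included, so the real total would be higher.

```python
def cached_frames_mib(num_frames, height=1024, width=1024,
                      channels=3, bytes_per_value=4):
    """Approximate size in MiB of a dense cache holding every video frame.

    All defaults are assumptions: 1024x1024 RGB frames stored as float32.
    """
    total_bytes = num_frames * channels * height * width * bytes_per_value
    return total_bytes / (1024 ** 2)


# Memory grows linearly with the number of frames (12 MiB per frame here).
for n in (100, 1000, 10000):
    print(f"{n:>6} frames: {cached_frames_mib(n):,.0f} MiB")
```

Under these assumptions each frame costs 12 MiB, so a 1000-frame video would need about 12,000 MiB for the frames alone, whether that cache lives in VRAM or is offloaded to regular RAM.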