VRAM requirements/benchmarks?

Zarxrax commented 1 month ago

Is there any data available on how much VRAM is required to run the model at various resolutions? Does the requirement increase depending on the length of the video? I did not find any information about this in the paper.

heyoeyo commented 1 month ago

I haven't gotten to the video stuff yet, but here are some values for peak VRAM usage (from torch.cuda.max_memory_allocated()) when processing a single image for 3 of the models:

Tiny model (small model is very similar)
Resolution	Peak VRAM (MB)
576x1024	240
1024x1024	458
2048x2048	1158

Base-plus model
Resolution	Peak VRAM (MB)
576x1024	328
1024x1024	551
2048x2048	1275

Large model
Resolution	Peak VRAM (MB)
576x1024	624
1024x1024	855
2048x2048	1626

This is all running with float16 and without any of the fancy attention optimizations.
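For anyone wanting to collect similar numbers on their own GPU, here's a minimal sketch of how peak usage can be read from torch.cuda.max_memory_allocated(). The helper name `measure_peak_vram_mb` and the `run_once` callable are made up for illustration, and dividing by 1024**2 is an assumption about how bytes get converted to MB:

```python
from typing import Callable, Optional


def measure_peak_vram_mb(run_once: Callable[[], object]) -> Optional[float]:
    """Run `run_once` and return the peak allocated CUDA memory in MB.

    Returns None if PyTorch is not installed or no CUDA device is available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    # Clear cached blocks and reset the peak counter so earlier work is not counted
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # No gradients are needed for inference, which keeps activations from piling up
    with torch.inference_mode():
        run_once()

    # Wait for queued GPU work to finish before reading the counter
    torch.cuda.synchronize()

    # Assumption: MB here means 1024**2 bytes
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
```

You'd call it with something like `measure_peak_vram_mb(lambda: model(image_tensor))`, where `model` and `image_tensor` are placeholders for a model and input that are already on the GPU in float16. Note that max_memory_allocated() only counts memory held by tensors, so the figure reported by nvidia-smi will usually be higher.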

I haven't checked out the video processing yet, but I'd expect that the image encoder makes up the bulk of the VRAM usage, so I'd be surprised if the memory requirements were much higher (unless processing videos accumulates data over time, I'm not sure!).

Zarxrax commented 1 month ago

Thanks, that doesn't look bad at all for single image processing.

I do think video will probably use significantly more VRAM, though (that's what happens with other video segmentation models). It would be awesome if they have made progress on that front.

heyoeyo commented 1 month ago

Yes, I think you're right. Glancing at the video code, it looks like the entire frame sequence is cached along with the per-frame image encodings, among other things.

It looks like it is possible to offload some of the memory usage to regular RAM, but this doesn't seem to apply to the image encodings, and regardless, the usage still builds up over time... may be a problem for long videos!
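To get a rough feel for how that build-up might scale, here's a back-of-the-envelope sketch. The shapes are assumptions for illustration only, not values read from the code: a 3x1024x1024 frame and a 256x64x64 encoding per frame, both stored in float16. The function name `estimate_cache_mb` is made up, and this ignores anything else the video code caches:

```python
def estimate_cache_mb(
    num_frames: int,
    frame_shape: tuple = (3, 1024, 1024),    # assumed cached frame size (C, H, W)
    encoding_shape: tuple = (256, 64, 64),   # assumed per-frame encoding size (C, H, W)
    bytes_per_value: int = 2,                # float16
) -> float:
    """Estimate memory (in MB) for caching frames plus per-frame encodings."""

    def num_values(shape: tuple) -> int:
        total = 1
        for dim in shape:
            total *= dim
        return total

    # Each cached frame costs the frame itself plus its image encoding
    per_frame_bytes = (num_values(frame_shape) + num_values(encoding_shape)) * bytes_per_value

    # The cache grows linearly with the number of frames
    return num_frames * per_frame_bytes / (1024 ** 2)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(f"{n} frames: ~{estimate_cache_mb(n):.0f} MB")
```

Under these assumed shapes each frame costs 8 MB (6 MB for the frame, 2 MB for the encoding), so the cache alone would pass the single-image peak of the large model after a couple hundred frames. Offloading the frames to regular RAM would remove the larger share, but the encoding part would still grow with video length.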

facebookresearch / segment-anything-2

VRAM requirements/benchmarks? #118