Slow inference speed on the 3090 Ti

I use SAM2AutomaticMaskGenerator for a real-time app, but it is too slow to be usable: over 1 second per frame, even with the tiny model.

Or is there another generator better suited to real-time apps? build_sam2_video_predictor needs a path to a list of images...

Thank you very much!

Here is my env:

Package                   Version       Editable project location
------------------------- ------------- -------------------------
antlr4-python3-runtime    4.9.3
asttokens                 2.4.1
comm                      0.2.2
contourpy                 1.3.0
cycler                    0.12.1
debugpy                   1.8.7
decorator                 5.1.1
einops                    0.8.0
exceptiongroup            1.2.2
executing                 2.1.0
filelock                  3.16.1
flash-attn                2.6.3
fonttools                 4.54.1
fsspec                    2024.10.0
hydra-core                1.3.2
iopath                    0.1.10
ipykernel                 6.29.5
ipython                   8.29.0
jedi                      0.19.1
Jinja2                    3.1.4
jupyter_client            8.6.3
jupyter_core              5.7.2
kiwisolver                1.4.7
MarkupSafe                3.0.2
matplotlib                3.9.2
matplotlib-inline         0.1.7
mpmath                    1.3.0
nest-asyncio              1.6.0
networkx                  3.4.2
numpy                     2.1.2
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.1.105
omegaconf                 2.3.0
opencv-python             4.10.0.84
packaging                 24.1
parso                     0.8.4
pexpect                   4.9.0
pillow                    11.0.0
pip                       22.0.2
platformdirs              4.3.6
portalocker               2.10.1
prompt_toolkit            3.0.48
psutil                    6.1.0
ptyprocess                0.7.0
pure_eval                 0.2.3
Pygments                  2.18.0
pyparsing                 3.2.0
python-dateutil           2.9.0.post0
PyYAML                    6.0.2
pyzmq                     26.2.0
SAM-2                     1.0           /home/marco/Build/sam2
setuptools                59.6.0
six                       1.16.0
stack-data                0.6.3
sympy                     1.13.1
torch                     2.3.1
torchvision               0.18.1
tornado                   6.4.1
tqdm                      4.66.6
traitlets                 5.14.3
triton                    2.3.1
typing_extensions         4.12.2
wcwidth                   0.2.13
wheel                     0.44.0

and sample code like:

run result:

That card should be able to run the (tiny) image encoder in 5-10ms and the prompt/mask decoder in around 1-1.5ms per prompt. The auto mask generator encodes the input image once and then uses a 32x32 grid of point prompts (by default), so a total of 1024 prompts are needed. Worst case, that should give a run time (per frame) of around:

Total inference time = 1*(10ms per frame) + 1024*(1.5ms per prompt) = 1546ms
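To see how that estimate scales with the prompt grid, here is a minimal sketch of the same arithmetic. The function name and its parameters are hypothetical (they are not part of the sam2 package), and the 10ms / 1.5ms figures are just the rough worst-case numbers quoted above, not measurements. It also assumes each prompt is decoded one after another, which is the pessimistic case.

```python
def estimate_frame_time_ms(points_per_side, encoder_ms=10.0, decoder_ms_per_prompt=1.5):
    """Rough worst-case per-frame time: one image encode plus one decode per grid point.

    All timing constants are assumptions taken from the estimate above.
    """
    num_prompts = points_per_side ** 2  # the generator uses a square grid of points
    return encoder_ms + num_prompts * decoder_ms_per_prompt


# Compare a few grid sizes; 32 reproduces the 1024-prompt case worked out above.
for side in (32, 16, 8, 4):
    print(f"points_per_side={side:2d}: ~{estimate_frame_time_ms(side):.0f} ms per frame")
```

Under these assumptions a 32x32 grid comes to about 1546ms, while a 4x4 grid comes to about 34ms per frame.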

That seems consistent with your timings, so most likely things are running as expected (you could maybe try compiling the model and/or adding the environment variable that the 'scaled_dot_product_attention' warnings mention to get a bit of a speed up). The only way to get a big speed up would be to use fewer prompt points, which can be adjusted when setting up the mask generator:

mask_generator = SAM2AutomaticMaskGenerator(sam2, points_per_side=4)

Using a very small number (like 4) should get it close to real-time speeds, but the trade-off is that it will likely miss segmentations compared to the default 32x32 grid.
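If you have a target frame time in mind, the same rough numbers can be turned around to suggest a grid size. This is only a sketch: `max_points_per_side` is a hypothetical helper, not a sam2 function, and the default timings are the approximate worst-case figures from the estimate above, so you would want to replace them with timings measured on your own setup.

```python
import math


def max_points_per_side(budget_ms, encoder_ms=10.0, decoder_ms_per_prompt=1.5):
    """Largest square grid size whose estimated time fits within budget_ms.

    Returns 0 if the budget does not even cover the encoder plus one prompt.
    The timing defaults are assumed values, not measurements.
    """
    remaining_ms = budget_ms - encoder_ms  # time left after encoding the image once
    if remaining_ms < decoder_ms_per_prompt:
        return 0
    max_prompts = int(remaining_ms // decoder_ms_per_prompt)
    return math.isqrt(max_prompts)  # floor of the square root gives the grid side


# Example: a 34 ms budget (roughly 30 frames per second) under the assumed timings.
side = max_points_per_side(34.0)
print(f"points_per_side={side}")
```

The resulting value could then be passed as points_per_side when constructing the mask generator, as in the line above.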

facebookresearch / sam2

Slow inference speed on the 3090 Ti #436