facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Slow inference speed on the 3090 Ti #436

Open chopin1998 opened 16 hours ago

chopin1998 commented 16 hours ago

I use SAM2AutomaticMaskGenerator for a real-time app, but it's too slow to use: over 1 second per frame, even with the tiny model.

Is there another generator better suited to real-time use? build_sam2_video_predictor needs a path to an image list...

Thank you very much!

Here is my env:

Package                  Version       Editable project location
antlr4-python3-runtime   4.9.3
asttokens                2.4.1
comm                     0.2.2
contourpy                1.3.0
cycler                   0.12.1
debugpy                  1.8.7
decorator                5.1.1
einops                   0.8.0
exceptiongroup           1.2.2
executing                2.1.0
filelock                 3.16.1
flash-attn               2.6.3
fonttools                4.54.1
fsspec                   2024.10.0
hydra-core               1.3.2
iopath                   0.1.10
ipykernel                6.29.5
ipython                  8.29.0
jedi                     0.19.1
Jinja2                   3.1.4
jupyter_client           8.6.3
jupyter_core             5.7.2
kiwisolver               1.4.7
MarkupSafe               3.0.2
matplotlib               3.9.2
matplotlib-inline        0.1.7
mpmath                   1.3.0
nest-asyncio             1.6.0
networkx                 3.4.2
numpy                    2.1.2
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
omegaconf                2.3.0
opencv-python            4.10.0.84
packaging                24.1
parso                    0.8.4
pexpect                  4.9.0
pillow                   11.0.0
pip                      22.0.2
platformdirs             4.3.6
portalocker              2.10.1
prompt_toolkit           3.0.48
psutil                   6.1.0
ptyprocess               0.7.0
pure_eval                0.2.3
Pygments                 2.18.0
pyparsing                3.2.0
python-dateutil          2.9.0.post0
PyYAML                   6.0.2
pyzmq                    26.2.0
SAM-2                    1.0           /home/marco/Build/sam2
setuptools               59.6.0
six                      1.16.0
stack-data               0.6.3
sympy                    1.13.1
torch                    2.3.1
torchvision              0.18.1
tornado                  6.4.1
tqdm                     4.66.6
traitlets                5.14.3
triton                   2.3.1
typing_extensions        4.12.2
wcwidth                  0.2.13
wheel                    0.44.0

And my sample code is roughly like the sketch below (original attached as a screenshot):
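This is only a minimal reconstruction, assuming the standard tiny-model config and checkpoint names; my exact paths may differ:

import cv2
import torch
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Tiny model (standard names; local paths may differ)
sam2 = build_sam2("sam2_hiera_t.yaml", "checkpoints/sam2_hiera_tiny.pt", device="cuda")
mask_generator = SAM2AutomaticMaskGenerator(sam2)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        masks = mask_generator.generate(rgb)  # this is the slow call, >1 s per frame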

Run result (screenshot attached).

heyoeyo commented 8 hours ago

That card should be able to run the (tiny) image encoder in 5-10 ms and the prompt/mask decoder in around 1-1.5 ms. The auto mask generator encodes the input image once and then runs a 32x32 grid of point prompts (by default), so a total of 1024 prompts are needed. Worst case, that gives a per-frame run time of around:

Total inference time = 1*(10ms per frame) + 1024*(1.5ms per prompt) = 1546ms

That seems consistent with your timings, so most likely things are running as expected. You could maybe try compiling the model and/or setting the environment variable mentioned in the 'scaled_dot_product_attention' warnings to get a bit of a speed-up. The only way to get a big speed-up is to use fewer prompt points, which can be adjusted when setting up the mask generator:

mask_generator = SAM2AutomaticMaskGenerator(sam2, points_per_side=4)

Using a very small number (like 4) should get it close to real-time speeds, but the trade-off is that it will likely miss segmentations compared to the default 32x32 grid.
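For example, a rough end-to-end sketch (the config/checkpoint names assume the standard tiny model, and the torch.compile call is optional and untested on my end):

import torch
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Standard tiny-model names; adjust to your local paths
sam2 = build_sam2("sam2_hiera_t.yaml", "checkpoints/sam2_hiera_tiny.pt", device="cuda")

# Optional: compiling the image encoder may shave off a few more ms
# (the first call will be slow while compilation happens)
sam2.image_encoder = torch.compile(sam2.image_encoder)

# 4x4 = 16 prompts instead of the default 32x32 = 1024
mask_generator = SAM2AutomaticMaskGenerator(sam2, points_per_side=4, points_per_batch=16)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    masks = mask_generator.generate(rgb_frame)  # rgb_frame: HxWx3 uint8 RGB image

With 16 prompts, the same back-of-the-envelope estimate gives roughly 10 ms + 16*(1.5 ms) ≈ 34 ms per frame, which is close to real-time.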