facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Achieving Higher FPS with Multiple Object Tracking #367

Open daniaFrenel opened 1 week ago

daniaFrenel commented 1 week ago

Hey, I am using SAM2 to analyze videos recorded at 30 FPS while tracking around 16 objects, but I am only achieving a tracking speed of 2 FPS (propagating every 4 frames). I am interested in understanding the factors that affect this performance, for example the number of frames loaded for the video predictor.

Could you provide insights on optimizations or adjustments that might help improve performance? For example, would loading object IDs directly into tensors enhance processing speed? Any guidance on potential changes to reach my desired FPS would be greatly appreciated.

Dania

heyoeyo commented 6 days ago

There are a few changes that can be made to speed things up, but they'll generally come at the cost of accuracy. The time required per frame is (roughly) something like:

time per frame = E + M*n

Where:
  E is the image encoding time
  M is the masking + memory encoding + memory attention time
  n is the number of objects being tracked
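As a quick sanity check, the model above can be turned into an FPS estimate. The timing values below are made up for illustration (not measurements), but they show how per-object cost dominates once many objects are tracked:

```python
def estimated_fps(encode_s, per_object_s, num_objects):
    """Estimate throughput from the rough timing model: time per frame = E + M*n."""
    frame_time = encode_s + per_object_s * num_objects
    return 1.0 / frame_time

# Hypothetical numbers: 100 ms to encode the image, 25 ms per tracked object.
# With 16 objects: 0.1 + 0.025 * 16 = 0.5 s/frame, i.e. about 2 FPS.
print(round(estimated_fps(0.100, 0.025, 16), 2))
```

With numbers in this ballpark, halving the number of tracked objects or the per-object cost M helps far more than a faster image encoder.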

The image encoding time can be decreased by switching to smaller models (e.g. using the tiny model) as well as running at a lower image resolution (see issue #257). Both of these changes can reduce segmentation quality/accuracy though. The time required to load the image could also be considered part of this timing and could be reduced by loading images in parallel to running the model itself, though it should be a relatively small part of the total time either way.
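For the parallel-loading idea, a minimal sketch is to decode frames on a background thread while the model works on the current one. `load_frame` and the usage below are stand-ins, not SAM 2's actual API:

```python
import queue
import threading

def prefetch_frames(frame_paths, load_frame, max_ahead=4):
    """Yield frames loaded on a background thread so decoding overlaps inference."""
    q = queue.Queue(maxsize=max_ahead)
    sentinel = object()

    def worker():
        for path in frame_paths:
            q.put(load_frame(path))  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        frame = q.get()
        if frame is sentinel:
            break
        yield frame

# Usage sketch: replace the lambda with real image decoding, and consume the
# generator inside the inference loop.
for frame in prefetch_frames(["f0.jpg", "f1.jpg"], load_frame=lambda p: p.upper()):
    print(frame)
```

The bounded queue keeps at most `max_ahead` decoded frames in memory, so the loader can run ahead of the model without unbounded RAM growth.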

The masking/memory time can be decreased by using fewer previous frames in the memory attention step, as well as by using a lower image resolution. Again, these changes can reduce the quality of the outputs. Unfortunately, using fewer memory frames requires changes to the code; if you want to try it, a simple hack is to edit line 539 in sam2_base.py:

num_prev_frames = 1  # Values between 0 and 6 are valid
for t_pos in range(1, 1 + num_prev_frames): #self.num_maskmem):

It's very situational, but in some scenes it might also be possible to decrease the number of objects by using a prompt that masks several objects together in a single mask, though you would have to separate the results after the fact.
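Separating a combined mask afterwards could be done with standard connected-component labeling, assuming the objects don't touch. This is a generic sketch in pure Python (libraries like scipy offer the same thing), not part of SAM 2:

```python
from collections import deque

def split_mask(mask):
    """Split a binary mask (list of 0/1 rows) into per-object masks
    using a 4-connected flood fill."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                comp = [[0] * w for _ in range(h)]
                stack = deque([(sy, sx)])
                seen[sy][sx] = True
                while stack:
                    y, x = stack.pop()
                    comp[y][x] = 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                components.append(comp)
    return components

combined = [[1, 1, 0, 0],
            [0, 0, 0, 1],
            [0, 0, 1, 1]]
print(len(split_mask(combined)))  # two separate blobs -> 2
```

This only works when the grouped objects are spatially disjoint; overlapping or touching objects would need a different separation strategy.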

haithamkhedr commented 3 days ago

Hi @daniaFrenel, you can also try setting compile_image_encoder: True in the model config to increase the inference speed.
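For reference, this flag lives in the model's YAML config; the exact placement below is based on the released configs, so check your own config file for where the field sits:

```yaml
model:
  # ... other model fields unchanged ...
  compile_image_encoder: True
```

Note that the first few frames after enabling compilation are typically slower while the compiled encoder warms up, so measure steady-state FPS.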