facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Maximum frames when running inference offline #10

Open · lucasjinreal opened this issue 3 months ago

lucasjinreal commented 3 months ago

The input resolution is 1024. Roughly how many frames can it handle at one time, considering GPU memory usage and speed?

xxxxyliu commented 3 months ago

I want to change the resolution from 1024 to 768, but I encountered a size mismatch issue during inference.

heyoeyo commented 3 months ago

change the resolution from 1024 to 768

There is some info on how to do this in issue #138

Roughly how many frames can it handle at one time, considering GPU memory usage and speed?

Inside the video processing notebook, the section "Step 3: Propagate the prompts to get the masklet across the video" contains the code that actually runs predictions on every frame. There (inside the 'predictor.propagate_in_video' for loop) you can place the following line of code to print out the VRAM usage per frame as it's processing:

print("VRAM (MB):", torch.cuda.max_memory_allocated() // 1_000_000)

When I run this, it seems to use 1 MB per frame.
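For context, here's roughly where that print statement goes, a minimal sketch based on the notebook's Step 3 loop (predictor and inference_state are the notebook's own variables; the only additions are the import and the print):

import torch

# Per-frame propagation loop from the notebook, with the VRAM readout added
for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
    print("VRAM (MB):", torch.cuda.max_memory_allocated() // 1_000_000)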

However, there is a potentially very large amount of memory needed to initially load all the images (in the predictor.init_state step) before processing even begins, though there are settings that can help reduce this or offload it to system RAM.
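For reference, here's a minimal sketch of how those settings can be passed when loading the video, assuming the offload_video_to_cpu / offload_state_to_cpu arguments that init_state currently exposes (the config and checkpoint names are placeholders, swap in whichever model you downloaded):

from sam2.build_sam import build_sam2_video_predictor

# Config / checkpoint names are assumptions; use the ones matching your download
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

# Keep the decoded frames (and optionally the inference state) in system RAM
# instead of VRAM, trading some speed for a much smaller GPU footprint
inference_state = predictor.init_state(
    video_path="./videos/bedroom",  # folder of JPEG frames, as in the notebook
    offload_video_to_cpu=True,
    offload_state_to_cpu=True,
)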

xxxxyliu commented 3 months ago

Thank you for your assistance. When running inference on the MOSE dataset, I found that the result reported in the paper is 79, while in the README it is 77 with compile_image_encoder: True. However, the result I obtained from inference is not as high. Moreover, the result with compile_image_encoder: True is even lower than when it is set to False.

heyoeyo commented 3 months ago

result reported in the paper is 79, while in the README it is 77

Ya, that's a bit confusing. Assuming you're referring to the Model Description table in the README, it looks like that comes from Table 17 (d) (page 30) of the paper. There they list the large model twice, with scores of 77.2 and 74.6 for two variants of the model, so it seems like there may be multiple versions of each size, and maybe that accounts for the differences in scoring? I'm not really sure though.

xxxxyliu commented 3 months ago

Thank you for your response. I will take another careful look at it.