Open: lucasjinreal opened this issue 3 months ago
I want to change the resolution from 1024 to 768, but I encountered a size mismatch issue during inference.
change the resolution from 1024 to 768
There is some info on how to do this in issue #138
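For reference, here's a rough sketch of one way to try it, assuming the 'build_sam2_video_predictor' builder from the SAM 2 repo and a Hydra override of the model's 'image_size'. The override key and the config/checkpoint names here are assumptions, and a mismatch like this is likely where the size-mismatch error comes from, so issue #138 is the authoritative reference for the full set of changes:

from sam2.build_sam import build_sam2_video_predictor

# Hypothetical example: override the model's input resolution at build time.
# "model.image_size" is an assumed config key and the file names are placeholders.
predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml",
    "./checkpoints/sam2_hiera_large.pt",
    hydra_overrides_extra=["++model.image_size=768"],
)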
roughly how many frames can it handle at one time, considering the GPU memory usage and speed?
Inside the video processing notebook, the section 'Step 3: Propagate the prompts to get the masklet across the video' contains the code that actually runs predictions on every frame. There (inside the 'predictor.propagate_in_video' for loop) you can place the following line of code to print out the VRAM usage per frame as it's processing:
print("VRAM (MB):", torch.cuda.max_memory_allocated() // 1_000_000)
When I run this, it seems to use 1 MB per frame.
However, a potentially very large amount of memory is needed to initially load all the images (in the 'predictor.init_state' step) before processing even begins, though there are settings that can help reduce this or offload the memory to system RAM.
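For reference, here's a rough sketch of how that print line fits into the full video workflow, assuming the SAM 2 video predictor API. The config, checkpoint, and video paths are placeholders, and the offload flags may not exist (or may be named differently) in every version:

import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

# init_state pre-loads every frame up front; the offload flags (if present in
# your version) keep the frames/inference state in system RAM instead of VRAM.
inference_state = predictor.init_state(
    video_path="./videos/example_frames",   # placeholder path
    offload_video_to_cpu=True,
    offload_state_to_cpu=True,
)

# ... add point/box prompts here (e.g. predictor.add_new_points_or_box) ...

torch.cuda.reset_peak_memory_stats()
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
    # Peak VRAM used so far, printed once per processed frame
    print("VRAM (MB):", torch.cuda.max_memory_allocated() // 1_000_000)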
Thank you for your assistance. When running inference on the MOSE dataset, I found that the result reported in the paper is 79, while in the README it is 77 with 'compile_image_encoder: True'. However, the result I obtained from inference is not as high. Moreover, the result with 'compile_image_encoder: True' is even lower than when it is set to false.
result reported in the paper is 79, while in the README it is 77
Yeah, that's a bit confusing. Assuming you're referring to the Model Description table in the README, it looks like that comes from Table 17 (d) (page 30) of the paper. There they list the large model twice, with scores of 77.2 and 74.6 for two variants of the model. So it seems like there may be multiple versions of each size, and maybe that accounts for the differences in scoring? I'm not really sure though.
Thank you for your response. I will take another careful look at it.
The input resolution is 1024; roughly how many frames can it handle at one time, considering the GPU memory usage and speed?