Junyi42 / monst3r

Official Implementation of paper "MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion"
https://monst3r-project.github.io/

How to increase inference speed? #44

Open kszpxxzmc opened 2 weeks ago

kszpxxzmc commented 2 weeks ago

Thanks for your nice work! I have a question about inference speed. The paper reports that MonST3R's inference time on an A6000 is about 90 seconds, but when I ran a test with 94 images on an A100, the whole process took more than an hour. Why is it so slow, and how can I improve inference speed, even at the cost of extra video memory?

Junyi42 commented 2 weeks ago

Hi @kszpxxzmc,

Thanks for the feedback. As far as I can tell, most of the latency comes from the initialization of the dynamic mask (here). Since this step runs on the CPU, it can vary greatly across hardware. One simple option is to turn off the flow loss during optimization (by adding --flow_loss_weight=0.0), though this may degrade performance. You could also use masks from the SAM2 model for the motion-mask initialization by passing the SAM2 masks to self.dynamic_masks.
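
For illustration, a minimal sketch of that idea, assuming per-frame binary masks exported from SAM2 already exist on disk and that the global aligner object (called `scene` in demo.py) exposes the `dynamic_masks` attribute mentioned above; the directory layout and the helper name are hypothetical:

```python
import glob
import numpy as np
import torch
from PIL import Image

def load_precomputed_masks(mask_dir, device="cuda"):
    """Load per-frame binary masks (e.g., PNGs exported from SAM2) as bool tensors."""
    masks = []
    for path in sorted(glob.glob(f"{mask_dir}/*.png")):
        mask = np.array(Image.open(path).convert("L")) > 127  # binarize grayscale mask
        masks.append(torch.from_numpy(mask).to(device))
    return masks

# After building the global aligner (the `scene` object in demo.py), replace the
# flow-based motion-mask initialization with the precomputed SAM2 masks:
# scene.dynamic_masks = load_precomputed_masks("path/to/sam2_masks")
```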

I also noticed that your feed-forward inference latency (5:37 for 890 pairs) is unusual. In my experience (and from reports by other users, e.g., https://github.com/Junyi42/monst3r/issues/10#issue-2603031409), this step should take less than one minute. You could try setting a larger batch size in demo.py. Hope this helps!
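
As a rough sketch of that change, assuming MonST3R keeps DUSt3R's `dust3r.inference.inference` signature (the `pairs` and `model` variables stand in for the objects built earlier in demo.py):

```python
from dust3r.inference import inference

# `pairs` (the image pairs) and `model` (the loaded MonST3R checkpoint) are the
# objects already built earlier in demo.py; only the feed-forward call changes.
# A larger batch_size trades GPU memory for throughput; on an A100, 8 or 16 is
# a reasonable starting point.
output = inference(pairs, model, device="cuda", batch_size=16, verbose=True)
```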

Best.

huddyyeo commented 2 weeks ago

Thanks for your help on making it faster! Could you comment on why you used SAM2 to refine the mask, rather than simply to initialize it? Is it better that way?

Junyi42 commented 2 weeks ago

> Thanks for your help on making it faster! Could you comment on why you used SAM2 to refine the mask, rather than simply to initialize it? Is it better that way?

Hi @huddyyeo,

SAM2 requires a prompt as input (point / box / mask), and we use our initialized mask as the prompt for SAM2 to refine. You could certainly use a "click" prompt to get a SAM2 mask for initialization, though that would not be fully automated. Another option is to use an off-the-shelf motion segmentation method (e.g., https://github.com/TonyLianLong/RCF-UnsupVideoSeg) to get the initial mask, or even to use its output as the prompt for SAM2.
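
For reference, a rough sketch of the click-prompt route, loosely following SAM2's video predictor usage (the checkpoint/config paths, frame directory, and click coordinate are placeholders, and the exact API may differ across SAM2 versions):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Build the SAM2 video predictor (config and checkpoint paths are placeholders).
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="path/to/frames_dir")

    # One user click on the moving object in the first frame (x, y) with a
    # positive label; this is the manual, non-automated part mentioned above.
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[480, 270]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the click through the whole video to get per-frame masks, which
    # could then serve as the motion-mask initialization discussed above.
    per_frame_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        per_frame_masks[frame_idx] = (mask_logits[0] > 0).squeeze().cpu().numpy()
```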

Thanks.

huddyyeo commented 2 weeks ago

Thanks @Junyi42 for the quick reply 🙏 Just to clarify, what did you mean by passing the SAM2 mask to self.dynamic_masks here, since we cannot simply initialize the mask via SAM2?

> You could also use masks from the SAM2 model for the motion-mask initialization by passing the SAM2 masks to self.dynamic_masks.

Junyi42 commented 2 weeks ago

> Thanks @Junyi42 for the quick reply 🙏 Just to clarify, what did you mean by passing the SAM2 mask to self.dynamic_masks here, since we cannot simply initialize the mask via SAM2?

> > You could also use masks from the SAM2 model for the motion-mask initialization by passing the SAM2 masks to self.dynamic_masks.

Hi @huddyyeo,

Sorry for the confusion. What I meant is that if you already have a better motion segmentation mask (e.g., from a "click" prompt to SAM2 or from an off-the-shelf motion segmentation method), you can load that segmentation mask into the self.dynamic_masks variable. Thanks.
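
One practical detail to keep in mind, stated here as an assumption about the standard pipeline: since the demo resizes the input images before optimization, externally produced masks may need to be resized to the optimizer's working resolution before being assigned to self.dynamic_masks. A minimal sketch of such a resize:

```python
import torch
import torch.nn.functional as F

def resize_mask(mask, target_hw):
    """Nearest-neighbor resize of an (H, W) bool mask to the working resolution."""
    m = mask.float()[None, None]                      # -> (1, 1, H, W) for interpolate
    m = F.interpolate(m, size=target_hw, mode="nearest")
    return m[0, 0].bool()

# e.g., assuming `sam2_masks` and the resized `images` are already available:
# scene.dynamic_masks = [resize_mask(m, img.shape[:2]) for m, img in zip(sam2_masks, images)]
```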