Junyi42 / monst3r

Official Implementation of paper "MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion"
https://monst3r-project.github.io/

About the background of demo #3

Open wdkkkk opened 6 days ago

wdkkkk commented 6 days ago

Hi, thanks for sharing your awesome work. Is the shape of the global point cloud $\hat{X}$ equal to $H \times W \times 3 \times T$, and do you render only $\hat{X}_t$ to obtain the frame at timestamp $t$? If so, I would expect the background to change across timestamps according to the camera's visible area (i.e., areas not visible in input frame $t$ would not appear in the rendered result for frame $t$). But in the demo on your website, the background appears unchanged across timestamps. Do you handle the foreground and background differently?

Junyi42 commented 5 days ago

Hi @wdkkkk,

Thank you for your interest in our work!

Yes, we handle the foreground and background differently in visualization. For the background, we visualize the overlapping point clouds across the entire sequence to provide a consistent view. For the foreground, we visualize only the point cloud at the corresponding timestamp, so that it changes over time.

The fg/bg masks can be derived from the ground-truth mask or taken from the motion mask produced by our method. Generating the mask with SAM2 is also a feasible option.
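For readers who want to reproduce this kind of visualization, here is a minimal NumPy sketch of the idea described above: static points are pooled from every frame, while dynamic points come only from frame `t`. This is not the repository's actual visualization code. The function name `compose_frame_points`, the `(T, H, W, 3)` array layout, and the convention that `True` marks a moving pixel are all assumptions made for illustration.

```python
import numpy as np


def compose_frame_points(points, dynamic_masks, t):
    """Build the point set to display at timestamp t (hypothetical helper).

    points:        (T, H, W, 3) per-frame point maps in one global frame (assumed layout).
    dynamic_masks: (T, H, W) boolean, True where a pixel is moving foreground (assumed convention).
    """
    points = np.asarray(points, dtype=np.float64)
    dynamic_masks = np.asarray(dynamic_masks, dtype=bool)

    # Background: static pixels from every frame, so the scene does not
    # shrink to what the camera happens to see at time t.
    background = points[~dynamic_masks]        # shape (N_bg, 3)

    # Foreground: moving pixels from frame t only, so they change over time.
    foreground = points[t][dynamic_masks[t]]   # shape (N_fg, 3)

    return np.concatenate([background, foreground], axis=0)
```

Calling `compose_frame_points(points, masks, t)` once per timestamp yields a background that stays fixed across the sequence and a foreground that updates with `t`, which matches the behavior seen in the demo.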

Best.

littlepure2333 commented 5 days ago

Hi @Junyi42, thanks for your great work! So, according to your explanation, do all the demo videos use the motion mask from the method, or do some of them use the GT mask?

Junyi42 commented 4 days ago

Hi @littlepure2333,

Thanks! We use the GT mask for the joint dense reconstruction and pose estimation, for a fair comparison with prior work. The motion mask extracted by our method can occasionally be noisy. A SAM2 mask (prompted from MonST3R's motion mask or by a simple click) should be of similar quality to the GT mask.
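As a rough illustration of prompting from a motion mask, the sketch below turns a (possibly noisy) binary mask into a single click location by picking the mask pixel closest to the mask centroid. The helper name `click_from_mask` is hypothetical, the selection rule is just one simple choice, and the call into SAM2 itself is deliberately left out; check the SAM2 documentation for the prompt format it expects.

```python
import numpy as np


def click_from_mask(motion_mask):
    """Return an (x, y) pixel inside the mask, near its centroid (hypothetical helper).

    motion_mask: (H, W) array, nonzero where motion was detected.
    Returns None if the mask is empty.
    """
    ys, xs = np.nonzero(motion_mask)
    if ys.size == 0:
        return None

    # The centroid may fall outside a concave mask, so snap to the
    # nearest pixel that actually belongs to the mask.
    cy, cx = ys.mean(), xs.mean()
    nearest = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
    return int(xs[nearest]), int(ys[nearest])
```

The returned coordinate could then serve as a positive point prompt. With several moving objects, a per-object click (for example, one per connected component) would likely work better than a single centroid.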

Best