VinAIResearch / Open3DIS

Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance (CVPR 2024)
https://open3dis.github.io/
Apache License 2.0

Inconsistent number of frames in the given validation scannet200 scene and rendered scene. #15

Closed Yebulabula closed 5 months ago

Yebulabula commented 5 months ago

Dear author,

Sorry to bother you again. I am writing to seek clarification about the frame count in the validation ScanNet200 example, which totals 475 frames. Following the standard ScanNet rendering process in "https://github.com/ScanNet/ScanNet/tree/master/SensReader/python", I generated RGB-D images from a .sens file, but ended up with a significantly higher frame count. Could you explain the discrepancy?

Thank you for your assistance.

Best regards, Ye Mao

PhucNDA commented 5 months ago

Hi @Yebulabula, we sample RGB-D frames at an interval of 5: a result is recorded and appended to the self.frames list only on every 5th invocation of this class constructor (see the link in the original comment).
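A minimal sketch of this kind of interval sampling, in case it helps reproduce the frame count. `FrameSampler`, `offer`, and `expected_kept` are hypothetical names, not the SensReader or Open3DIS code, and the sketch assumes sampling starts at frame 0:

```python
from typing import List


class FrameSampler:
    """Keeps every `interval`-th frame it is offered (hypothetical helper)."""

    def __init__(self, interval: int = 5) -> None:
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.interval = interval
        self.frames: List[int] = []  # indices of the frames that were kept
        self._seen = 0               # number of frames offered so far

    def offer(self, frame_index: int) -> None:
        # Keep frames 0, interval, 2 * interval, ... and skip the rest.
        if self._seen % self.interval == 0:
            self.frames.append(frame_index)
        self._seen += 1


def expected_kept(total_frames: int, interval: int = 5) -> int:
    """Frames kept when sampling every `interval`-th frame starting at index 0."""
    return (total_frames + interval - 1) // interval


sampler = FrameSampler(interval=5)
for i in range(12):
    sampler.offer(i)
print(sampler.frames)  # indices kept out of 12 offered frames
```

Under these assumptions, a scene exported at the full frame rate would contain roughly five times as many frames as the subsampled one, which matches the kind of discrepancy described above.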

Best.

Yebulabula commented 5 months ago

Thanks for your quick reply; it is really helpful. Another question: when I perform promptable segmentation, what is the most effective way to filter out unnecessary 3D proposals? I found that Open3DIS normally generates many masks for a single text prompt, but in the end we need only one. I tried selecting the 3D proposal with the highest CLIP confidence as the final mask, but this confidence value is not always reliable. Do you have any advice? Thanks.

Yebulabula commented 5 months ago

Additionally, I am confused about promptable segmentation: do we really need CLIP for the class label assignment? Why not merge multiple 2D masks into a single 3D proposal and use it as the final segmentation result?

PhucNDA commented 5 months ago

Hi @Yebulabula,

A1: For post-processing, you can explore the various NMS algorithms in ISBNet, the filtering techniques in OVIR-3D, DBSCAN in Segment3D, and many other heuristic algorithms.
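As one illustration of the NMS family mentioned above, here is a minimal sketch of greedy mask NMS over 3D proposals. It is a generic version, not the ISBNet implementation; each proposal is assumed to be a set of point indices, and `mask_iou` and `mask_nms` are hypothetical names:

```python
from typing import List, Sequence, Set


def mask_iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over union of two masks given as sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mask_nms(
    masks: Sequence[Set[int]],
    scores: Sequence[float],
    iou_threshold: float = 0.5,
) -> List[int]:
    """Greedy NMS: visit proposals by descending score and keep one only if
    its IoU with every already-kept proposal is at most `iou_threshold`."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Toy proposals: the first two overlap heavily, the third is separate.
masks = [{0, 1, 2, 3}, {1, 2, 3, 4}, {10, 11}]
scores = [0.9, 0.8, 0.7]
print(mask_nms(masks, scores, iou_threshold=0.5))
```

The threshold of 0.5 is only a placeholder; a suitable value depends on how much the proposals for one object typically overlap.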

A2: Certainly, you can use the lifted 2D masks from the 2D segmenter as the final result for promptable segmentation. However, 2D masks produced by the 2D segmenter are typically noisy (as can be seen by enabling this script), which can lead to unreliable results. For instance, when querying "Hoverboard" in a 3D scene, the 2D segmenter may incorrectly segment a chair in certain views, resulting in an inaccurate final result (shown in green). In contrast, our method incrementally refines CLIP features to filter out false predictions, as shown in red. This CLIP-feature-based approach was developed by our VinAI-3DIS team (OpenSUN3D).
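To illustrate why aggregating features over views can suppress a single bad view, here is a toy sketch. It is not the Open3DIS implementation: the running mean, the threshold, and the names `aggregate_views` and `filter_proposals` are assumptions, and the short vectors stand in for real CLIP embeddings:

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    ua, ub = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


def aggregate_views(view_features: Sequence[Sequence[float]]) -> List[float]:
    """Running mean of unit-normalized per-view features for one proposal."""
    if not view_features:
        raise ValueError("a proposal needs at least one view feature")
    mean = normalize(view_features[0])
    for n, feat in enumerate(view_features[1:], start=2):
        unit = normalize(feat)
        # Incremental mean update: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
        mean = [m + (u - m) / n for m, u in zip(mean, unit)]
    return mean


def filter_proposals(
    per_proposal_views: Sequence[Sequence[Sequence[float]]],
    text_feature: Sequence[float],
    threshold: float,
) -> List[int]:
    """Indices of proposals whose aggregated feature matches the text query."""
    return [
        idx
        for idx, views in enumerate(per_proposal_views)
        if cosine(aggregate_views(views), text_feature) >= threshold
    ]


text = [1.0, 0.0]  # toy "text embedding" for the query
proposals = [
    [[1.0, 0.0], [0.9, 0.1]],              # consistent with the query in all views
    [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],  # matches the query in only one view
]
print(filter_proposals(proposals, text, threshold=0.8))
```

In this toy setup the second proposal matches the query in one view, but its aggregated feature falls below the threshold, so it is dropped rather than surviving on that single view.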

(Three images attached in the original comment.)

If you have any questions, feel free to let me know.

Best.

PhucNDA commented 5 months ago

If you have any questions, feel free to reopen the issue.