Closed Yebulabula closed 5 months ago
Hi @Yebulabula, we capture RGB-D frames at an interval of 5, meaning we record and append the results to the self.frames list once every 5 invocations of this class constructor: here
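As a rough illustration of this interval-based sampling (a minimal sketch; the class name `FrameBuffer` and its methods are hypothetical placeholders, not the actual Open3DIS code):

```python
class FrameBuffer:
    """Minimal sketch of interval-based RGB-D frame sampling.

    Only every `interval`-th constructed frame is kept, so a sequence
    of N raw frames yields roughly N / interval stored frames.
    """

    def __init__(self, interval=5):
        self.interval = interval
        self.count = 0    # total invocations seen so far
        self.frames = []  # frames actually kept

    def add(self, rgb, depth):
        # Keep the frame only on every `interval`-th invocation.
        if self.count % self.interval == 0:
            self.frames.append((rgb, depth))
        self.count += 1


buf = FrameBuffer(interval=5)
for i in range(2375):      # e.g. 2375 raw frames exported from a .sens file
    buf.add(rgb=i, depth=i)
print(len(buf.frames))     # 475 frames kept
```

This is also why exporting all frames from a .sens file with the ScanNet SensReader gives a much larger count than the subsampled sequence used here.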
Best.
Thanks for your quick reply; it is really helpful. Another question: when I perform promptable segmentation, what is the most effective way to filter out unnecessary 3D proposals? I found that Open3DIS normally generates many masks for a single text prompt, but ultimately we only need one. I tried selecting the 3D proposal with the highest CLIP confidence as the final mask, but this confidence value is not always reliable. Do you have any advice? Thanks.
Additionally, I am confused about promptable segmentation: do we really need CLIP for the further class-label assignment? Why not merge multiple 2D masks into a single 3D proposal and use that as the final segmentation result?
Hi @Yebulabula,
A1: For post-processing, you can explore the various NMS algorithms in ISBNet, the filtering techniques in OVIR-3D, the DBSCAN step in Segment3D, and many other heuristic algorithms.
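As a sketch of one such heuristic, here is a simple greedy mask-IoU NMS over 3D proposals. Representing each proposal as a binary mask over the scene's points plus a confidence score is an assumption for illustration, not the exact interface of any of the repos above:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two binary point masks (1-D boolean arrays)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def nms_3d(masks, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring proposal, drop overlaps.

    masks  : list of boolean arrays over the scene's points (assumed format)
    scores : per-proposal confidence (e.g. CLIP similarity)
    Returns indices of the surviving proposals, best first.
    """
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for i in order:
        # A proposal survives only if it overlaps no already-kept proposal.
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

# Usage: two heavily overlapping proposals and one disjoint proposal.
m1 = np.array([1, 1, 1, 0, 0], dtype=bool)
m2 = np.array([1, 1, 0, 0, 0], dtype=bool)  # IoU with m1 = 2/3, suppressed
m3 = np.array([0, 0, 0, 1, 1], dtype=bool)  # disjoint, kept
print(nms_3d([m1, m2, m3], np.array([0.9, 0.8, 0.7])))
```

For a single text prompt you could then simply take the first surviving proposal as the final mask.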
A2: Certainly, you can use the lifted 2D masks from the 2Dsegmenter as the final result for promptable segmentation. However, 2D masks derived from the 2Dsegmenter are typically noisy (as can be seen by enabling this script), which can lead to unreliable results. For instance, when querying "Hoverboard" in a 3D scene, the 2Dsegmenter may incorrectly segment a chair in certain views, resulting in an inaccurate final result (the green version). Conversely, our method incrementally refines CLIP features to filter out such false predictions, as demonstrated in the red version. This CLIP-feature-based approach was developed by our VinAI-3DIS team for OpenSUN3D.
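To illustrate why aggregating CLIP features across views filters out single-view mistakes, here is a toy multi-view filter. This is NOT the actual OpenSUN3D implementation; the data layout and threshold are assumptions for the sketch:

```python
import numpy as np

def filter_by_multiview_clip(view_feats, text_feat, sim_thresh=0.25):
    """Toy multi-view CLIP filtering (illustrative only).

    view_feats : dict {proposal_id: (n_views, d) array of per-view CLIP
                 image features for that proposal} (assumed format)
    text_feat  : (d,) L2-normalized CLIP text embedding of the query
    A proposal survives only if its cosine similarity to the text,
    averaged over all views that observed it, exceeds the threshold.
    A chair that matches "Hoverboard" in one noisy view is thus voted
    down by the other views.
    """
    kept = []
    for pid, feats in view_feats.items():
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sims = feats @ text_feat  # cosine similarity per view
        if sims.mean() > sim_thresh:
            kept.append(pid)
    return kept

# Usage with synthetic 2-D features; the query direction is [1, 0].
text = np.array([1.0, 0.0])
views = {
    "A": np.array([[1.0, 0.0], [0.9, 0.1]]),  # consistently similar: kept
    "B": np.array([[0.0, 1.0], [0.1, 1.0]]),  # one weak spurious hit: dropped
}
print(filter_by_multiview_clip(views, text))
```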
If you have any questions, feel free to let me know.
Best.
If you have any questions, feel free to re-open the issue.
Dear author,
Sorry to bother you again. I am writing to seek clarification about the frame count in the validation ScanNet200 example, which totals 475 frames. Following the standard ScanNet rendering process in https://github.com/ScanNet/ScanNet/tree/master/SensReader/python, I generated RGB-D images from a .sens file but ended up with a significantly higher frame count. Could you explain this discrepancy?
Thank you for your assistance.
Best regards, Ye Mao