Pointcept / OpenIns3D

[ECCV'24] OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
MIT License

About integration with LISA #16

Closed xjj1999 closed 2 weeks ago

xjj1999 commented 2 weeks ago

The paper mentions that, when integrated with LISA, OpenIns3D can perform reasoning segmentation in 3D. I'm curious about this and would like to reproduce that part of the work, but I haven't found any relevant code. I'd also like to know whether the authors have tried any other LLM with mask-output capability.

ZheningHuang commented 2 weeks ago

Hi,

Thank you for your question.

We found that while LISA performs well for complex reasoning, it has some limitations in terms of robustness, as it tends to produce a large number of false positives and requires an exceptionally large pre-trained model to run effectively. Due to these factors, we decided not to integrate it into our baseline.

Instead, we opted to use YOLO-World, which has proven to be quite effective for handling complex queries. You can find the implementation here. This should be able to produce results similar to those shown in the teaser image.

Regarding your question about using other LLMs with output mask capabilities, I'm not entirely sure which model you're referring to. However, as far as I know, Segment3D could be a promising option for mask proposals, although their code has not yet been released. I would be very interested to learn about any other methods for mask proposals that have shown good results.

Thanks, Zhening

xjj1999 commented 2 weeks ago

Thanks for the reply! I have one more detail to confirm. You mentioned "a large number of false positives": does that apply only to intricate language queries? Does LISA still maintain decent performance on simple text?

ZheningHuang commented 2 weeks ago

Hi,

The false positive issue with LISA arises because we need to run it on many images of the scene. As you can imagine, many of these images do not contain the target object, yet LISA still produces some predictions (other detectors such as YOLO-World handle this better). However, in our tests, LISA performed quite well on images that do contain the object of interest.

One extra note: the false positive (FP) issue is also solvable. Our implementation incorporates a CLIP-based ranking and filtering design that effectively reduces FPs: https://github.com/Pointcept/OpenIns3D/blob/5e9ada0d2610fb610b67dcdcd5e1216f2b100dad/openins3d/build_lookup_yoloworld.py#L101.
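For readers who want a feel for the idea, here is a minimal sketch of similarity-based ranking and filtering. It is not the code at the link above: the function names, the detection dictionary layout, and the `min_score` and `top_k` values are all assumptions made for illustration, and the embeddings are assumed to have been computed beforehand (for example with a CLIP image and text encoder).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_and_filter(detections, text_embedding, min_score=0.25, top_k=5):
    """Keep the detections whose crop embedding best matches the query text.

    detections: list of dicts with a hypothetical layout
        {"view": str, "bbox": tuple, "embedding": list of floats}
    text_embedding: precomputed embedding of the text query.
    min_score, top_k: illustrative values, not taken from OpenIns3D.
    """
    scored = []
    for det in detections:
        score = cosine_similarity(det["embedding"], text_embedding)
        # Filtering: drop detections that match the query poorly.
        if score >= min_score:
            scored.append({**det, "score": score})
    # Ranking: best match first, then keep only the top_k entries.
    scored.sort(key=lambda d: d["score"], reverse=True)
    return scored[:top_k]
```

The intent is that detections coming from views that do not contain the target object tend to score low against the query and are discarded, which is one plausible way to suppress the per-image false positives described above.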

Best, Zhening

xjj1999 commented 2 weeks ago

Thanks for the reply! Based on your description, integrating an LLM might provide a nice improvement. I'm interested in integrating LISA and would like to reproduce it with your code. Could you please share the related code? Thank you again for the outstanding work.

ZheningHuang commented 2 weeks ago

Hi,

If you plan to use LISA as a detector, the first step is to familiarize yourself with the LISA codebase. Once LISA is functioning, you'll need to create a new file that uses LISA to build the lookup table, which involves converting its output into a list of bounding boxes (BBOX) and labels, similar to the format shown here. (Once LISA is up and running, I don't anticipate this being much additional work.)

However, as I mentioned earlier, we currently don't have a streamlined version of LISA that functions as a detector in this newly tidied-up codebase.
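The conversion step mentioned above (mask output to bounding boxes and labels) could be sketched roughly as follows. This is only an illustration under stated assumptions: the mask is taken to be a 2D nested list of 0/1 values rather than LISA's actual output tensor, and the function names and the per-view dictionary layout are hypothetical, since the exact format expected by the lookup-table code is not reproduced in this thread.

```python
def mask_to_bbox(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the nonzero pixels.

    mask: 2D list of rows containing 0/1 values (an assumed stand-in for
    a segmentation mask). Returns None when the mask is empty.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


def masks_to_detections(masks_per_view, label):
    """Turn one mask per view into a list of box/label entries per view.

    masks_per_view: dict mapping a view id to a 2D mask (hypothetical).
    label: the class name or query text associated with the masks.
    Views with an empty mask get an empty list, i.e. no detection.
    """
    detections = {}
    for view_id, mask in masks_per_view.items():
        bbox = mask_to_bbox(mask)
        if bbox is None:
            detections[view_id] = []
        else:
            detections[view_id] = [{"bbox": bbox, "label": label}]
    return detections
```

For example, `masks_to_detections({"view_0": mask}, "chair")` would return one box for `view_0` if the mask has any foreground pixels, and an empty list otherwise. The output would still need to be adapted to whatever structure the lookup-table builder actually consumes.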

Zhening