ZzZZCHS / Chat-Scene

Code for "Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers" (NeurIPS 2024)
MIT License

about table 3 #14

Closed iris0329 closed 7 months ago

iris0329 commented 9 months ago

Hi, thanks for providing this work.


  1. I noticed there is no discussion of Table 3 in the paper.
  2. I found that the Acc@0.25 on the ScanRefer dataset is 35.9, but I couldn't find this number on the ScanRefer official benchmark site.
  3. I wonder why the results do not seem to be compared against the ScanRefer baseline method.

Did I miss anything?

ZzZZCHS commented 9 months ago
  1. Table 3 aims to show that our method reaches SOTA among 3D LLM methods.
  2. We haven't uploaded the results to the official site, but we have released the entire training and evaluation pipeline (including the checkpoints) in this repository. You can try it out to check the performance.
  3. I think the main reason is the lack of data for 3D-LLM alignment. 2D LLMs owe much of their success to the large amounts of training data used for alignment. In addition, the intricate nature of 3D scenes calls for a more tailored design to learn spatial relationships effectively. In this work, we use a simple alignment architecture and only use data sourced from ScanNet, which is far from enough to train a robust 3D LLM. Future work is needed to address these problems.
iris0329 commented 9 months ago

Ah, I see. Thanks for your detailed reply and for open-sourcing the code; I have tried it.

I would still like to know one thing: the ScanRefer benchmark reports one set of numbers, but Table 3 shows different numbers for the same method (e.g., for ScanRefer itself).

I am confused about why this difference occurs when reading the table.

Would you share with me how to match the two tables?

ZzZZCHS commented 9 months ago

The ScanRefer benchmark is based on ScanRefer's test set, while the results in our Table 3 are on the validation set.

We use the results from ViL3DRel's Table 8. You can also find the same results (37.3/24.3) in InstanceRefer (Table 1) and MVT (Table 3).
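For readers unfamiliar with the metric discussed above: Acc@0.25 is commonly computed as the fraction of predictions whose 3D IoU with the ground-truth box is at least 0.25. Below is a minimal illustrative sketch assuming axis-aligned boxes in `(cx, cy, cz, dx, dy, dz)` format; the function names and box encoding are assumptions for illustration, not the repository's actual evaluation code.

```python
import numpy as np

def box_iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU between two boxes given as (cx, cy, cz, dx, dy, dz)."""
    a_min, a_max = box_a[:3] - box_a[3:] / 2, box_a[:3] + box_a[3:] / 2
    b_min, b_max = box_b[:3] - box_b[3:] / 2, box_b[:3] + box_b[3:] / 2
    # Overlap along each axis, clipped at zero when the boxes are disjoint.
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter_vol = inter.prod()
    vol_a, vol_b = box_a[3:].prod(), box_b[3:].prod()
    return inter_vol / (vol_a + vol_b - inter_vol)

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = [box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))
```

Since numbers like 37.3/24.3 (Acc@0.25/Acc@0.5) are just this accuracy at two thresholds, the same predictions are scored twice with `threshold=0.25` and `threshold=0.5`.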