BAAI-DCAI / SpatialBot

The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models".
MIT License

Discuss arXiv PDF #1


swearos commented 1 month ago

The new version of your arXiv PDF (July 16) makes a major change to the tables that seems to introduce some ambiguous descriptions. For example, Table 1 says: "LLM-RGB and LLM-RGBD are trained on RGB images only and tested with RGB and RGBD inputs." But the training dataset of the SpatialBot series models seems to contain RGBD data.

I also have a few questions:

  1. How is RGBD input provided to the Bunny model in the evaluation set?
  2. In the SpatialBench results, the Bunny models have depth-related scores; why does GPT-4o not have corresponding results?
  3. In the Table 1 results on SpatialBench, SpatialBot does not seem to gain a significant benefit from the depth map outside the depth dimension itself, e.g. in position and existence. Is this interpretation correct?

Very much looking forward to your reply, thank you!

RussRobin commented 1 month ago

Hi @swearos , thank you for your interest in SpatialBot.

(a) Names explained in Table 1. Sorry for the confusion in the paper:

  1. Bunny-Phi2-3B-RGB is trained on RGB only (Bunny 695k) and tested with RGB images.
  2. Bunny-Phi2-3B-RGBD is trained on RGB only (Bunny 695k), so it is the same model as Bunny-Phi2-3B-RGB, but it is tested on RGBD.
  3. SpatialBot-Phi2-3B-RGB is trained on SpatialQA, with RGB & RGBD images, and tested with RGB images only.
  4. SpatialBot-Phi2-3B-RGBD is trained on SpatialQA, with RGB & RGBD images, and tested with RGB-Depth images.

(b) RGBD evaluation: we prepare RGB and depth maps and feed them into the model as two image inputs. SpatialBot, and the Bunny implementation here, both support multi-image input. You may want to refer to the RGB/RGBD evaluation code, where we provide RGB/RGBD evaluation for SpatialBench, MME, and GQA.
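For illustration, a minimal sketch of what this looks like (`model.generate` and the `<image 1>`/`<image 2>` placeholders are stand-ins for the repo's actual interface, not its exact API):

```python
# Minimal sketch of RGBD evaluation with two image inputs.
# Assumes `model` is an already-loaded SpatialBot checkpoint; the
# generate() signature here is illustrative, not the repo's exact API.
from PIL import Image

rgb = Image.open("sample_rgb.png")       # ordinary RGB frame
depth = Image.open("sample_depth.png")   # depth map saved as an image

# Both images go into a single forward pass as two image inputs.
prompt = "<image 1>\n<image 2>\nWhat is the depth value of point <0.55,0.73>?"
answer = model.generate(images=[rgb, depth], prompt=prompt)

# RGB-only evaluation simply omits the depth image and its placeholder.
answer_rgb = model.generate(
    images=[rgb],
    prompt="<image 1>\nWhat is the depth value of point <0.55,0.73>?",
)
```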

(c) Why we do not report depth scores for GPT-4o: TL;DR, it performs poorly. We are not sure whether GPT-4o lacks the ability to do monocular depth estimation, or whether the prompt we use is not good. We are trying different prompts for GPT-4o and will consider releasing its depth scores in future versions. For a fair comparison, we do not report them for now.
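For reference, a query of roughly this shape via the standard OpenAI Python client (the prompt wording here is illustrative, not the one we actually tried):

```python
# Illustrative only: ask GPT-4o for a monocular depth estimate at a point.
# The prompt wording is an assumption, not the one used in our experiments.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("sample_rgb.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Estimate the depth (in millimeters) of the point at "
                     "normalized coordinates <0.55,0.73>. "
                     "Answer with a single number."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```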

(d) Can SpatialBot benefit from RGB-Depth, compared to RGB, in position, existence, etc.?

  1. Can the model benefit from RGBD in evaluation, compared to RGB? Scores of SpatialBot-RGBD are not significantly higher than SpatialBot-RGB in Table 1 and Table 2.
     - For Table 1: sorry, but I'm unable to answer this question now. In fact, apart from depth and proximity, we haven't found any questions that cannot be answered without depth info.
     - For Table 2: it seems that questions in general MLLM benchmarks can be answered with RGB alone (e.g. show the model an artwork and ask who drew it).
  2. Can the model benefit from RGBD, depth information, and depth-related QAs in training? See Table 2: definitely yes. BTW, in the near future we'll release much more powerful models, trained on SpatialQA's high-level QAs combined with Bunny 695k, to demonstrate that SpatialQA really helps.

Hope it makes sense. Feel free to reach out if you have further questions.

Regards

swearos commented 1 month ago

Thank you for your work and your patient response!

I have one more question about the SpatialBench dataset: how do you build the QA data for depth, given that the depth evaluation task is not published?

https://huggingface.co/datasets/RussRobin/SpatialBench/tree/main

RussRobin commented 1 month ago

In the SpatialBench HF dataset, we release annotated QAs for the high-level tasks (see Fig. 2 in the paper).

Depth and proximity QAs are not released. They are in almost the same format as SpatialQA, which will be released soon. For depth, we ask for the depth of points and of objects. For proximity, we ask the model to compare the depths of two points or two objects. A question simply goes like: "What is the depth value of point <0.55,0.73>?" or "What is the depth value of object: a basketball?"
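For a concrete picture, here is a sketch of how point-depth and proximity QAs of this form could be generated from a depth map (the helper names and the treatment of raw depth values are assumptions, not the released SpatialQA pipeline):

```python
# Hedged sketch: build point-depth and proximity QA pairs from a depth map.
# Helper names and the raw-value convention are assumptions, not the
# released SpatialQA pipeline.
import numpy as np

def depth_qa(depth_map: np.ndarray, x: float, y: float):
    """Build a point-depth question/answer pair from normalized coords."""
    h, w = depth_map.shape
    value = int(depth_map[int(y * h), int(x * w)])  # raw depth at the point
    question = f"What is the depth value of point <{x:.2f},{y:.2f}>?"
    return question, str(value)

def proximity_qa(depth_map: np.ndarray, p1, p2):
    """Ask which of two normalized points is closer to the camera."""
    h, w = depth_map.shape
    d1, d2 = (depth_map[int(py * h), int(px * w)] for px, py in (p1, p2))
    question = (f"Which point is closer to the camera: "
                f"<{p1[0]:.2f},{p1[1]:.2f}> or <{p2[0]:.2f},{p2[1]:.2f}>?")
    return question, "the first" if d1 < d2 else "the second"

# Example with a synthetic depth map (values in arbitrary raw units):
dm = np.random.randint(500, 5000, size=(480, 640))
print(depth_qa(dm, 0.55, 0.73))
print(proximity_qa(dm, (0.20, 0.30), (0.80, 0.60)))
```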

RussRobin commented 1 month ago

I'll close this issue since no further discussion has been raised for a week. Feel free to reopen it if you still have questions about our paper.