BAAI-DCAI / SpatialBot

The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models".

Clarification on Testing Results for SpatialBot-3B Model #13

Closed: robotlab2024 closed this issue 1 month ago

robotlab2024 commented 1 month ago

Hi @RussRobin,

Thank you for your great work! I found it very interesting and decided to test the models using SpatialBench. However, I noticed that the performance of your proposed model, SpatialBot-3B, shows almost no improvement compared to Bunny-v1_0-3B in my tests, which doesn’t seem to align with the results in TABLE I of your paper.

I’m wondering if I might have selected the wrong version of the Bunny model, or if there is something I’m missing in the evaluation process.

Could you kindly provide some clarification?

Looking forward to your response!

Best regards

RussRobin commented 1 month ago

Hi @robotlab2024 , thank you for your interest in our work.

For Table 1 in the paper: I trained Bunny on Bunny695k in the multi-image version so it can be tested with RGBD input (the data is essentially the same as Bunny v1.0 & Bunny695k). I just tested with Bunny v1.0 3B:

Bunny RGB
positional: 17 out of 34 (50.0%)
existence: 15 out of 20 (75.0%)
counting: 92.41 out of 100
reach: 20 out of 60 (33.33%)
size: 15 out of 60 (25.0%)

The Bunny results are almost the same as my version of Bunny (as in Table 1), and reach is even lower.

I also tested with the newest version of SpatialBot-3B:

SpatialBot RGBD (newest)
positional: 17 out of 34 (50.0%)
existence: 13 out of 20 (65.0%)
counting: 87.41 out of 100
reach: 30 out of 60 (50.0%)
size: 16 out of 60 (26.67%)

SpatialBot RGB (newest)
positional: 21 out of 34 (61.76%)
existence: 16 out of 20 (80.0%)
counting: 87.41 out of 100
reach: 33 out of 60 (55.0%)
size: 14 out of 60 (23.33%)

And an old version of SpatialBot, which you can download with huggingface-cli download --resume-download RussRobin/SpatialBot-3B --local-dir … --revision 0f9fb9330440c262f68da699524236df6f8ebef7:

SpatialBot RGBD (old version)
positional: 20 out of 34 (58.82%)
existence: 12 out of 20 (60.0%)
counting: 93.13 out of 100
reach: 30 out of 60 (50.0%)
size: 19 out of 60 (31.67%)

SpatialBot RGB (old version)
positional: 21 out of 34 (61.76%)
existence: 15 out of 20 (75.0%)
counting: 92.41 out of 100
reach: 31 out of 60 (51.67%)
size: 17 out of 60 (28.33%)

The test code I'm using:

#!/bin/bash
MODEL_TYPE=phi-2
python -m bunny.eval.eval_spatialbench \
    --model-path ... \
    --model-type $MODEL_TYPE \
    --data-path ./eval/spatial_bench \
    --conv-mode bunny \
    --question size.json \
    --depth # or comment out for RGB input

I believe the results are consistent with the paper, with a little fluctuation, and SpatialBot does show improvements over Bunny. Please share your test results and code so we can dig into it deeper. Again, thanks a lot for your interest and experiments with SpatialBot.

P.S. The eval code was changed 3 weeks ago in this commit, plus a small bug fix in this commit; please make sure you are up to date.

RussRobin commented 1 month ago

Quick comparison chart:

[comparison chart image]

robotlab2024 commented 1 month ago

Hi @RussRobin ,

Thank you for your detailed response!

I noticed that in your test set, the number of questions for the four categories—existence, counting, reach, and size—are 20, 100, 60, and 60, respectively. However, in the publicly available SpatialBench dataset, the numbers for these categories are 40, 20, 40, and 40, respectively. I wonder if this discrepancy might be the cause of the issue.

RussRobin commented 1 month ago

The number of images is not the number of questions or the score denominator. You may want to refer to Section 7 in the supplementary material of the arXiv paper:

  1. ‘Only when the model answers both the positive and negative questions of a problem correctly is it considered correct.’ So this would be: 1 image, 2 questions, 1 point.

  2. ‘We first calculate the rate of correct choices from models. When it answers a pair of positive and negative questions correctly, we give it a bonus score.’ In this case: 1 image, 2 questions, 3 points.

Sorry that this evaluation is not very clearly explained in the paper, but the official benchmark on HF and the eval code in this repo should make it clear. In tonight's testing, I downloaded the benchmark and models directly from HF, so it should be the same as yours.

RussRobin commented 1 month ago

Existence: one point per image if the positive-negative question pair is answered correctly (1 pt if and only if both the positive and the negative answer are correct). This is a common criterion in existence VQAs.

Counting: the answer is numerical, so the score is accuracy in %.

Reach and Size: bonus points for answering the positive-negative pair correctly (only one correct: 1 pt; both correct: 1 + 1 + 1 bonus = 3 pts). We roughly follow MME in this bonus-point criterion.
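
Assuming the public question counts quoted above (40 existence, 20 counting, 40 reach, 40 size), this also explains the score denominators: 40 existence questions form 20 pairs worth 1 point each (max 20), 40 reach or size questions form 20 pairs worth up to 3 points each (max 60), and counting is reported as accuracy out of 100. A minimal sketch of the pair-based scoring, with illustrative helper names rather than the repo's actual eval functions:

# Minimal sketch of the pair-based scoring described above.
# Not the repo's eval code; `pairs` is an illustrative structure:
# one (positive_correct, negative_correct) tuple of booleans per image.

def score_existence(pairs):
    # 1 point per image only if BOTH the positive and the negative
    # question are answered correctly.
    return sum(1 for pos_ok, neg_ok in pairs if pos_ok and neg_ok)

def score_with_bonus(pairs):
    # Reach / Size: 1 point per correct answer, plus 1 bonus point
    # when the whole pair is correct (1 + 1 + 1 = 3 for a full pair).
    score = 0
    for pos_ok, neg_ok in pairs:
        score += int(pos_ok) + int(neg_ok)
        if pos_ok and neg_ok:
            score += 1
    return score

# Example with 20 image pairs: existence max = 20, reach/size max = 60.
pairs = [(True, True)] * 10 + [(True, False)] * 10
print(score_existence(pairs))    # 10 out of 20
print(score_with_bonus(pairs))   # 10*3 + 10*1 = 40 out of 60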

robotlab2024 commented 1 month ago

Thank you very much for your response. I sincerely apologize for not reading the model evaluation section of the paper carefully. I really appreciate you taking the time to address my questions.

Hope you have a good night, @RussRobin.

RussRobin commented 1 month ago

No worries. Great thanks for your interest in our work! I will close this issue then.