Hi @robotlab2024 , thank you for your interest in our work.
For Table 1 in the paper: I trained Bunny on Bunny695k, in the multi-image version so that it can be tested with RGBD (the data is essentially the same as Bunny v1.0 & Bunny695k). I just tested with Bunny v1.0 3B:
Bunny RGB:
positional: 17 out of 34, 50.0%
existence: 15 out of 20, 75.0%
counting: 92.41 out of 100
reach: 20 out of 60, 33.33%
size: 15 out of 60, 25.0%
The Bunny results are almost the same as my version of Bunny (as in Table 1), and reach is even lower.
I also tested with the newest version of SpatialBot-3B:
SpatialBot RGBD, newest version:
positional: 17 out of 34, 50.0%
existence: 13 out of 20, 65.0%
counting: 87.41 out of 100
reach: 30 out of 60, 50.0%
size: 16 out of 60, 26.67%
SpatialBot RGB, newest version:
positional: 21 out of 34, 61.76%
existence: 16 out of 20, 80.0%
counting: 87.41 out of 100
reach: 33 out of 60, 55.0%
size: 14 out of 60, 23.33%
And an old version of SpatialBot, which you can download with `huggingface-cli download --resume-download RussRobin/SpatialBot-3B --local-dir … --revision 0f9fb9330440c262f68da699524236df6f8ebef7`:
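If it is more convenient, the same revision can be pinned from Python with `huggingface_hub` (a minimal sketch of my own, not code from the repo; the `local_dir` path is just a placeholder):

```python
# Sketch: download the old SpatialBot-3B revision, equivalent to the CLI command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="RussRobin/SpatialBot-3B",
    revision="0f9fb9330440c262f68da699524236df6f8ebef7",  # old revision mentioned above
    local_dir="SpatialBot-3B-old",  # placeholder path, use your own
)
```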
SpatialBot RGBD, old version:
positional: 20 out of 34, 58.82%
existence: 12 out of 20, 60.0%
counting: 93.13 out of 100
reach: 30 out of 60, 50.0%
size: 19 out of 60, 31.67%
SpatialBot RGB, old version:
positional: 21 out of 34, 61.76%
existence: 15 out of 20, 75.0%
counting: 92.41 out of 100
reach: 31 out of 60, 51.67%
size: 17 out of 60, 28.33%
The test script I'm using:
```bash
#!/bin/bash
MODEL_TYPE=phi-2
python -m bunny.eval.eval_spatialbench \
    --model-path ... \
    --model-type $MODEL_TYPE \
    --data-path ./eval/spatial_bench \
    --conv-mode bunny \
    --question size.json \
    --depth # or comment out --depth for RGB input
```
I believe the results are consistent with the paper, with a bit of fluctuation, and SpatialBot does improve over Bunny. Please share your test results and code and let's dig into it more deeply. Again, thanks a lot for your interest and experiments with SpatialBot.
P.S. The eval code was changed 3 weeks ago in this commit, with a small bug fix in this commit; please make sure you are up to date.
Quick comparison chart:
Hi @RussRobin ,
Thank you for your detailed response!
I noticed that in your test set, the numbers of questions for the four categories (existence, counting, reach, and size) are 20, 100, 60, and 60, respectively. However, in the publicly available SpatialBench dataset, the numbers for these categories are 40, 20, 40, and 40, respectively. I wonder if this discrepancy might be the cause of the issue.
The number of images is not the number of questions/scores. You may want to refer to Section 7 of the supplementary material in the arXiv paper:
‘Only when the model answers both the positive and negative questions of a problem correctly is it considered correct.’ So this would be: 1 image, 2 questions, 1 point.
‘We first calculate the rate of correct choices from models. When it answers a pair of positive and negative questions correctly, we give it a bonus score.’ In this case, 1 image, 2 questions, 3 points.
Sorry that this evaluation is not explained very clearly in the paper, but the official benchmark on HF and the eval code in this repo should make it clear. In tonight's testing, I downloaded the benchmark and models directly from HF, so it should be the same as yours.
Existence: one point for answering a pos-neg question pair correctly (1 pt if and only if both pos and neg are correct). This is a common criterion in existence VQAs.
Counting: numerical, so it's accuracy in %.
Reach and Size: bonus points for answering a pos-neg pair correctly (only one correct: 1 pt; both correct: 1 + 1 + 1 bonus = 3 pts). We roughly follow MME in this bonus-point criterion.
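For anyone sanity-checking their numbers, here is a minimal sketch of the pair-based scoring described above (my paraphrase of the criteria, not the actual `eval_spatialbench` code; the example `pairs` list is hypothetical):

```python
# Each item is a (pos_correct, neg_correct) tuple for one image's question pair.

def existence_score(pairs):
    # 1 point per pair, only if BOTH the positive and negative answers are correct.
    return sum(1 for pos, neg in pairs if pos and neg)

def bonus_score(pairs):
    # Reach / size: 1 point per correct answer, plus 1 bonus point when the
    # whole pair is correct, so a pair is worth 0, 1, or 3 points.
    score = 0
    for pos, neg in pairs:
        score += int(pos) + int(neg)
        if pos and neg:
            score += 1
    return score

# Hypothetical example: 20 pairs, so max existence = 20 and max reach/size = 60.
pairs = [(True, True)] * 10 + [(True, False)] * 5 + [(False, False)] * 5
print(existence_score(pairs))                         # 10 (out of 20)
print(bonus_score(pairs), "out of", 3 * len(pairs))   # 35 out of 60
```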
Thank you very much for your response. I sincerely apologize for not reading the model evaluation section of the paper carefully. I really appreciate you taking the time to address my questions.
Hope you have a good night, @RussRobin.
No worries. Thanks a lot for your interest in our work! I will close this issue then.
Hi @RussRobin,
Thank you for your great work! I found it very interesting and decided to test the models using SpatialBench. However, I noticed that the performance of your proposed model, SpatialBot-3B, shows almost no improvement compared to Bunny-v1_0-3B in my tests, which doesn’t seem to align with the results in TABLE I of your paper.
I’m wondering if I might have selected the wrong version of the Bunny model, or if there is something I’m missing in the evaluation process.
Could you kindly provide some clarification?
Looking forward to your response!
Best regards