OpenGVLab / LAMM

[NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents
https://openlamm.github.io/

Zero-Shot results on 3D_Benchmark and official ckpt #41

Closed hmxiong closed 10 months ago

hmxiong commented 10 months ago

Sorry to bother you again. I have now run the benchmark test with the official code. Here are my questions:

  1. When I test with the official ckpt, the accuracy does not match the paper. I merged the Vicuna delta weights with FastChat-0.1.10 (https://github.com/lm-sys/FastChat/tree/v0.1.10), the LLaMA weights are from Hugging Face (https://huggingface.co/huggyllama/llama-13b), and my transformers version is shown in the figure below;
  2. How is the 99% accuracy on the 3D VQA task mentioned in the paper achieved? For fine-tuning, is the data used in this part entirely the ScanQA training data?
  3. The results I reproduced myself also differ somewhat from those in the paper, and I don't know where the problem may be. All my experimental data is shown in the figure below.

[screenshots: transformers version and reproduced experimental results]

wangjiongw commented 10 months ago

Thanks for your attention. For your questions,

  1. the Vicuna delta weights we used are v0; I'm not sure how they differ from what you used.
  2. finetuning for a specific task means finetuning the checkpoint trained on the LAMM dataset on the training set of the corresponding dataset.
  3. actually, we cleaned up some typos in the datasets and updated the LAMM checkpoints. The zero-shot performance for 3D detection & 3D VQA is 8.2 & 24.90 respectively; you can check the new results in the README. Although that differs from what you reproduced, I think the reason is the randomness in the LLM's outputs.

hmxiong commented 10 months ago

Thank you very much for the guidance. I think the main reason may be a mismatch of LLaMA weights. Although I used the v0 delta weights, I will switch the dataset and try again to see if that solves the reproduction problem. At this stage, when testing with the official ckpt, I found that the results fluctuate to some extent.

hmxiong commented 10 months ago

Could you provide the method for building the finetuning data? Mainly, I would like to ask how the options are obtained when testing on the ScanQA dataset (how the QA questions are converted into multiple-choice questions), and how the ScanNet detection bounding box information is converted into text format.

wangjiongw commented 10 months ago

Sorry for the late reply. We can upload the finetuning data after a security check. Here is how we built it from the existing datasets.

For ScanQA, we used the ChatGPT API to generate related but confusing options based on the given ground truth, then combined them with the ground truth to form the full option set for each question. For ScanNet bounding boxes, we represent each box with 6 numbers: the x, y, z coordinates of the center and the lengths of the edges. We then prompted ChatGPT to generate templates that link a class label and a bounding box into a sentence, similar to the 2D detection case. A rough sketch of both steps is shown below.
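As an illustration only (not the exact script used to build the released data; the prompt wording, the template strings, and the helper names `make_options` / `box_to_text` are assumptions), the two conversions could look roughly like this with the pre-1.0 `openai` Python SDK:

```python
import random
import openai  # pre-1.0 SDK; assumes OPENAI_API_KEY is set in the environment

def make_options(question: str, answer: str, n_distractors: int = 3) -> list:
    """Ask ChatGPT for plausible-but-wrong options, then mix in the ground truth.
    The prompt below is an assumption, not the authors' exact wording."""
    prompt = (
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Write {n_distractors} short answer options that are related but incorrect, "
        "one per line, with no numbering."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    distractors = [l.strip() for l in lines if l.strip()][:n_distractors]
    options = distractors + [answer]
    random.shuffle(options)  # ground truth should not always sit in the same slot
    return options

# Hypothetical sentence templates standing in for the ChatGPT-generated ones.
BOX_TEMPLATES = [
    "There is a {label} centered at ({cx:.2f}, {cy:.2f}, {cz:.2f}) "
    "with edge lengths ({lx:.2f}, {ly:.2f}, {lz:.2f}).",
    "A {label} is located at ({cx:.2f}, {cy:.2f}, {cz:.2f}), "
    "sized ({lx:.2f}, {ly:.2f}, {lz:.2f}).",
]

def box_to_text(label: str, box) -> str:
    """Render a box given as (cx, cy, cz, lx, ly, lz): center + edge lengths."""
    cx, cy, cz, lx, ly, lz = box
    return random.choice(BOX_TEMPLATES).format(
        label=label, cx=cx, cy=cy, cz=cz, lx=lx, ly=ly, lz=lz
    )

print(box_to_text("chair", (1.20, 0.55, 0.43, 0.60, 0.58, 0.92)))
```

Shuffling in `make_options` keeps the ground truth from always landing in the same position, and `box_to_text` mirrors the 2D detection case: only the templates and the number of coordinates change.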

Hope this can solve your problems.

wangjiongw commented 9 months ago

> Could you provide the method for building the finetuning data? Mainly, I would like to ask how the options are obtained when testing on the ScanQA dataset (how the QA questions are converted into multiple-choice questions), and how the ScanNet detection bounding box information is converted into text format.

For your reference, the finetuning data for ScanQA multiple choice is available on Hugging Face.