OpenRobotLab / PointLLM

[ECCV 2024 Best Paper Candidate] PointLLM: Empowering Large Language Models to Understand Point Clouds
https://runsenxu.com/projects/PointLLM

questions about test data. #36

Closed wefwefWEF2 closed 1 month ago

wefwefWEF2 commented 2 months ago

Thanks for the fantastic work.

I have some questions. I noticed that the Objaverse evaluation set used for the traditional-metric results contains only 200 objects. Is that correct? Isn't that quantity too small for evaluation? Also, how can I generate results for the 3000 reserved objects? Thanks a lot!!

RunsenXu commented 2 months ago

Hi,

Thank you for your good question!

  1. Yes, we only use 200 samples from Objaverse for evaluation.
  2. There are indeed a few inaccuracies in the human-annotated data in Cap3D. Since we randomly sampled those 200 objects for evaluation, several objects may have incorrect annotations. However, this does not significantly affect our ability to use the benchmark to determine which model performs better.
  3. While using more data for evaluation would be ideal, we need to manage costs since we employ human evaluators and use GPT. Given the consistent results on Objaverse and ModelNet (thousands of data points), we believe using 200 objects is informative enough.
  4. To evaluate the 3000 objects we’ve reserved, I have uploaded the corresponding ground-truth file here: https://huggingface.co/datasets/RunsenXu/PointLLM/blob/main/PointLLM_brief_description_val_3000_GT.json (a quick usage sketch follows below). :)
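
For reference, here is a small Python sketch for sanity-checking the downloaded file before running the evaluation. The local path is hypothetical, and the snippet only assumes the file is valid JSON (a list or a dict of ground-truth captions), not any particular schema:

```python
import json

# Hypothetical local path; point this at wherever you saved the file
# downloaded from the Hugging Face link above.
anno_path = "PointLLM_brief_description_val_3000_GT.json"

with open(anno_path, "r") as f:
    gt = json.load(f)

# Whether the file is a list of records or a dict keyed by object id,
# len() reports how many ground-truth annotations it holds (should be ~3000).
print(f"Loaded {len(gt)} ground-truth annotations from {anno_path}")
```

After downloading it, you should be able to run the same eval_objaverse.py command as for the 200-object split and simply point --anno_path at this file instead.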

Best regards, Runsen

yun263214678 commented 2 months ago

Thanks for your reply!!! I tried to reproduce the results in the paper using your released weights:

python /eval/eval_objaverse.py --model_name /pointllm/weights/checkpoints_paper/PointLLM_7B_v1.2 --task_type classification --prompt_index 0 --data_path /data/Objaverse_colored_point_clouds/8192_npy --anno_path /data/instruction-following_data/PointLLM_brief_description_val_200_GT.json

python pointllm/eval/traditional_evaluator.py --results_path /pointllm/weights/checkpoints_paper/PointLLM_7B_v1.2/evaluation/PointLLM_brief_description_val_200_GT_Objaverse_classification_prompt0.json

The results, e.g. 'Average BLEU-1 Score: 3.0461', differ from the 3.87 reported in the paper. How can I reproduce the reported numbers? Is it related to --batch_size during inference?
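
For context, the number I am comparing is just unigram-precision BLEU; a rough sketch of how I understand it (using nltk here, which may not match exactly what traditional_evaluator.py does internally):

```python
from nltk.translate.bleu_score import sentence_bleu

# Toy caption pair for illustration; the evaluator scores real
# model outputs against the Cap3D ground-truth captions.
reference = "a small wooden chair with four legs".split()
candidate = "a wooden chair with armrests".split()

# weights=(1, 0, 0, 0) restricts BLEU to unigram precision, i.e. BLEU-1.
bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0))

# Scaled by 100, assuming the reported scores use a 0-100 range.
print(f"BLEU-1: {bleu1 * 100:.4f}")
```

Since the score depends on the exact wording the model generates, I assume run-to-run differences in the outputs can move it somewhat.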

RunsenXu commented 2 months ago

Hi,

The model cannot generate exactly the same results each time, so it's normal to get slightly different numbers, as long as the deviation isn't too large.
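
If you want runs that are more repeatable, here is a rough sketch of the idea, assuming a standard Hugging Face-style sampling setup (the actual generation settings in pointllm may differ): fixing the random seeds before inference makes sampling-based decoding reproducible, and greedy decoding removes the randomness altogether.

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    # Fix every RNG that sampling-based decoding can draw from,
    # so repeated inference runs generate the same captions.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)

# Hypothetical generation call, shown for illustration only; the flags
# follow the transformers API and are not necessarily PointLLM's defaults.
# outputs = model.generate(
#     **inputs,
#     do_sample=True,    # stochastic decoding -> run-to-run score drift
#     temperature=1.0,
#     max_new_tokens=128,
# )
# outputs = model.generate(**inputs, do_sample=False)  # greedy: deterministic
```

Either way, small deviations in the BLEU-1 score from the number in the paper are expected.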