Can't reproduce the BLEU, METEOR and ROUGE_L numbers in Table 5 (page 6), Evaluation on Point Cloud-Text Tasks #23

zhurob commented 2 months ago

I have used https://github.com/csuhan/OneLLM/blob/main/docs/Evaluation.md:

Point-Text Evaluation

PointLLM Caption

Download PointLLM data from this link.
Fill pretrained_path in eval/point_cap_pointllm.py and run: python eval/point_cap_pointllm.py.
Evaluate with eval/caption_eval.py. The annotation file is at datasets/Eval/point/pointllm_test_cococap.json

Several of my team members and I tried to reproduce Table 5 for OneLLM. We all got similarly low BLEU, METEOR and ROUGE_L numbers, shown below, and CIDEr is zero. Can you please double-check? We believe we are using the same point cloud files, scripts and model. Thank you. Rob

SPICE: 0.094
Bleu_1: 0.104
Bleu_2: 0.065
Bleu_3: 0.045
Bleu_4: 0.034
METEOR: 0.131
ROUGE_L: 0.175
CIDEr: 0.000

From https://arxiv.org/pdf/2312.03700, Page 6, Table 5, Evaluation on Point Cloud-Text Tasks. The evaluation dataset is from Objaverse [16], following the data split in PointLLM [92]. InstructBLIP takes single-view image as input, while PointLLM and OneLLM take point cloud as input. GPT4-Acc.: GPT4 as the accuracy evaluator [92].

                          Captioning                   Classification
Model                     BLEU-1   ROUGE-L   METEOR    GPT4-Acc.
InstructBLIP-7B [15]      11.2     13.9      14.9      38.5
InstructBLIP-13B [15]     12.6     15.0      16.0      35.5
PointLLM-7B [92]          8.0      11.1      15.2      47.5
PointLLM-13B [92]         9.7      12.8      15.3      45.0
One-LLM-7B (Ours)         42.2     45.3      20.3      44.5

csuhan commented 2 months ago

Our point cloud caption results were evaluated with the Phase II model (Multimodal Alignment). The final model after instruction tuning tends to output long, detailed responses, while the caption benchmark expects a short sentence, so it performs poorly on the benchmark.
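To illustrate why long responses score low against short reference captions, here is a minimal sketch of clipped unigram precision, the quantity underlying BLEU-1 (without the brevity penalty). The captions and the function name are made up for illustration; this is not the code in eval/caption_eval.py.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: matched candidate tokens / candidate length."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token counts at most as often as it appears in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


# Illustrative captions (assumptions, not taken from the benchmark data).
reference = "a red wooden chair"
short_caption = "a red chair"
long_caption = (
    "this is a 3d model of a red chair made of wood with four legs "
    "and a tall back that could be used in a dining room"
)

short_score = unigram_precision(short_caption, reference)
long_score = unigram_precision(long_caption, reference)
print(f"short: {short_score:.3f}, long: {long_score:.3f}")
```

The long answer names the same object, but most of its tokens have no match in the short reference, so its precision drops sharply even though the description is arguably correct.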

A simple way to improve the scores is to change the task prompt from "What is this?" to "Provide a one-sentence caption": https://github.com/csuhan/OneLLM/blob/73393b17a14fa58a179b450a2fe2d2d640dd61fc/eval/point_cap_pointllm.py#L38C21-L38C34
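As a rough sketch, the prompt change could look like the following. The constant and function names are hypothetical and are not taken from eval/point_cap_pointllm.py; only the two prompt strings come from this thread.

```python
# Hypothetical names for illustration; the real script may structure this differently.
ORIGINAL_PROMPT = "What is this?"
SHORT_CAPTION_PROMPT = "Provide a one-sentence caption"


def task_prompt(short_caption: bool = True) -> str:
    """Return the task prompt sent with each point cloud."""
    return SHORT_CAPTION_PROMPT if short_caption else ORIGINAL_PROMPT


print(task_prompt())
```

Keeping the original prompt available makes it easy to compare both settings on the same checkpoint.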

zhurob commented 2 months ago

Good fix. Thank you very much, verified, works.
