Open3DA / LL3DA

[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
https://ll3da.github.io/
MIT License

Issue with reproducing experiment results. #18

Open · gaohan-cmd opened this issue 1 month ago

gaohan-cmd commented 1 month ago

Hi, I think your work is very meaningful to me, but I encountered some issues while trying to replicate it. Are you using the pre-trained weights from https://huggingface.co/CH3COOK/LL3DA-weight-release/tree/main for the Table 5 experiment? I used the following command to evaluate the ScanRefer results, as shown in figure 1.

```bash
python main.py \
    --use_color --use_normal \
    --detector detector_Vote2Cap_DETR \
    --captioner ll3da \
    --checkpoint_dir ./ckpts/opt-1.3b/ll3da-generalist \
    --test_ckpt ./ckpts/opt-1.3b/ll3da-generalist/ll3da-opt-1.3b.pth \
    --dataset unified_densecap_scanrefer \
    --vocab facebook/opt-1.3b \
    --qformer_vocab bert-base-embedding \
    --dist_url tcp://localhost:222 \
    --criterion 'CiDEr@0.5' \
    --freeze_detector --freeze_llm \
    --batchsize_per_gpu 8 --ngpus 2 \
    --max_des_len 256 \
    --max_prompt 1 \
    --use_beam_search \
    --test_only
```

I fine-tuned it first using the following command:

```bash
python main.py \
    --use_color --use_normal \
    --detector detector_Vote2Cap_DETR \
    --captioner ll3da \
    --pretrained_weights ./ckpts/opt-1.3b/ll3da-generalist/ll3da-opt-1.3b.pth \
    --warm_lr_epochs 0 \
    --dataset unified_densecap_scanrefer \
    --vocab facebook/opt-1.3b \
    --qformer_vocab bert-base-embedding \
    --checkpoint_dir ./ckpts/opt-1.3b/ll3da-scanrefer-tuned \
    --max_epoch 16 \
    --dist_url tcp://localhost:222 \
    --eval_every_iteration 4000 \
    --start_eval_after -1 \
    --save_every 10000 \
    --criterion 'CiDEr@0.5' \
    --freeze_detector --freeze_llm \
    --batchsize_per_gpu 8 --ngpus 2 --base_lr 1e-6 --final_lr 1e-6 \
    --max_des_len 256 \
    --max_prompt 1 --use_beam_search
```

After it finished, I used checkpoint_best.pth for evaluation with the command below, but my experimental results did not reach the 65.19 reported in the paper. What could be the issue?

```bash
python main.py \
    --use_color --use_normal \
    --detector detector_Vote2Cap_DETR \
    --captioner ll3da \
    --checkpoint_dir ./ckpts/opt-1.3b/ll3da-scanrefer-tuned \
    --test_ckpt ./ckpts/opt-1.3b/ll3da-scanrefer-tuned/checkpoint_best.pth \
    --dataset unified_densecap_scanrefer \
    --vocab facebook/opt-1.3b \
    --qformer_vocab bert-base-embedding \
    --dist_url tcp://localhost:222 \
    --criterion 'CiDEr@0.5' \
    --freeze_detector --freeze_llm \
    --batchsize_per_gpu 8 --ngpus 2 \
    --max_des_len 256 \
    --max_prompt 1 \
    --use_beam_search \
    --test_only
```

[screenshots of the evaluation results attached]

ch3cook-fdu commented 1 month ago

Please see https://github.com/Open3DA/LL3DA/issues/11 for more details.

YiwuZhong commented 1 month ago

@ch3cook-fdu Thanks for your explanation and nice work!

However, I hit the same issue as @gaohan-cmd using the pre-trained model weights you uploaded to Hugging Face. After fine-tuning, I got 61.8 CIDEr and 35.0 B4, while the results reported in the paper are 65.2 CIDEr and 36.8 B4.

I understand there is some randomness, but a ~3-point gap in CIDEr and a ~2-point gap in B4 are already large. Could you please also verify the reproduction of the paper's results on your side?

ch3cook-fdu commented 1 month ago

To unleash the full potential of LL3DA, I encourage you to:

  1. Train the Vote2Cap-DETR model to align your copy of 3D points with the scene encoder weights.
  2. Train the LL3DA generalist to see whether the results align.

Because the scene encoder is frozen, the ability to perceive the 3D scene might be the bottleneck for reproduction.
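
For anyone following along, a rough sketch of that two-step pipeline. Step 1 happens in the Vote2Cap-DETR repo (see its README for the actual training command); step 2 reuses the flag conventions from the commands earlier in this thread, but the dataset list and checkpoint paths here are placeholders, not the exact generalist recipe:

```bash
# Step 1: train the scene encoder in the Vote2Cap-DETR repo
# (https://github.com/ch3cook-fdu/Vote2Cap-DETR), then copy its
# checkpoint here. The destination path below is a placeholder.

# Step 2: train the LL3DA generalist on top of that encoder.
# Flags mirror the fine-tuning command above; the dataset list is
# a placeholder -- check the LL3DA README for the generalist recipe.
python main.py \
    --use_color --use_normal \
    --detector detector_Vote2Cap_DETR \
    --captioner ll3da \
    --pretrained_weights ./pretrained/vote2cap_detr/scene_encoder.pth \
    --checkpoint_dir ./ckpts/opt-1.3b/ll3da-generalist-repro \
    --dataset <generalist_dataset_list> \
    --vocab facebook/opt-1.3b \
    --qformer_vocab bert-base-embedding \
    --freeze_detector --freeze_llm \
    --batchsize_per_gpu 8 --ngpus 2 \
    --max_des_len 256 --max_prompt 1
```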

YiwuZhong commented 1 month ago

@ch3cook-fdu Thanks for your response and suggestion!

Is there any script in this repo (LL3DA) that I can follow to train the detector?

ch3cook-fdu commented 1 month ago

You can follow the instructions in https://github.com/ch3cook-fdu/Vote2Cap-DETR, and copy the pretrained weights to this repo.
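
Concretely, that copy step might look like this (both paths are hypothetical; adjust them to your Vote2Cap-DETR output directory and to wherever you point LL3DA's --pretrained_weights flag):

```bash
# Hypothetical paths: source is Vote2Cap-DETR's best checkpoint,
# destination is the weights file LL3DA is pointed at.
cp ../Vote2Cap-DETR/exp_scannet/checkpoint_best.pth \
   ./pretrained/vote2cap_detr/scene_encoder.pth
```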

I might try uploading the point cloud data I processed as well, to see whether your reproduction aligns.

YiwuZhong commented 1 month ago

Uploading your processed data would be helpful to me and other researchers. Thanks!

ch3cook-fdu commented 3 weeks ago

> Uploading your processed data would be helpful to me and other researchers. Thanks!

Hi, we finally managed to upload the processed data to https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/scannet_data.zip.
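
To fetch it from the command line, the blob URL can be swapped for Hugging Face's resolve endpoint for a direct download; the target directory below is an assumption, so unpack to wherever the LL3DA data loaders expect the ScanNet data:

```bash
wget https://huggingface.co/CH3COOK/LL3DA-weight-release/resolve/main/scannet_data.zip
unzip scannet_data.zip -d ./data/scannet/
```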

YiwuZhong commented 2 weeks ago

> > Uploading your processed data would be helpful to me and other researchers. Thanks!
>
> Hi, we finally managed to upload the processed data to https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/scannet_data.zip.

@ch3cook-fdu Thank you for uploading the data. One more thing I noticed: the detector uses the "aligned" version of vertices and boxes, unlike 3DETR. Is there any reason for doing this?

ch3cook-fdu commented 2 weeks ago

For 3D-VL studies, it is common practice to use axis-aligned 3D data. You can refer to other repos such as ScanRefer and Scan2Cap.
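
For readers unfamiliar with this step: each ScanNet scene ships a per-scene `<scene_id>.txt` meta file containing an `axisAlignment` matrix that rotates the scene so walls are roughly axis-parallel, which makes axis-aligned bounding boxes much tighter. A minimal sketch of the usual ScanRefer-style preprocessing (the function name here is ours, not from the repo):

```python
import numpy as np

def axis_align(mesh_vertices: np.ndarray, meta_file: str) -> np.ndarray:
    """Apply ScanNet's axisAlignment matrix to raw scene vertices.

    mesh_vertices: (N, C) array whose first 3 columns are xyz.
    meta_file: path to the per-scene <scene_id>.txt shipped with ScanNet.
    """
    axis_align_matrix = np.eye(4)
    with open(meta_file) as f:
        for line in f:
            if line.startswith('axisAlignment'):
                vals = [float(x) for x in line.strip().split('=')[1].split()]
                axis_align_matrix = np.array(vals).reshape(4, 4)
                break
    # Lift xyz to homogeneous coordinates, then apply the 4x4 transform.
    pts = np.ones((mesh_vertices.shape[0], 4))
    pts[:, :3] = mesh_vertices[:, :3]
    aligned = pts @ axis_align_matrix.T
    out = mesh_vertices.copy()
    out[:, :3] = aligned[:, :3]
    return out
```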