gaohan-cmd opened 5 months ago
Please see https://github.com/Open3DA/LL3DA/issues/11 for more details.
@ch3cook-fdu Thanks for your explanation and nice work!
However, I met the same issue as @gaohan-cmd when using the pre-trained model weights you uploaded to Hugging Face. After fine-tuning, I got 61.8 CIDEr and 35.0 BLEU-4, while the results reported in the paper are 65.2 CIDEr and 36.8 BLEU-4.
I understand that there will be some randomness. But a gap of ~3 points in CIDEr and ~2 points in BLEU-4 is already large. Could you please also verify the reproduction of the paper's results on your side?
To unleash the full potential of LL3DA, I encourage you to train the scene encoder (the 3D detector) yourself. Because the scene encoder is frozen during LL3DA training, its ability to perceive the 3D scene might be the bottleneck for reproduction.
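As a rough illustration (not the repo's actual code), "frozen" here just means the scene encoder's parameters receive no gradient updates during fine-tuning; in the commands later in this thread that corresponds to the --freeze_detector flag:

import torch.nn as nn

def freeze_module(module: nn.Module) -> None:
    # Stop gradient updates for every parameter of the module.
    for param in module.parameters():
        param.requires_grad = False
    # Keep batch-norm / dropout layers in inference mode as well.
    module.eval()

# Hypothetical usage: `model.detector` stands in for the frozen scene encoder.
# freeze_module(model.detector)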
@ch3cook-fdu Thanks for your response and suggestion!
Is there any script in this repo (LL3DA) that I can follow to train the detector?
You can follow the instructions in https://github.com/ch3cook-fdu/Vote2Cap-DETR, and copy the pretrained weights to this repo.
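As a quick sanity check after copying the weights over, something like the sketch below confirms the checkpoint loads and lets you inspect what it contains (the path and key names here are assumptions, adjust to your setup):

import torch

# Hypothetical path: wherever you copied the Vote2Cap-DETR checkpoint to.
ckpt = torch.load("pretrained/vote2cap_detr_XYZ_COLOR_NORMAL.pth", map_location="cpu")
print(list(ckpt.keys()))  # inspect the top-level keys (e.g. a "model" state dict, if present)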
I might try uploading the point cloud data I processed as well, to see whether your reproduction aligns.
Uploading your processed data would be helpful to me and other researchers. Thanks!
Hi, we finally managed to upload the processed data to https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/scannet_data.zip .
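For anyone fetching the data programmatically, here is a minimal sketch using huggingface_hub (the repo id and filename are taken from the link above; the extraction target is an assumption):

from huggingface_hub import hf_hub_download
import zipfile

# Download scannet_data.zip from the release repo linked above.
zip_path = hf_hub_download(
    repo_id="CH3COOK/LL3DA-weight-release",
    filename="scannet_data.zip",
)

# "data/" is an assumed target; point it at wherever LL3DA expects the ScanNet data.
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("data/")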
@ch3cook-fdu Thank you for uploading the data. One more thing I noticed is that the detector uses the "aligned" version of the vertices and boxes, unlike 3DETR. Is there any reason for doing this?
For 3D vision-language studies, it is common practice to use axis-aligned 3D data. You can refer to other repos such as ScanRefer and Scan2Cap.
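For reference, the alignment itself is the standard ScanNet preprocessing step: apply the scene's 4x4 axisAlignment matrix (read from the scene's meta .txt file) to the raw vertices, and recompute boxes from the aligned points. A minimal numpy sketch of that transform:

import numpy as np

def align_vertices(xyz: np.ndarray, axis_align_matrix: np.ndarray) -> np.ndarray:
    # Apply a scene's 4x4 axisAlignment matrix to an (N, 3) array of vertices.
    homogeneous = np.ones((xyz.shape[0], 4))
    homogeneous[:, :3] = xyz
    return (homogeneous @ axis_align_matrix.T)[:, :3]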
@ch3cook-fdu Thanks for the provided data; the paper's results can be reproduced using the pre-trained weights.
On the other hand, we tried to train the detector using the LL3DA repo, but failed to reproduce the detection performance of the detector trained with your Vote2Cap-DETR repo (a ~2.0 mAP@0.5 gap). We already tried fixing the random seed and re-enabling use_random_cuboid. What other gaps exist between these two repos in the detector part? Thanks!
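(For reference, seed fixing of this kind usually amounts to something like the sketch below; the exact seeding logic in either repo may differ.)

import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Fix the common sources of randomness for a more reproducible run.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)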
Could you provide me with more details on your choice of hyperparameters? Using 1 GPU with a batch size of 8 might help.
@ch3cook-fdu The following script is used in the LL3DA repo to train the detector. Did we miss anything from Vote2Cap-DETR?
python main.py \
--use_color \
--use_normal \
--detector detector_Vote2Cap_DETR \
--warm_lr_epochs 9 \
--dataset scannet \
--checkpoint_dir ./Vote2Cap_DETR_XYZ_COLOR_NORMAL \
--max_epoch 1080 \
--eval_every_iteration 2000 \
--start_eval_after 1999 \
--save_every 2000 \
--criterion 'mAP@0.5' \
--batchsize_per_gpu 8 \
--ngpus 1 \
--base_lr 5e-4 \
--final_lr 1e-6 \
--lr_scheduler 'cosine' \
--weight_decay 0.1 \
--clip_gradient 0.1
The random cuboid is disabled in our implementation. https://github.com/Open3DA/LL3DA/blob/main/datasets/scannet.py#L27
Please try using the original Vote2Cap-DETR repo for reproduction.
Hi, I think your work is very meaningful to me, but I encountered some issues while trying to replicate it. Are you using the pre-trained weights from https://huggingface.co/CH3COOK/LL3DA-weight-release/tree/main for the Table 5 experiment? I used the following command to evaluate the ScanRefer results, shown in Figure 1.
python main.py \
--use_color --use_normal \
--detector detector_Vote2Cap_DETR \
--captioner ll3da \
--checkpoint_dir ./ckpts/opt-1.3b/ll3da-generalist \
--test_ckpt ./ckpts/opt-1.3b/ll3da-generalist/ll3da-opt-1.3b.pth \
--dataset unified_densecap_scanrefer \
--vocab facebook/opt-1.3b \
--qformer_vocab bert-base-embedding \
--dist_url tcp://localhost:222 \
--criterion 'CiDEr@0.5' \
--freeze_detector --freeze_llm \
--batchsize_per_gpu 8 --ngpus 2 \
--max_des_len 256 \
--max_prompt 1 \
--use_beam_search \
--test_only
I fine-tuned it first using the following command.
python main.py \
--use_color --use_normal \
--detector detector_Vote2Cap_DETR \
--captioner ll3da \
--pretrained_weights ./ckpts/opt-1.3b/ll3da-generalist/ll3da-opt-1.3b.pth \
--warm_lr_epochs 0 \
--dataset unified_densecap_scanrefer \
--vocab facebook/opt-1.3b \
--qformer_vocab bert-base-embedding \
--checkpoint_dir ./ckpts/opt-1.3b/ll3da-scanrefer-tuned \
--max_epoch 16 \
--dist_url tcp://localhost:222 \
--eval_every_iteration 4000 \
--start_eval_after -1 \
--save_every 10000 \
--criterion 'CiDEr@0.5' \
--freeze_detector --freeze_llm \
--batchsize_per_gpu 8 --ngpus 2 \
--base_lr 1e-6 --final_lr 1e-6 \
--max_des_len 256 \
--max_prompt 1 --use_beam_search
After finishing, I use checkpoint_best.pth for evaluation. The command is as follows, but my experimental results did not reach the 65.19 reported in the paper. What could be the issue?

python main.py \
--use_color --use_normal \
--detector detector_Vote2Cap_DETR \
--captioner ll3da \
--checkpoint_dir ./ckpts/opt-1.3b/ll3da-scanrefer-tuned \
--test_ckpt ./ckpts/opt-1.3b/ll3da-scanrefer-tuned/checkpoint_best.pth \
--dataset unified_densecap_scanrefer \
--vocab facebook/opt-1.3b \
--qformer_vocab bert-base-embedding \
--dist_url tcp://localhost:222 \
--criterion 'CiDEr@0.5' \
--freeze_detector --freeze_llm \
--batchsize_per_gpu 8 --ngpus 2 \
--max_des_len 256 \
--max_prompt 1 \
--use_beam_search \
--test_only