3dlg-hcvc / M3DRef-CLIP

[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects
https://3dlg-hcvc.github.io/multi3drefer/
MIT License

Questions about the predictions on ScanRefer with the given ckpt #6

Closed by Xiaolong-RRL 9 months ago

Xiaolong-RRL commented 10 months ago

Dear author:

Thanks for your interesting work.

I have completed the entire training and inference process following the README.md, but when I run the following commands with the given ckpt:

# get the predictions
python test.py data=scanrefer data.inference.split=val +ckpt_path={M3DRef-CLIP_ScanRefer.ckpt} pred_path={predictions_path}

# evaluate predictions
python evaluate.py data=scanrefer pred_path={predictions_path} data.evaluation.split=val

I get unsatisfactory performance, far below the results reported in your README.md:

===========================================
IoU         unique      multiple    overall     
-------------------------------------------
0.25        45.3        28.6        31.8        
0.50        33.1        21.9        24.1        
===========================================

I wonder if this is expected? And how can I reproduce the results reported in the README.md?

Thanks!!

eamonn-zh commented 10 months ago

Hi @Xiaolong-RRL, sorry for the late reply. Just to confirm: you used the checkpoint we provided in the repo, right?

Xiaolong-RRL commented 10 months ago

Hi, I used this checkpoint: https://aspis.cmpt.sfu.ca/projects/m3dref-clip/pretrain/M3DRef-CLIP_ScanRefer.ckpt

And I used the multiview features that were processed here (about 36 GB), rather than the ones you provide directly at https://aspis.cmpt.sfu.ca/projects/m3dref-clip/data/enet_feats_maxpool.hdf5 (100+ GB). Could this affect the final evaluation results?
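One way to tell the two versions apart without downloading both is to look at the per-scene point counts in the HDF5 file: the smaller file caps every scene at 50,000 points, while the larger one does not. This is a sketch only, and the layout assumed here (one dataset per scene ID holding per-point features) is an assumption, not the documented format; a tiny stand-in file is built so the snippet runs on its own.

```python
# Sketch: distinguish a 50k-sampled feature file from an unsampled one
# by checking per-scene point counts. The HDF5 layout (one dataset per
# scene ID) is an ASSUMPTION for illustration, not the repo's documented
# format; "feats_demo.hdf5" is a stand-in file created below.
import h5py
import numpy as np

# build a tiny stand-in file so the snippet is self-contained
with h5py.File("feats_demo.hdf5", "w") as f:
    f.create_dataset("scene0000_00", data=np.zeros((130000, 128), dtype=np.float32))
    f.create_dataset("scene0001_00", data=np.zeros((50000, 128), dtype=np.float32))

with h5py.File("feats_demo.hdf5", "r") as f:
    capped = all(f[k].shape[0] <= 50000 for k in f)
    # True would suggest the 36 GB (50k-sampled) version
    print("looks like the 50k-sampled version:", capped)
```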

eamonn-zh commented 10 months ago

Yes, you should use the 100+ GB one; we follow the prior work D3Net. The only difference between the 36 GB and 100+ GB versions is the number of points: the former samples only 50,000 points per scene, while the latter keeps the unsampled original scenes. M3DRef-CLIP uses the 100+ GB version and does point sampling in the dataloader.
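The sampling step described above (subsample each full scene down to a fixed point budget inside the dataloader) can be sketched as follows. This is a minimal illustration, not the repo's actual dataloader code; the function name, shapes, and the 50,000-point budget are assumptions based on the comment above.

```python
# Minimal sketch of per-scene point sampling as a dataloader might do it.
# NOT the actual M3DRef-CLIP implementation; names and shapes are assumed.
import numpy as np

def sample_points(xyz, feats, num_points=50000, rng=None):
    """Randomly subsample a scene's points and their features.

    xyz:   (N, 3) point coordinates of the full, unsampled scene
    feats: (N, C) per-point features (e.g. multiview features)
    Samples with replacement only when the scene has fewer than num_points.
    """
    rng = rng or np.random.default_rng(0)
    n = xyz.shape[0]
    idx = rng.choice(n, size=num_points, replace=n < num_points)
    return xyz[idx], feats[idx]

# tiny demo with fake data standing in for one full scene
xyz = np.random.rand(120000, 3)
feats = np.random.rand(120000, 128)
s_xyz, s_feats = sample_points(xyz, feats, num_points=50000)
print(s_xyz.shape, s_feats.shape)  # (50000, 3) (50000, 128)
```

Doing the sampling at load time (rather than baking it into the feature file) is what lets the same 100+ GB file serve any point budget.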

Xiaolong-RRL commented 10 months ago

I see!! But the download speed is very slow. Would it be convenient for you to provide a Baidu Netdisk link, or to split the file into multiple parts and upload them to Google Drive for faster download? Thanks!!
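Splitting a large file into fixed-size chunks for upload, then reassembling after download, can be done with coreutils `split` and `cat`. The sketch below demonstrates the round trip on a small stand-in file; for the real ~100 GB hdf5 one would use a chunk size like `-b 5G`, and the filenames here are placeholders.

```shell
# Sketch: chunked transfer with coreutils split/cat, demoed on a tiny file.
# For the real ~100 GB hdf5, a chunk size like `-b 5G` would be typical.
printf 'demo payload' > demo.bin
split -b 4 demo.bin demo.bin.part-        # -> demo.bin.part-aa, -ab, -ac
cat demo.bin.part-* > demo_restored.bin   # glob expands in lexical order
cmp demo.bin demo_restored.bin && echo OK
```

Verifying the reassembled file against a published checksum (e.g. `sha256sum`) is a good final step before training on it.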

eamonn-zh commented 9 months ago

Sure, we are looking for an alternative place to host it, and we will also release instructions for regenerating this file.

Xiaolong-RRL commented 9 months ago

Thanks for your kind reply; I am looking forward to it~

eamonn-zh commented 9 months ago

Hi @Xiaolong-RRL. We've updated the README and added instructions for generating enet_feats_maxpool.hdf5.