3D-VisTA

Official implementation of ICCV 2023 paper "3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment"
https://3d-vista.github.io
MIT License

pc_type #19

Closed. dingjiansw101 closed this issue 8 months ago.

dingjiansw101 commented 8 months ago

Dear Authors,

I am confused about pc_type. When should pc_type be set to "pred"?

Best, Jian Ding

zhuziyu-edward commented 8 months ago

Hi, thank you for your interest in our work. For these 3D-VL tasks there are two evaluation settings: results using the ground-truth masks and results using predicted masks. On benchmarks like Sr3D and Nr3D, you should use the gt masks for evaluation. On benchmarks like ScanRefer, ScanQA, Scan2Cap, and SQA3D, you should set pc_type to "pred".
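
For reference, here is a minimal sketch of that convention; the mapping and the helper name below are illustrative, not part of the repo's code:

```python
# Illustrative only: gt masks for the referring benchmarks (Sr3D/Nr3D),
# predicted masks for the others, as described above.
PC_TYPE_BY_BENCHMARK = {
    "sr3d": "gt",
    "nr3d": "gt",
    "scanrefer": "pred",
    "scanqa": "pred",
    "scan2cap": "pred",
    "sqa3d": "pred",
}

def choose_pc_type(benchmark: str) -> str:
    """Return the pc_type setting to use when evaluating on a benchmark."""
    return PC_TYPE_BY_BENCHMARK[benchmark.lower()]
```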

Best, Ziyu

dingjiansw101 commented 8 months ago

Hi Ziyu,

Thanks for your reply. However, I found that pc_type is set to "gt" during both training and testing for the ScanQA task.

Best, Jian Ding

zhuziyu-edward commented 8 months ago

Yes, setting pc_type to "gt" makes testing faster (it is meant only for checking model performance). If you want to submit results to the benchmark, you should change it to "pred" for comparison with other papers.
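
As a quick illustration of the workflow (the config variable here is hypothetical, not the repo's actual key layout):

```python
# Hypothetical config toggle illustrating the workflow described above.
eval_cfg = {"pc_type": "gt"}    # fast sanity check with ground-truth masks

# Before producing numbers for the official benchmark / paper comparison:
eval_cfg["pc_type"] = "pred"    # evaluate with predicted masks
```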

Best, Ziyu

dingjiansw101 commented 8 months ago

Are the data for pc_type "pred" read from the path "./data/scanfamily/save_mask"? How are such files generated? Are they predicted by 3D-VisTA or by other models?

Best, Jian Ding

zhuziyu-edward commented 8 months ago

They are predicted by the Mask3D segmentation model. The masks can be found in issue #12.
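
If you just want to inspect them, something like the following works, assuming one mask file per scan under that directory (the filename pattern and the .npy format here are assumptions, not confirmed in this thread):

```python
import os
import numpy as np

MASK_DIR = "./data/scanfamily/save_mask"

def load_pred_masks(scan_id: str) -> np.ndarray:
    """Load the predicted instance masks for one scan, e.g. 'scene0000_00'."""
    path = os.path.join(MASK_DIR, f"{scan_id}.npy")  # filename pattern assumed
    return np.load(path, allow_pickle=True)
```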

Best, Ziyu

dingjiansw101 commented 8 months ago

Thanks for the reply. Did you fine-tune Mask3D on the ScanQA labels, or just use the pre-trained Mask3D model?

Best, Jian Ding

dingjiansw101 commented 8 months ago

An additional question: have you evaluated object localization performance on the ScanQA dataset?

zhuziyu-edward commented 8 months ago

Hi

  1. We use the pre-trained Mask3D from their repo (the ScanNet200 checkpoint).
  2. In our implementation, object localization accuracy is around 56% on ScanQA when using the ground-truth masks.

Best, Ziyu

dingjiansw101 commented 8 months ago

Hi Ziyu,

Thanks for your reply. I still have a few questions, since I am new to this area.

  1. What does "using the ground-truth mask" mean? Did you use the ground-truth masks only for evaluation, or did you feed them to the model and only predict the localization scores? Is the 56% under the Acc@0.25 metric?
  2. Have you included the evaluation code for object localization in the repo?
  3. An additional question: the ScanQA paper found that the object localization loss helps the QA task. Did you have a similar finding?
  4. It seems that you used the 607 raw categories from ScanNetV2, whereas the ScanQA paper used 18 categories. I am confused about this. Are the 18 categories merged from the 607? And does the category difference influence QA performance?

Best, Jian Ding

zhuziyu-edward commented 8 months ago

Hi

  1. It means setting pc_type to "gt" and evaluating localization accuracy regardless of IoU (so it is not an IoU-thresholded Acc@0.25).
  2. Yes, it is in the "eval_qa" function, as the "og_acc" metric; a sketch of the computation follows this list.
  3. We did not conduct rigorous experiments to study the effect of this localization loss. It is commonly used in ScanQA to prevent overfitting, so we follow that setting and include the loss.
  4. The raw 607 classes are the full set of ScanNet semantics; they can be merged into 18 categories (ScanNet20 with wall and floor removed) or 200 categories (ScanNet200).
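
For concreteness, here is a minimal sketch of an IoU-free localization accuracy in the spirit of "og_acc" (an illustration, not the repo's actual eval_qa implementation):

```python
import torch

def og_accuracy(obj_logits: torch.Tensor, target_obj_ids: torch.Tensor) -> float:
    """IoU-free localization accuracy over ground-truth object proposals.

    obj_logits:     (batch, num_objects) localization scores per gt object.
    target_obj_ids: (batch,) index of the referred object in each scene.
    """
    pred = obj_logits.argmax(dim=-1)  # the object the model points at
    return (pred == target_obj_ids).float().mean().item()
```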
Best, Ziyu