jkstyle2 opened this issue 1 month ago
Can you point me to where the paper reports the improved performance with VoteNet? Also, in terms of detection performance, can I assume it achieves mAP similar to 3DETR, given that it is based on 3DETR?
We did not include such ablations in our paper; you can reproduce the results by setting `--detector detector_votenet`.
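For reference, a hypothetical invocation could look like the snippet below; only the `--detector detector_votenet` flag is confirmed here, so the entry script name and any remaining flags should be checked against your checkout:

```bash
# hypothetical invocation; only --detector detector_votenet is confirmed above,
# the entry script name and other flags depend on your checkout
python main.py --detector detector_votenet
```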
The evaluations are different because 1) the point clouds are axis-aligned, and 2) the categories are different.
What specifically does 'axis-aligned' mean in this context? I thought the annotated boxes in ScanNet are already axis-aligned. Even though some alignment techniques are applied, do they affect any of the results in terms of model evaluation?
Well, I checked the code, and it seems to literally align the point clouds to a reference axis. This should not affect the overall detection results, so I'm unsure why it is mentioned in 1) 'the point clouds are axis-aligned'.
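Concretely, the preprocessing appears to do something like the sketch below: read the per-scene `axisAlignment` matrix from the ScanNet meta file and apply it to the vertices. This follows my reading of VoteNet-style code, so treat the file-format details as assumptions:

```python
import numpy as np

def align_point_cloud(points, meta_file):
    """Apply ScanNet's per-scene axisAlignment matrix to an (N, 3) point cloud.

    The meta file (<scene_id>.txt) is assumed to contain a line like
    'axisAlignment = m00 m01 ... m33' holding a row-major 4x4 matrix.
    """
    axis_align_matrix = np.eye(4)
    with open(meta_file) as f:
        for line in f:
            if line.startswith('axisAlignment'):
                values = [float(v) for v in line.strip().split('=')[1].split()]
                axis_align_matrix = np.array(values).reshape(4, 4)
                break

    # Lift to homogeneous coordinates, then rotate/translate into the aligned frame.
    pts_h = np.ones((points.shape[0], 4))
    pts_h[:, :3] = points[:, :3]
    return (pts_h @ axis_align_matrix.T)[:, :3]
```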
Also, I found that the pre-processing script `batch_load_scannet_data` from VoteNet is now outmoded among SOTA detectors, as it limits each point cloud to 50k points. As reported here, there is a significant mAP gap between using this sampling and not using it. I wonder if it affects your model as well, since you are using the original base code.
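For readers following along, the sampling in question is essentially a random draw down to a fixed point budget; a minimal sketch of the idea, not the exact repo code:

```python
import numpy as np

def random_sampling(points, num_points=50000):
    """Randomly subsample (or pad by resampling) a point cloud to a fixed size.

    Scenes with more than `num_points` points lose geometry here, which is
    what the reported mAP gap is about.
    """
    n = points.shape[0]
    replace = n < num_points  # only sample with replacement when the scene is too small
    choices = np.random.choice(n, num_points, replace=replace)
    return points[choices]
```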
> the point clouds are axis-aligned

As long as the input data is different, the numerical comparisons are always considered unfair.

I have some questions regarding the evaluation results on the ScanRefer dataset, as below (from `scanrefer_scst_vote2cap_detr_pp_XYZ_RGB_NORMAL.pth` on scene0568_00).
Thanks in advance for your help!
You may try setting a hard threshold rather than using the argmax operation in these lines: https://github.com/ch3cook-fdu/Vote2Cap-DETR/blob/master/engine.py#L390-L392.
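The idea would be something like the sketch below: instead of selecting proposals by argmax, keep every proposal whose best class probability clears a fixed cut-off. Variable names are illustrative assumptions, not the exact ones in engine.py:

```python
import torch

def select_boxes_by_threshold(sem_cls_logits, threshold=0.5):
    """Keep proposals whose best class probability exceeds a hard threshold.

    sem_cls_logits: (batch, num_proposals, num_classes) raw logits.
    Returns a boolean keep mask and the per-proposal class, replacing the
    usual argmax-only selection.
    """
    probs = torch.softmax(sem_cls_logits, dim=-1)  # per-class probabilities
    best_prob, best_cls = probs.max(dim=-1)        # what plain argmax selection uses
    keep = best_prob > threshold                   # hard confidence cut-off
    return keep, best_cls
```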
I mean that I want to filter the bounding boxes based on a threshold related to the captioning task, not just the detection task. The lines you shared seem to filter based on detection results. The overall 3D detection results can be considered good, so I would like to control the captioning task instead. Is that possible in your method?
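For instance, what I have in mind is scoring each generated caption by its mean token log-probability and thresholding on that; this is a hypothetical sketch, not an API the repo exposes:

```python
import torch

def caption_confidence(token_logprobs, token_mask):
    """Score each box's generated caption by its mean per-token log-probability.

    token_logprobs: (num_boxes, max_len) log-probs of the sampled tokens.
    token_mask: (num_boxes, max_len), 1 for real tokens, 0 for padding.
    """
    lengths = token_mask.sum(dim=-1).clamp(min=1)
    return (token_logprobs * token_mask).sum(dim=-1) / lengths

# usage sketch: keep only boxes whose caption the model is confident about
# scores = caption_confidence(logprobs, mask)
# keep = scores > -1.5  # threshold is arbitrary; tune on a validation split
```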
Thanks for sharing your great work!
I have some questions about your paper. There are two options for your inputs: w/o 2D and w/ 2D. I initially thought that the w/ 2D features would outperform the w/o 2D features, but that is not the case in your paper. In Table 1 of Vote2Cap-DETR++, some metrics such as B-4, M, and R are higher w/o 2D than w/ 2D. How is this possible, and why should we use these multiview features if they are not effective for performance and are also hard to extract?
![image](https://github.com/ch3cook-fdu/Vote2Cap-DETR/assets/34189274/7bce95ed-3bef-4e3e-a65f-8f9594e49a0e)
In addition, 3DETR is used as the encoder/decoder for your model. Since 3DETR does not perform as well on 3D detection benchmarks like ScanNet as other non-transformer-based architectures, can I substitute other models for the encoder/decoder, and would that perform well? For instance, the recently released V-DETR detector is based on 3DETR, so it could be another option for improving your model's performance.