ch3cook-fdu / Vote2Cap-DETR

[CVPR 2023] Vote2Cap-DETR and [T-PAMI 2024] Vote2Cap-DETR++; A set-to-set perspective towards 3D Dense Captioning; State-of-the-Art 3D Dense Captioning methods
MIT License

Questions about performance #19

Open jkstyle2 opened 1 month ago

jkstyle2 commented 1 month ago

Thanks for sharing your great work!

I have some questions about your paper. There are two input options: w/o 2D and w/ 2D. I initially assumed that the w/ 2D features would outperform the w/o 2D features, but that is not what the paper shows. In Table 1 of Vote2Cap-DETR++, the w/o 2D setting scores better on some metrics such as B-4, M, and R. How is that possible, and why should we use the multi-view features if they do not improve performance and are also harder to extract?

[screenshot: Table 1 from Vote2Cap-DETR++]

In addition, 3DETR is used as the encoder/decoder of your model. Since 3DETR does not perform as well on 3D detection benchmarks like ScanNet as other non-transformer architectures, can I substitute the encoder/decoder with other models? Would it perform well? For instance, the recently released V-DETR detector builds on 3DETR, so it could be another option for improving your model's performance.

ch3cook-fdu commented 1 month ago
  1. The multi-view features are introduced to substitute for the per-point color information, not to be added on top of it, so it is not an incremental change (see the sketch below). We noticed that introducing multi-view features does lead to improved performance when switching to the VoteNet backbone.
  2. Better backbones should lead to better performance; e.g., V-DETR can take more points as input and therefore preserves more information for downstream tasks.
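To make point 1 concrete, here is a rough sketch of what "substitute" means for the per-point input features. The array names and the 128-dim multi-view feature size are assumptions for illustration only, not the repository's exact code:

```python
import numpy as np

# Illustration only: shapes and names are assumptions, not the repository's exact code.
num_points = 40000
xyz = np.random.rand(num_points, 3).astype(np.float32)          # point coordinates
rgb = np.random.rand(num_points, 3).astype(np.float32)          # per-point color
multiview = np.random.rand(num_points, 128).astype(np.float32)  # pre-extracted 2D (multi-view) features

# "w/o 2D": geometry + color
features_wo_2d = np.concatenate([xyz, rgb], axis=1)        # (N, 6)

# "w/ 2D": geometry + multi-view features REPLACING color, not added on top of it
features_w_2d = np.concatenate([xyz, multiview], axis=1)   # (N, 131)
```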
jkstyle2 commented 1 month ago

Can you point me to where in the paper you observed the improved performance with VoteNet? Also, in terms of detection performance, can I assume it has mAP similar to 3DETR, since it is based on 3DETR?

ch3cook-fdu commented 1 month ago

We did not include such ablations in our paper; you can reproduce the results by setting `--detector detector_votenet`.

The evaluations are not directly comparable because 1) the point clouds are axis-aligned, and 2) the category sets are different.

jkstyle2 commented 1 month ago

What does 'axis-aligned' specifically mean in this context? I thought the annotated boxes in ScanNet were already axis-aligned. Even if some alignment technique is applied, does it affect the results in terms of model evaluation?

ch3cook-fdu commented 1 month ago

Please compare https://github.com/facebookresearch/votenet/blob/main/scannet/scannet_detection_dataset.py#L77-L81 with https://github.com/ch3cook-fdu/Vote2Cap-DETR/blob/master/datasets/scannet.py#L205-L212
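For reference, the axis-alignment step usually looks roughly like the following. This is a sketch of the common ScanNet preprocessing pattern, not code copied from either file; whether this transform is applied to the loaded points is what differs between the two loaders:

```python
import numpy as np

def align_point_cloud(mesh_vertices: np.ndarray, axis_align_matrix: np.ndarray) -> np.ndarray:
    """Apply the 4x4 axis-alignment matrix from a scene's .txt meta file to the
    xyz coordinates of its point cloud (sketch of the common ScanNet pattern)."""
    pts = np.ones((mesh_vertices.shape[0], 4), dtype=np.float64)
    pts[:, :3] = mesh_vertices[:, :3]
    aligned = pts @ axis_align_matrix.T      # rotate/translate into the axis-aligned frame
    out = mesh_vertices.copy()
    out[:, :3] = aligned[:, :3]
    return out
```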

jkstyle2 commented 1 month ago

Well, I checked the code, and it seems to simply align the point clouds to a reference axis. I would not expect that to affect the overall detection results, so I am unsure why it is listed as 1) the point clouds are axis-aligned.

Also, I found that the preprocessing script 'batch_load_scannet_data' from VoteNet is now outmoded among SOTA detectors, since it subsamples each point cloud to 50k points. As reported here, there was a significant mAP difference depending on whether this sampling was used. I wonder whether it affects your model as well, since you are building on the original code.
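For context, the sampling I am referring to is roughly the following; this is a sketch of the typical VoteNet-style helper, not the exact code:

```python
import numpy as np

def random_sampling(points: np.ndarray, num_samples: int = 50000) -> np.ndarray:
    """Subsample (or, for small scenes, resample with replacement) a point cloud
    to a fixed point budget, as done in VoteNet-style preprocessing. Sketch only."""
    replace = points.shape[0] < num_samples
    choice = np.random.choice(points.shape[0], num_samples, replace=replace)
    return points[choice]
```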

ch3cook-fdu commented 1 month ago
  1. Justification of "the point clouds are axis-aligned": as long as the input data differ, the numerical comparisons are considered unfair.
  2. Given more scene information, the model might achieve better performance, and you are free to try different hyperparameter settings. However, it is not guaranteed, since it is not the focus of our research.
jkstyle2 commented 1 month ago

I have some questions regarding evaluation results on the ScanRefer dataset, shown below (from scanrefer_scst_vote2cap_detr_pp_XYZ_RGB_NORMAL.pth on scene0568_00).

[screenshot: predicted boxes and captions on scene0568_00]

  1. Absurdly large 3D boxes often appear. Their captions are: "the pillow is on the right. it is to the right of the couch" and "there is a rectangular couch. it is to the left of the room". These objects are unlikely to be that big in the real world. I thought the detector estimated sizes with average object sizes taken into account. Can you tell me how this happens and how it can be solved?

[screenshot: a predicted box and its caption]

  2. The result above is "this is a black pillow. it is on the right side of the couch". I wonder how the model can fail to detect the object's color when the color information is plainly visible. For better accuracy, should I raise some confidence threshold?

Thanks for your help in advance!

ch3cook-fdu commented 1 month ago
  1. Since there are some L-shaped couches / desks in the dataset, the model may predict large bounding boxes. You can check the predicted logits for those boxes (see the sketch below).
  2. This is also known as hallucination in language models. You might ease it with reinforcement learning techniques that use special reward designs penalizing wrong attribute predictions.
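A rough sketch of what "check the predicted logits" could look like; the shapes and names below are hypothetical placeholders for whatever the model's forward pass actually returns:

```python
import torch

# Hypothetical shapes/names for illustration; in practice these come from the model output.
num_proposals, num_semcls = 256, 18
sem_cls_logits = torch.randn(num_proposals, num_semcls + 1)  # last class = "no object"
box_sizes = torch.rand(num_proposals, 3) * 5.0               # predicted (dx, dy, dz) in meters

sem_cls_prob = torch.softmax(sem_cls_logits, dim=-1)
objectness = 1.0 - sem_cls_prob[:, -1]                       # confidence the proposal is a real object
volume = box_sizes.prod(dim=-1)                              # box volume in m^3

# Oversized, low-confidence proposals are the ones worth inspecting or filtering out.
suspicious = (volume > 10.0) & (objectness < 0.5)
print(suspicious.nonzero(as_tuple=True)[0])
```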
jkstyle2 commented 1 month ago
  1. Well, I understand there could be large L-shaped couches or desks in the real world, but these still seem too big to me. One of the large boxes even refers to a pillow (the box looks around 10 m x 10 m). I can hardly see how that happens.
  2. I see. Is there a confidence or threshold value with which we can suppress those predictions? I wonder if there is something like an objectness or semantic score for the captioning task.
ch3cook-fdu commented 1 month ago

You may try setting a hard threshold instead of the argmax operation in these lines: https://github.com/ch3cook-fdu/Vote2Cap-DETR/blob/master/engine.py#L390-L392.
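For example, something along these lines; this is only a sketch, and the actual variable names around engine.py#L390-L392 will differ:

```python
import torch

# Sketch only: `cls_prob` stands in for the per-proposal class probabilities
# computed around engine.py#L390-L392; adapt to the real variable names.
num_proposals, num_semcls = 256, 18
cls_prob = torch.softmax(torch.randn(num_proposals, num_semcls + 1), dim=-1)

objectness = 1.0 - cls_prob[:, -1]   # probability that the proposal is a real object
keep = objectness > 0.75             # hard threshold, instead of keeping every argmax winner
```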

jkstyle2 commented 1 month ago

What I mean is that I want to filter the bounding boxes based on a threshold related to the captioning task, not just the detection task. The lines you shared filter based on detection results. The overall 3D detection results already look reasonable, so I would like to control the captioning side instead. Is that possible with your method?
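To make it concrete, what I have in mind is something like a caption-level confidence, e.g. the mean log-probability of the generated tokens. This is a toy sketch with made-up values, assuming per-token log-probabilities can be collected from the caption head during decoding:

```python
import torch

# Toy sketch: per-token log-probabilities of one generated caption (made-up values).
# In practice these would be collected from the caption head during greedy/beam decoding.
token_logprobs = torch.tensor([-0.2, -0.8, -0.1, -1.5, -0.3])

caption_score = token_logprobs.mean().exp()  # average per-token probability as a confidence
keep_caption = caption_score > 0.5           # drop captions the model itself is unsure about
```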