USIP results in Figure 3

rui2016 commented 4 years ago

Hi, can you explain the performance difference of USIP on KITTI in Fig 3 of your paper (15%-20%) vs. that in Fig 4 of the USIP paper (30%-60%)? The only hint I can find in your paper is

since USIP and D3Feat use different processing and splitting strategies and USIP requires surface normal and curvature as input, the results are not directly comparable.

However, to me this doesn't fully answer the question. Also as I remember (not 100% sure), USIP doesn't need normal and curvature.

Besides, probably as a drawback of their approach, the performance of USIP is sensible to the two parameters M (number of samples) and K (number of nearest neighbors). I assume this is especially the case if you take their pre-trained model and apply it to another dataset like 3DMatch. Did you try to select proper M, K values for a fair comparison in the first plot of Fig 3?

XuyangBai commented 4 years ago

Hi @rui2016 Thanks for your interest. I think USIP did use the surface normal and curvature for KITTI (You can see here or here). As you say, the two hyperparameters M and K are critical for the performance of USIP, but I use the pre-trained weight on Oxford released by the author (which I get similar result of USIP on their code) and test under my setting, so the value of M and K should not be the reason in this case.

Let me explain the detail on USIP and D3Feat use different processing and splitting strategies. During our experiments, we find the group truth poses of KITTI to be noisy so we first use the ICP to refine the pose. After this refinement process, the testing pairs and their poses are different from USIP. Also we use sequence 0-5 for training, 7-8 for validation and 9-11 for testing while USIP are tested on all 11 sequences. The result of USIP in my paper is got by running USIP model on our dataset, that's why I said the results are not directly comparable. And I also find in my experiment, the result of USIP on KITTI is quite weird as the repeatability on #4 keypts is 15% already, but it doesn't increase as the keypoints number becomes large. I have also tried different values of NMS radius, so it's hard for me to explain it now.

rui2016 commented 4 years ago

Hi @XuyangBai, thanks for the explanation.

Normal and curvature: I think you are right. I didn't check their code in detail but from there it seems they indeed use the normal and curvature. It is a bit misleading that the authors don't mention this explicitly in the paper (correct me if I'm wrong). They only mention that the point-to-point loss (Eq 8) is used by default, instead of the point-to-plane loss (Eq 9), which gives me the impression that only xyz are needed.
M and K: When I talked about these parameters above, I was mainly referring to your result on 3DMatch, not on KITTI. From my understanding, these two parameters, together with the size of the local point cloud, roughly define the receptive field of each keypoint. If one takes the pre-trained model and apply it to a new dataset, special care needs to be taken to make sure the receptive field is comparable to that during the training. In other words, one may need to experiment with different point cloud sizes, M and K to get good performance. In the USIP paper, one of the three models is trained on 3DMatch, but the authors don't use 3DMatch for testing, which makes it not so clear what are the proper values for M, K and point cloud size for 3DMatch. So for your paper, including some details on these parameters for the first plot in Fig 3 would be helpful.
USIP on KITTI: If the only difference is testing on sequence 9-11 vs. testing on all the 11 sequences, the performance gap should not be so large (15%-20% vs. 30%-60%). When you tested USIP on 9-11, did you use normal and curvature? Since you used their pretrained model which requires them as input, I guess a proper setting is also to provide them in your experiment. All in all, since the same pre-trained model and dataset were used, convincing discussions need to be provided regarding the gap between the first plot in Fig 4 in the USIP paper and the first plot in Fig 3 in your paper.

XuyangBai commented 4 years ago

Hi @rui2016

I think the author of USIP didn't mention the normal and curvature explicitly in their paper. I found it when I am trying to run their code on my processed dataset.
I agree with you about the effect of M and K but I am not sure whether I understand your idea correctly. From their code, USIP uses the same values of node_num(M) and For node_knn_k_1(K) for training and testing. For RGB-D setting, I am also not sure why the author of USIP trained their model on 3DMatch dataset while tested on Redwood, but I think the training setting (M & K) they use should be the proper value for 3DMatch (I don't understand why we should assume the M, K for training on 3DMatch are not the proper value), so I am just using their setting for testing on 3DMatch. About the input point cloud size, I am using the voxel downsample of 0.03m grid size, which results in about 13000 points for each point cloud fragment in 3DMatch on average, while USIP uses random sampling of 10240 points. Using different sampling schemes for training and testing might not be a good idea to compare the performance of USIP and ours.
I am sorry that I didn't provide all the details about this experiment setting. When I tested USIP on sequence 9-11 of KITTI dataset, I use the same setting of the original paper (weight pre-trained on Oxford dataset, curvature and surface normal calculated by code here, as well as the NMS radius) but on my point cloud fragment pairs and their poses. I also tried this setting on the data provided by USIP and got similar results of their paper. So I ascribe this result to the different data D3Feat and USIP were training on (changed by ICP refinement process). I also didn't have a quite convincing explanation about this so I just said the results are not directly comparable. I will spend more time on this and maybe give a better explanation later.
I have also asked the author of USIP about their performance on KITTI and RGB-D dataset. From both their results and my experiments, USIP is very good at processing outdoor scenes than RGB-D dataset, it is impressive that USIP achieves more than 30% reliability even only 4 keypoints are selected. In my view, outdoor scenes are much harder due to the sparsity variations, larger range, occlusion, etc. But we still didn't have a good understanding of that, any explanation are appreciated.

Thank you again for your valuable discussion.

Best, Xuyang

rui2016 commented 4 years ago

Hi Xuyang,

thanks for the detailed reply. I've got some ideas on most of my questions. The remaining one is still the inconsistent USIP results on KITTI.

So I ascribe this result to the different data D3Feat and USIP were training on (changed by ICP refinement process).

Please let me clarify that the confusion is not from the comparison between USIP and D3Feat, but rather the inconsistency between the USIP results on KITTI in your paper and those in the USIP paper. Although your test data is further aligned using ICP, I would assume such better aligned test data (together with the fact that difference numbers of sequences are used for testing) should not degrade USIP's repeatability so significantly (more than 50%). This is only my assumption, though.

XuyangBai / D3Feat

USIP results in Figure 3 #4