(1) Thank you for this thought-provoking question, which we had actually overlooked. Here are my thoughts on this issue for discussion:
First, we observed no obvious difference between these two settings in our experiments, i.e., without normalization vs. with normalization. To be honest, the latter seems to work better in RS-CNN...
Second, revisiting classic image convolution: a small conv kernel learns from only an image patch and is shared along the spatial dimension. With this mechanism, image convolution can learn abstract semantic information from local to global, in other words, from patches to the whole image. Therefore, I think RS-CNN works in the same way, except that the semantics are learned with the points' relations encoded explicitly.
Third, I speculate that the reason DGCNN works better with "absolute coordinates" added is its use of EdgeConv, in which both x_i and (x_j - x_i) live in feature space rather than 3D space (except in the 1st layer). In other words, different settings may work better in different networks. That said, we also expect that "absolute coordinates" have the potential to improve RS-CNN, which needs further exploration.
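For clarity, EdgeConv builds its per-edge input by concatenating the center feature x_i with the offset (x_j - x_i). A minimal PyTorch-style sketch of this step (tensor names and shapes are my own, not taken from the DGCNN code):

```python
import torch

def edge_features(x, knn_idx):
    """Build EdgeConv inputs [x_i, x_j - x_i] for every neighbor pair.

    x:       (N, C) per-point features (3D coordinates only in the 1st layer)
    knn_idx: (N, K) indices of the K nearest neighbors of each point
    returns: (N, K, 2C) edge features fed to the shared MLP
    """
    x_i = x.unsqueeze(1).expand(-1, knn_idx.shape[1], -1)  # (N, K, C)
    x_j = x[knn_idx]                                       # (N, K, C)
    return torch.cat([x_i, x_j - x_i], dim=-1)             # (N, K, 2C)
```

From the 2nd layer on, x here is a learned feature, so the "absolute" part x_i no longer carries raw 3D coordinates.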
Question (2) will be answered tomorrow.
(2) Note that the robustness to rotation we claim in the paper holds in the high-level relation mapping M(h) in Eq. (2), when a suitable h is defined. However, the initial input features of 3D coordinates are still affected by rotation.
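For readers without the paper at hand, the mapping can be paraphrased as follows, with w_ij denoting the convolutional weights produced from the predefined low-level relation h_ij by a shared MLP M, A an aggregation such as max pooling, and σ a nonlinearity (this is my summary of the idea, not the verbatim equation):

```latex
w_{ij} = \mathcal{M}(h_{ij}), \qquad
f(x_i) = \sigma\Big(\mathcal{A}\big(\{\, w_{ij} \cdot f_{x_j} : x_j \in \mathcal{N}(x_i) \,\}\big)\Big)
```

When h is itself rotation invariant (e.g., the Euclidean distance), the weights w_ij do not change under rotation, which is where the claimed robustness lives.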
We address this issue by normalizing (including rotating) each sampled point subset into a corresponding local coordinate system, which is determined by each sampled point and its normal. In this way, all the local point subsets are expressed in their own local coordinate systems, which yields invariance to rotation. Note, however, that the accuracy drops a lot, because this forcible normalization can make shape recognition more difficult.
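As a concrete illustration, below is one possible way to build such a local frame from a sampled point and its normal. Note that this construction is only an assumption for illustration (it fixes the tangent axes with an arbitrary helper vector), since a point and its normal determine the frame only up to a rotation about the normal:

```python
import numpy as np

def normalize_to_local_frame(points, center, normal):
    """Express a sampled point subset in a local frame anchored at `center`.

    The z-axis is the unit normal; the two tangent axes are obtained by
    Gram-Schmidt against a helper vector, so they are not uniquely defined.
    """
    z = normal / np.linalg.norm(normal)
    helper = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(helper, z)) > 0.9:        # helper nearly parallel to normal
        helper = np.array([0.0, 1.0, 0.0])
    x = helper - np.dot(helper, z) * z      # project helper onto tangent plane
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z], axis=0)         # rows are the new axes
    return (points - center) @ R.T          # rotate + translate into the frame
```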
We have clarified this point in the paragraph titled "Robustness to point permutation and rigid transformation" on Page 8.
Thank you for your kind explanation.
(1) This is interesting, and I would like to see whether such normalization w.r.t. a local centroid affects the performance much. Besides, I wonder whether RS-CNN would still perform well on the scene parsing task, where the spatial relations among different scene components play a critical role in scene understanding. This could be interesting future work.
(2) According to the experimental details you provided, it seems that RS-CNN itself cannot achieve intrinsic rotation invariance, due to the input 3D coordinates. Regarding your solution to this issue, I am worried that if normal vectors are unavailable, such normalization could be difficult.
Interestingly, from the comparison with PointNet++, I found that the decreased performance of PointNet++ w.r.t. the rotation angle is also caused by its use of absolute coordinates (i.e., the FPS-sampled centroid points, which are input to the MLP).
Recently I found a work that attempts to solve rotation invariance based on spherical convolution: "PRIN: Pointwise Rotation-Invariant Networks". However, the method proposed there does not perform well when the input shapes are aligned to a canonical position.
(1) In my opinion, RS-CNN could do better at high-level shape identification from an object represented by irregular points than at whole-scene parsing. The reason is that I think low-level geometric relations are more suitable for capturing the spatial layout of 3D points than for capturing the underlying dependencies among scene components. Of course, I agree with you that the relation learning method (RS-Conv) proposed in this paper could provide inspiration for scene parsing :D
(2) Yes. The robustness to rotation we claim lies in the relation learning, and it can be achieved only when the predefined relation is rotation invariant.
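As a quick sanity check that a distance-based relation satisfies this requirement (the rotation below is just an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((16, 3))       # a toy neighborhood

Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q *= np.sign(np.linalg.det(Q))              # ensure a proper rotation (det = +1)

d_before = np.linalg.norm(points - points[0], axis=1)
rotated = points @ Q.T
d_after = np.linalg.norm(rotated - rotated[0], axis=1)

# Distances to the center point are unchanged, so a distance-based relation h
# is rotation invariant, while the raw coordinates obviously are not.
assert np.allclose(d_before, d_after)
```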
Intrinsic rotation invariance is a promising and challenging problem. Most existing methods achieve robustness rather than invariance. I speculate that invariance could be achieved by some mapping that is information-lossless. Besides PRIN, which you mentioned, a friend of mine did a work called "Spherical Fractal Convolutional Neural Networks for Point Cloud Recognition", which is also an excellent exploration.
I agree that intrinsic rotation invariance is a promising and challenging problem. In fact, I don't quite understand why some so-called expert reviewers (for CVPR/NIPS, etc.) consider achieving rotation invariance to be an easy and trivial task...
I found your friend's work quite interesting. Thank you for such a wonderful reference.
Thanks for your inspiring work and the explanation of the rotation invariance property. However, I'm wondering how the local coordinate system can be defined using only the sampled point and its normal. One naive way I can imagine is to use the sampled point as (0, 0, 0) and the normal as one coordinate axis. However, that is still not enough to fix a full coordinate system. Could you please give more details of the approach? Thank you so much!
Thank you for your excellent work. It is very interesting.
After reading your paper, I have a few questions about your network design that leave me quite confused:
(1) On page 5, first line of the left column, you write: "they are normalized to take the centroid as the origin." Do you mean that you simply discard all the absolute coordinates of the points? If so, this would transform each point subset into a local patch. However, this should lead to inferior performance, as pointed out in "Dynamic Graph CNN for Learning on Point Clouds" (page 5, third paragraph of the right column): "Note that such a choice encodes only local information, essentially treating the shape as a collection of small patches and losing the global shape structure." In addition, I observe that when using only the Euclidean distance as the relation h (so no absolute coordinate information is used in the network), the classification accuracy of RS-CNN still reaches 92.5, which is quite impressive. I wonder whether this contradicts the statement made in "Dynamic Graph CNN for Learning on Point Clouds"? (A sketch of the setting I have in mind follows below.)
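To make sure we are talking about the same setting, here is what I have in mind (my own sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
subset = rng.standard_normal((32, 3))  # one sampled point subset, absolute coords
x_i = subset[0]                        # the sampled centroid point

# My reading of "normalized to take the centroid as the origin":
local = subset - x_i                   # absolute positions are discarded

# Distance-only relation h: even the local directions are dropped.
h = np.linalg.norm(local, axis=1)      # (32,) Euclidean distances to x_i
```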
(2) I observe that RS-CNN has one important property: rotation invariance. But if the absolute coordinates of points are input to the network as initial features, I think RS-CNN cannot achieve intrinsic rotation invariance. Am I missing something important?
Thanks a lot.