Questions on paper - Githubissues

Hey, I have some small questions on the paper:

Line 22 of Algorithm 1: So am I assuming this right, at this point, there will typically be 1 point in each bin, so for 10 bins, mu^*_delta = 1/10 * (-0.1*log(0.1) - 0.1*log(0.1)...) = 0.230259 ?
Line 25 of Algorithm 1: Here, you are calling Hbar_Delta as a function with 2 parameters. Equation 34 defines it as a function with 1 parameter. I am confused as to how the term required in Algorithm 1 is being computed. How does mu^*_delta play a role here? Maybe, what you mean is to take equation 34, and use the already computed mu^*_delta as a denominator? Am I seeing this right, computing this Hbar term only depends on how the number of points within each bin - nothing else?
Equation 37: "The primary and secondary keys of sampling from a bin k are..." What are the "primary and secondary keys of sampling" and how do they play a role in the algorithm? I cannot see this part being referenced in the algorithm or anywhere else in the paper?
What gradient flow value do you use for points that do not have any neighbors within the radius lambda_p? Do you discard all of those points or always select those points?
Computing the neighborhood for each point can be quite expensive in my opinion. Am I seeing this right, that you were able to keep this part fast, just because the sensor data that you used was almost sorted (like Ouster)? Say you had truly unsorted point clouds, then this part would be quite expensive, wouldn't it?
Do you apply this downsampling before or after deskewing scans?
Lastly: The results you seem to be achieving are quite impressive. I am trying to get some intuition: How does your gradient flow heuristic not really depend on where the points are or how things are oriented and still work robustly? In fact, you are not even using the delta_p vector for the most part, but its norm, which is even more ambiguous. For example Deschaud (see https://arxiv.org/pdf/1802.08633.pdf ) suggested Heuristics that tried to achieve a "similar" useful downsampling of points. His metrics intuitively make more sense to be, because they try to avoid degeneracies... but it seems like your heuristic is simpler, and achieves better(?) results. Have you compared to his heuristics? Can you give me some intuition why this works?

Thank you!

Hey Frank, thanks for these very good questions!

You've got it right from what is in the preprint, but it seems we've made an error in Eq. (38) and on L22. mu^*_delta represents the maximum entropy rate by Eq. (35), but that is necessarily not given at K points/bins, as wrongly stated in the preprint. Example: see 3:32 where max rate is at 3 pts, not K. Instead, we can be sure that mu^*_delta lies in the interval <1, K>. Implementation-wise, we look for mu^*_delta inside the L19 for cycle. I'll update Eq. (38) and line 22 in the revision.
L25 and Eq. (34) are the same. I think that on L25, we tried to emphasize that mu^*_delta is computed just once and used as an inverse normalization factor in Eq. (34). This might be a bit confusing and L25 and Eq. (34) should use the same notation (will update, thanks!). The information flow is: distribution of points in the histogram -> discrete PDF -> set entropy rate (Eq. (32)) -> normalized set entropy rate (Eq. (34)). This means that the Hbar in Eq. (34) depends on the points' distribution in the histogram (since in our case, entropy relates to the distribution).
Each bin contains all points lying in the given GFH-norm interval. When selecting a point from this bin, these keys specify the order of points for sampling from this bin. The primary key says that the points with greater GFH norm will be sampled first. The secondary key says the farther point will be sampled first out of any two points with equal GFH norm. This happens inside the pop() operation on L21 and L26.
The GFH of these points is 0. We do not discard these points; they become part of the <0, 1/K) bin. They are sampled if no non-zero GFH-norm points are left in this bin. However, in practice, almost every point has a non-zero GFH norm, which makes the no-neighbor points the lowest priority. Intuitively speaking, the inability to derive any geometrical information from the point makes it an unreliable candidate anyway, so this behavior is desired. We experimented with discarding and persisting these points (or LOAM features) quite a lot, and we've seen no added effect on the performance in any case (discard/persist).
Correct, NN search is the most expensive part of the algorithm (about 90% of the total RMS runtime). In Fig. 5b, the red bar includes the NN search --- it turns out that the comp. overhead for RMS is lower than the comp. gains in the subsequent optimization, resulting in a speed-up in total runtime. We do the search using KD trees, but KD-tree construction and all-point queries would be untractable if all points from the sensor were used (point count reaching 128x2048 @ 20 Hz). Voxelization of the point cloud at L11 before KD-tree construction helps in this regard (although with a rather small voxel size; see Table IV.) RMS does not expect the data to be ordered as the KD-tree impl. has to process unsorted points since the prior voxelization breaks any order (at least in the used impl.). In fact, the Hilti-Oxford or KITTI datasets contain unordered data. The only information about the sensor we use is on lines L6-7. If available, these parameters are used in deriving the lookup radius lambda_p in Eq. (29). If unavailable, we use lambda_p = 2v.
No deskewing is included in our experiments --- we did not want an unnecessary feature to influence the comparison. Since deskewing moves the points, it will affect the geometry-related GFH metric. It will likely have a minimal effect on RMS since GFH is a local metric. If so, RMS could be applied before deskewing. Fewer points could then speed up deskewing itself, leading to higher comp. gains in overall. But that is just a hypothesis to be tested.
Honestly, this is the first time I'm seeing this work. From a quick read, I can see that the method is similar to normal-space sampling [Rusinkiewicz and Levoy, 2001], to which we're comparing RMS. Both use the fact that a point's normal correlates with and can be projected to the DoFs --- Rusinkiewicz samples the normals directly and achieves similar results to Deschaud, who samples per DoF (might be even more expensive).
- In a sense, you can think about the GFH as a normal. In Deschaud's formulation, zero GFH relates to a plane, and the point's contribution relates to the projection of the plane normal to the DoFs. In our formulation, zero GFH relates to the same plane, but it also means that there are neighbor points that are part of the same plane. Instead of projecting the plane normal to each opt. DoF and evaluating the contribution, GFH values points on the plane's borders. This correlates well with our intuition that keeping points on the borders is better than keeping inliers since the border points encompass the same information, constrain the same or more directions, and can be better associated during an arbitrary movement. RMS maintains the full spectrum of local geom. distributions --- this ensures that most hyperplane/curved surfaces will be included to some extent (this improves robustness), even without projecting the normals. By skewing towards high-GFH points, we ensure that the potentially most valuable points will be sampled from each surface (all normal directions should remain covered). Please take a look at the image below for the biggest difference I see between these two approaches.
- I believe there are 3 factors for the improved performance: 1) the methodology, which maintains the optimum, 2) the lowered comp. latency, which provides faster (and in real-time systems, also better) convergence, and 3) stable compression rate, which reduces the number of comp. peaks and thus reduces the chance of real-time convergence faults.
- Right now, RMS is robust towards degeneracies only indirectly. I've seen it behave the best of the methods I've compared to, but it still is imperfect. The full GFH vector can relate better to the geometrical symmetricities (and vice versa) than the norm, but this remains part of future work.
- We've tried sampling, which would maximize the cumulative GFH norm. Although this approach could be fine-tuned for better performance, it was too sensitive to omnipresent noise and inaccuracies. RMS performs better by keeping points of various local distributions while skewing towards the ones potentially constraining the most DoFs (the border points).
- We have not compared it to Deschaud's method yet. I might try it to evaluate during the holidays. I will post the results here if I find time to do it.

Thanks for reading the paper. You've helped in discovering some paper bugs :) Let me know if something is still unclear, I'll be happy to answer.

Hey Pavel,

thank you for your detailed reply. Also happy new year to you! Sorry, I haven't been able to reply earlier due to the holidays. Hope you enjoyed yours as well.

You've got it right from what is in the preprint [...]

Thank you for the clarification, however, I still do not quite understand how to compute it in practice. Eq. 35 is confusing for me. It iterates over all non-empty subsets of Omega. So given 10 bins, it would iterate over: {1}, {1,2}, bin {1,2,3},...,{2},{2,3},{2,3,4}, ... Is that correct? And still assuming there is at least one point within each bin of H_Delta, wouldn't this always result in the exact same numeric value? Can you provide some example pseudo code on how to compute this?

Each bin contains all points lying in the given GFH-norm interval. [...]

Technically, each individual point within a bin has an individual floating point GFH value, so the chances of two points having exactly the same GFH values are basically zero. For example, in bin 0.1-0.2, there could be points with GFH [0.123, 0.1230001, 0.15, 0.18]. Two values might be really similar, but never identical. Thus, it seems that a secondary key would never matter here? Maybe you assume that all points within a common bin have approximately the same GFH value, and thus only consider the secondary key for the pop operation within a bin?

Correct, NN search is the most expensive part of the algorithm [...]

Alright. Now I also understand the sensor-specific computation of lambda_p... Basically, you use knowledge about the point distribution to make a more informed NN query to the KD tree.

The GFH of these points is 0. We do not discard these points; [...]

Ok. I could imagine them to be potentially valuable for very low resolution (far away) areas... But of course, a large number of noise points satisfies the same criterion, so maybe that counteracts potential positive effects...

No deskewing is included in our experiments --- we did not want an unnecessary feature to influence the comparison. [...]

Well "unnecessary" - most likely, yes, I agree ;-) But if you think about it more like in Deschauds approach for a second, where some certain points reduce uncertainty in certain DoF, then I could imagine that the question of doing this before or after (rough) deskewing could significantly(?) influence the estimated normal of a point, and the algorithm's perception of which DoF this point helps with. I think this could be the case for extremely aggressive motion only, and I am not sure if the same argument applies to your algorithm, since your algorithm is more local as you said, and basically throws away this "directionality" that a normal would have.

Honestly, this is the first time I'm seeing this work [...]

Thank you for your explanations. Indeed it would be interesting to see a comparison.

As for the intuition - I think what I was confused about is that you do not use the GFH vector that you mentioned, but only the norm. I simply guess that currently, with the use of the norm, enough points are being sampled such that enough DoF of the problem are covered well. This causes it to work even though there is no notion of directionality or location of the point in the sampling.

I believe there are 3 factors for the improved performance: 1) the methodology, which maintains the optimum, 2) the lowered comp. latency, which provides faster (and in real-time systems, also better) convergence, and 3) stable compression rate, which reduces the number of comp. peaks and thus reduces the chance of real-time convergence faults.

Just to make sure, because you mention "real-time convergence faults"... when computing APE, RPE etc and comparing sampling strategies... Do you give the algorithms a fixed compute budget or something? No, right? Then what is a real-time convergence fault for you? So say that runtime does not matter, you would still remain the lead for most datasets in the V40+ LOAM+KISS-ICP columns of Table III?

Thank you for your answers Frank

Hey Frank, Happy New Year to you, too :)

Thank you for the clarification, however, I still do not quite understand how to compute it in practice. ...

The mathematical notation in Eq. 35 might be confusing, but in practice, the max rate is computed once all the bins are traversed exactly once. What the Alg does in practice is

H_old, H_new = 1D histogram, 1D histogram
rates = []
for i in range(K, 0, -1):
   H_old.bins[i].sort() # on ordering later
   pt = H_old.bins[i].pop()
   if pt:
      H_new.bins[i].insert(pt)
   rates.append( rate of H_new (Eq. 32) )
max rate = max(rates)

We know that the maximum entr. rate of H_new is given in the <1, K> interval. For a fixed K, the max rate value is the same as you say, but only if all bins are non-empty. If there are any empty bins, it changes. Since we want to normalize to <0, 1> with at least one distribution having norm. rate of 1, we compute this max rate on the run, from the data. This is to condition (Eq. 31) relative to the data distribution and not to "an ideal case." Example: Note 12 1 2024 at 121659 jpeg(1)

Technically, each individual point within a bin has an individual floating point GFH value, so the chances of two points having exactly the same GFH values are basically zero. ...

Yes, in practice, the secondary key rarely triggers. Due to inaccuracies/errors and roughness of the surfaces, two GFH values are seldom identical. Nevertheless, it can happen, and the secondary key is there for completeness. Within a single bin, the points have differing GFH values - no artificial discretization happens.

Ok. I could imagine them to be potentially valuable for very low resolution (far away) areas. ...

Yeah, you're right. Also, the faraway points are useful for constraining rotations. But only if they are accurate, which is usually not the case in projective devices (3D LiDARs), where error increases with distance from the origin. From what I've seen, the points where some local geometry can be deduced (i.e., the points with neighbors) are more important in the task. And yeah, on super noisy data (e.g., large dust clouds), the method will most probably not function. But that is the case for all noise-unaware sampling methods, IMO.

Well "unnecessary" - most likely, yes, I agree ;-) But if you think about it more like in Deschauds approach for a second ...

I have to say that the deskewing influence remains to be tested. I come from a UAV lab where we often omit deskewing to reduce latency and to lower comp. requirements. But from what I've seen in the KITTI dataset, there are plenty of aggressive rotations (in one axis only though) where the lack of deskewing could have had an effect. And if it did, it had the same effect on ours and compared methods (btw Rusinkiewicz's work uses per-DoF sampling similar to Deschaud's).

As for the intuition - I think what I was confused about is that you do not use the GFH vector that you mentioned, but only the norm.

But your intuition is correct. If you look into X-ICP, there's a parallel between normals and the residuals used in the optimization. In X-ICP for example, the sampling respects the parallelism between residuals and the DoFs.

I figured during the experimentation that the positives (having the directions) of using the 3D vector on the input do not outweigh the negatives (slower and non-trivial multi-directional and possibly combinatorial sampling). As you say, the technique samples enough points that the DoFs are covered --- the prioritization of edges and locally rough surfaces and a slight redundancy on hyperplanes cover the desired directions. You may think of the algorithm as an approx. 3D corner detector with redundancy on lines and planes that certainly samples points on all planes and lines in the data (and thus extracts the maximum information given some point-sampling limit). And if all hyperplanes are covered, do we really need to know how these are oriented? For this method, knowing that the hyperplanes are all sampled is enough.

Just to make sure, because you mention "real-time convergence faults" ...

When saying "real-time", we mean that the estimation computes before another data frame arrives (so for 10 Hz data, the real-time limit is 100 ms). We have this limit there, but it's soft -- we give the methods an unlimited time and just measure the delay (this would be an issue in online runs for both the compared methods, which exceeded the real-time limit quite a lot). But inside the iterative algorithms (LOAM, KISS-ICP), there is still a termination criterion (e.g., no. of iterations, inner timing loop) that might terminate before convergence. And if these termination criteria are triggered and the output estimate is inaccurate, we call this a convergence fault. You may see this in 4:49 where the number of points during the fast rotation change leads to a complex opt. problem, which overwhelms the estimation, which then fails.

So say that runtime does not matter, you would still remain the lead for most datasets in the V40+ LOAM+KISS-ICP columns of Table III?

Yes.

Thanks for asking :)

Pavel

ctu-mrs / RMS

Questions on paper #1