The problem may be that I "invert" the affine transform (the extrinsic matrix) by transposing it. That would work for a normal 3x3 matrix in our case (because we know all the matrices are size and shape preserving), but of course it doesn't work for an affine transform: there I have to use another vector as the fourth column (and clear the fourth row).
Here is an explanation: http://negativeprobability.blogspot.com/2011/11/affine-transformations-and-their.html
And I found https://stackoverflow.com/questions/2624422/efficient-4x4-matrix-inverse-affine-transform which says a similar thing. Also, looking at the source code for open3d `PointCloud.create_from_depth_image`, it seems that they're indeed doing this, but they have the Eigen `Affine3d` transform, which has an inverse. I guess I'll have to create that by hand.
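A minimal numpy sketch of what inverting such a rigid (rotation plus translation) transform by hand could look like; `invert_rigid_transform` is my own name, not an existing cwipc function:

```python
import numpy as np

def invert_rigid_transform(T: np.ndarray) -> np.ndarray:
    """Invert a 4x4 rigid (rotation + translation) transform.

    Assumes the upper-left 3x3 block R is orthonormal, so R^-1 == R^T.
    The inverse is then [R^T | -R^T @ t] with the last row (0, 0, 0, 1).
    """
    R = T[:3, :3]
    t = T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv
```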
Automatic coarse calibration based on Aruco markers is working. It's actually working so well that I have merged it back into `master`, at a1b2855, so this can now be considered production-ready.
Multi-marker coarse alignment, for camera positions where not all cameras can see the (0, 0, 0) origin aruco marker so we use auxiliary markers, is also working well enough that I've merged it into `master`, at b923190.
So, back to fine calibration, or actually first to the analysis of the current calibration.
I'm working with the `offline-boxes` capture, because that shows the problem most clearly. I've created plots for all analysers that we have (`one2all`, `one2all-filtered`, `one2all-reverse`, `one2all-reverse-filtered` and `pairwise`). All the graphs are in `cwipc_test`.
The most informative graph *for this dataset* (but note the italics) is the pairwise graph:
The "camera numbers" here are the or
of the two contributing cameras. As a human we can easily see that camera 1 is opposite camera 4 and camera 2 is opposite camera 8. And I also know this is correct, because I know that the cameras are placed in the order 1-2-4-8 clockwise. We can probably detect this algorithmically if we want.
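For instance, a tiny sketch of how that detection could start, recovering the two contributing cameras from a pair number (a hypothetical helper, assuming the individual camera masks 1, 2, 4, 8):

```python
def split_pair(pair_mask: int) -> tuple[int, int]:
    """Recover the two contributing cameras from a pair number,
    given that individual cameras are identified by the bit masks 1, 2, 4, 8.
    E.g. 6 -> (2, 4), 9 -> (1, 8).
    """
    cameras = [bit for bit in (1, 2, 4, 8) if pair_mask & bit]
    assert len(cameras) == 2
    return cameras[0], cameras[1]
```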
As a human we can estimate the correspondences of the pairs:
We can also see that the correspondence errors that the current "algorithm" (but really "quick hack" is a better term) has come up with are wildly wrong. Not surprising: the current "algorithm" works by finding the peak in the histogram and then moving right until we get below `0.5*peak`.
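Roughly, I understand that quick hack to do something like this (my own reconstruction, with made-up names):

```python
import numpy as np

def estimate_correspondence(distances, nbins=200):
    """Sketch of the current histogram-peak heuristic (my reconstruction).

    Find the peak bin of the distance histogram, then walk right until the
    count drops below 0.5 * peak; the edge of that bin is the estimate.
    """
    counts, edges = np.histogram(distances, bins=nbins)
    peak_idx = int(np.argmax(counts))
    peak = counts[peak_idx]
    i = peak_idx
    while i < len(counts) and counts[i] >= 0.5 * peak:
        i += 1
    return edges[min(i, len(edges) - 1)]
```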
I will experiment with `mean` and `stddev` to see if I can get some more decent numbers. Then, if they work for this dataset, they should also be tried for the captured Jack dataset.
Mean and stddev by themselves are not going to work. Here are the results for the graph above:
```
camera 3: mean=0.02291982490598397, std=0.024645493357004753, peak=0.002320735082210554, corr=0.007816463821634046
camera 5: mean=0.04401491186517467, std=0.028213745113280272, peak=0.012704797435346417, corr=0.058097526853423245
camera 9: mean=0.016755202242697633, std=0.026204388020447566, peak=0.0018051266662951993, corr=0.002804512231245586
camera 6: mean=0.015378887548555181, std=0.023343767740899458, peak=0.001824420385722404, corr=0.003573071088354493
camera 10: mean=0.048489777693837444, std=0.028316377093057312, peak=0.0021639993720285154, corr=0.08227788490063927
camera 12: mean=0.0341007288961789, std=0.023926849570313418, peak=0.01910221049639875, corr=0.04007440525818792
```
The mean for the "good pairs" (6 and 9) is far too high.
And that is pretty logical, when you think about it: the long tails have an inordinate effect on the mean.
Next thing to try: first compute mean and stddev, then throw away all distances that are larger than (wild guess) `mean+stddev`, or maybe `2*mean`. Then compute mean and stddev on the points that remain.
Edit: another thing to try is to keep only the points in the range `[mean-stddev, mean+stddev]`.
The idea is that for the "bad pairs" this will throw away fewer of the points, but for the "good pairs" it will throw away more points.
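A minimal sketch of that iterative bracketing filter (my reconstruction, assuming a plain numpy array of per-point distances):

```python
import numpy as np

def bracket_filter(distances, iterations=3):
    """Repeatedly keep only distances in [mean - std, mean + std] and
    recompute mean and std on the surviving points."""
    d = np.asarray(distances, dtype=float)
    for _ in range(iterations):
        if d.size == 0:
            break
        mean, std = d.mean(), d.std()
        d = d[(d >= mean - std) & (d <= mean + std)]
    return d.mean(), d.std(), len(d)
```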
Tried that. Also tried running the filtering multiple times, to see how the mean and stddev behave. Used the bracketing filter `[mean-stddev, mean+stddev]` on two premises: for the "good pairs" `std > mean`, so we don't throw away any "good points" at the low end, while for the "bad pairs" `std < mean`, so we throw away points on both sides; therefore running the filter successively should not change `mean` too much for the bad pairs (while for the good pairs it will lower `mean`). Here are the results:
```
camera 3: peak=0.002320735082210554, corr=0.007816463821634046
camera 3: 0 filters: mean=0.02291982490598397, std=0.024645493357004753, nPoint=81262
camera 3: 1 filters: mean=0.012951448749072387, std=0.010969866908687875, nPoint=66992
camera 3: 2 filters: mean=0.009752436971104068, std=0.005658494330190411, nPoint=52946
camera 3: 3 filters: mean=0.009181153840633022, std=0.0032857770409335132, nPoint=32361
camera 5: peak=0.012704797435346417, corr=0.058097526853423245
camera 5: 0 filters: mean=0.04401491186517467, std=0.028213745113280272, nPoint=44935
camera 5: 1 filters: mean=0.04137415890726912, std=0.016101593626257675, nPoint=26525
camera 5: 2 filters: mean=0.04045734373520693, std=0.009445048042243229, nPoint=15533
camera 5: 3 filters: mean=0.040051235318581284, std=0.005529610745340443, nPoint=8873
camera 9: peak=0.0018051266662951993, corr=0.002804512231245586
camera 9: 0 filters: mean=0.016755202242697633, std=0.026204388020447566, nPoint=81399
camera 9: 1 filters: mean=0.006063694689290081, std=0.008691325161748644, nPoint=67977
camera 9: 2 filters: mean=0.003177805834766034, std=0.0024780598363417766, nPoint=59915
camera 9: 3 filters: mean=0.002515016672254931, std=0.0011297240188485047, nPoint=52181
camera 6: peak=0.001824420385722404, corr=0.003573071088354493
camera 6: 0 filters: mean=0.015378887548555181, std=0.023343767740899458, nPoint=93198
camera 6: 1 filters: mean=0.006850735824829795, std=0.00853339253154322, nPoint=79885
camera 6: 2 filters: mean=0.003812802111714147, std=0.0027496213518108494, nPoint=69301
camera 6: 3 filters: mean=0.003089897111975529, std=0.00140128710337592, nPoint=56900
camera 10: peak=0.0021639993720285154, corr=0.08227788490063927
camera 10: 0 filters: mean=0.048489777693837444, std=0.028316377093057312, nPoint=43545
camera 10: 1 filters: mean=0.04792050129221857, std=0.016245794338419692, nPoint=25355
camera 10: 2 filters: mean=0.04781840036709909, std=0.00938153612816915, nPoint=14709
camera 10: 3 filters: mean=0.04774644484942719, std=0.005421145457380683, nPoint=8482
camera 12: peak=0.01910221049639875, corr=0.04007440525818792
camera 12: 0 filters: mean=0.0341007288961789, std=0.023926849570313418, nPoint=73397
camera 12: 1 filters: mean=0.028625500662925147, std=0.011632248990853532, nPoint=52026
camera 12: 2 filters: mean=0.027763304899151773, std=0.0064978872262983645, nPoint=33259
camera 12: 3 filters: mean=0.027708708994364492, std=0.003789831220465292, nPoint=19513
```
This seems to be going in the right direction: the "bad pairs" (opposing cameras) have their `mean` staying put at high values. The "good pairs" have their mean going down significantly, towards what appears to be a correct value. The "not so good pairs" (3 and 12) also seem to end up at decent values.
Partial success. That is to say: this works pretty well for camera-pair measurements on the boxes:
These are pretty believable numbers!
Unfortunately it does not work well at all for the one-to-all-others measurements:
I think the problem is that this algorithm throws away any points that it can't match (which, in the case of this dataset, includes the mismatched "edges that are sticking out").
Let's first check how the pair-wise measurements work on the other datasets.
That didn't work very well. I've now made the pair-wise measurement symmetric but this needs work: at the moment it is far too expensive.
And it is also too aggressive in trying to put as many points into the overlapping set as it can. This can be seen with the loot datasets.
We should somehow re-enable the `max_distance` capping of the kdtree distance finder (I disabled it for now) but still count the points that go over it.
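A sketch of what that could look like (assuming a SciPy cKDTree here; the actual code may well use Open3D's KDTreeFlann instead):

```python
import numpy as np
from scipy.spatial import cKDTree

def pair_distances_with_cap(src_points, dst_points, max_distance):
    """Nearest-neighbour distances with an upper bound, while still
    counting the points that exceed the cap instead of silently dropping them."""
    tree = cKDTree(dst_points)
    dists, _ = tree.query(src_points, k=1, distance_upper_bound=max_distance)
    over = np.isinf(dists)  # points whose nearest neighbour is beyond the cap
    return dists[~over], int(over.sum())
```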
For future reference: when we get back to finding the "best" algorithm to align the pointclouds we should look at point-to-plane ICP with a robust kernel. From https://www.open3d.org/docs/latest/tutorial/pipelines/robust_kernels.html#Vanilla-ICP-vs-Robust-ICP I get the impression that the robust kernel is a way to deal with noise. The referenced page uses generated noise, but of course our sensors are also noisy...
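For the record, the Open3D tutorial linked above does roughly the following (the parameter values here are placeholders, not tuned for our data):

```python
import open3d as o3d

def robust_p2plane_icp(source, target, threshold, init_transform, sigma=0.05):
    # Point-to-plane ICP needs normals on the target cloud.
    target.estimate_normals()
    # A robust kernel (Tukey loss) down-weights correspondences with
    # large residuals, i.e. noisy or outlier points.
    loss = o3d.pipelines.registration.TukeyLoss(k=sigma)
    estimation = o3d.pipelines.registration.TransformationEstimationPointToPlane(loss)
    return o3d.pipelines.registration.registration_icp(
        source, target, threshold, init_transform, estimation)
```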
Copied from Slack:
Folks, in your research on registration algorithms, have you come across any that allow "pinning" of one of the variables? I.e. asking the algorithm to find the optimal transformation while specifying, for example, that the y-translation must be zero? Because if that exists then we could do fine calibration in two steps:
Actually, thinking a bit more, we not only want to pin the y-translation to 0 but also the x-rotation and z-rotation. So the only free variables should be y-rotation, x-translation and z-translation.
We might want to change the loss function so as to only consider a 2D error - which would effectively mean that in every iteration, the algorithm would be forced to change only the parameters that are considered in the loss function, because the others would have no impact on the error. There might be other ways of writing it as an optimization problem, though; we should check it out.
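One way to write it as an optimization problem would be to parametrize the transform by just the free variables, roughly like this (a sketch, not anything that exists in cwipc):

```python
import numpy as np

def constrained_transform(theta_y, tx, tz):
    """Build a rigid transform with only the free variables we want
    (rotation about y, translation in x and z); the y-translation and
    the x/z rotations are pinned to zero. An optimizer would then
    search over just these 3 parameters."""
    c, s = np.cos(theta_y), np.sin(theta_y)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, 0.0, s],
                          [0.0, 1.0, 0.0],
                          [-s, 0.0, c]])
    T[0, 3] = tx
    T[2, 3] = tz
    return T
```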
The fixer managed to make them all upright, but that's where the good news stops. They're still off by quite a bit, and moreover (and worse): the analysis algorithm produces way too optimistic values.
Inspecting this issue again after half a year of inactivity, but a lot of actually using the current registration setup in production. The comment quoted above (from 11-Dec-2023) seems to be the thing that is bothering us most at the moment: often, when running `cwipc_register --fine`, the script will report that it has managed to align all point clouds to within a few millimeters. But actual inspection of the captured point cloud clearly shows that some areas (and often important areas like the head) are off by 5-10cm.
The "solution" we are currently using is to simply try again with the subject human in a different pose, and hoping for the best.
Fixing this, or at least showing the operator something (for example a graph of the p2p distance distribution) from which they can tell this has happened, is at the moment of paramount importance.
Once we have addressed the issue above (the bad numbers coming out of our analysis) my feeling is that we should move to a "mixed" upper strategy. Right now our upper strategy is either pairwise or one-to-all-others, but maybe we should first do one round of pairwise, and after that a round of one-to-all-others.
If we do the pairwise round in the right order (i.e. most overlapping pair first) I think that should get us out of the "local minimum problem" with the boxes.
The right order should be easy to compute: for each pair, compute the upper bound of the percentage/fraction of points that could possibly overlap. High-to-low is the right order.
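A sketch of how that ordering could be computed (my own suggestion, using the fraction of points falling inside the other cloud's bounding box as the upper bound on possible overlap):

```python
import numpy as np
from itertools import combinations

def pair_order_by_overlap(clouds):
    """Order camera pairs by an upper bound on their overlap fraction.

    clouds: list of (N, 3) numpy arrays, one per camera.
    For each pair, the fraction of points of one cloud inside the
    axis-aligned bounding box of the other bounds the possible overlap.
    Returns the pairs sorted high-to-low.
    """
    def bbox_fraction(a, b):
        lo, hi = b.min(axis=0), b.max(axis=0)
        inside = np.all((a >= lo) & (a <= hi), axis=1)
        return inside.mean()

    pairs = []
    for i, j in combinations(range(len(clouds)), 2):
        bound = max(bbox_fraction(clouds[i], clouds[j]),
                    bbox_fraction(clouds[j], clouds[i]))
        pairs.append(((i, j), bound))
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```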
We need to fix the alignment once and for all.
The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41 but this link is dead now. There is an old repo for experiments at https://github.com/cwi-dis/pointclouds-alignment
New experiment data is at https://github.com/cwi-dis/cwipc_test/tree/master/pointcloud-registration-test
Current plan of attack:
- `cameraconfig.json`. Also record there the current misalignment, because it is a good value for voxelization (later, during production).