jackjansen opened this issue 2 years ago
Current best procedure (copied from Slack):
I think I've just done the best calibration we've ever done. Keeping note here so we can try again (in other words: @shirsub can try this procedure in vrsmall).
1. `cwipc_calibrate --kinect --nofine` with the A4 sheet on the floor.
2. `cwipc_view --kinect` with nothing in view, and ensure there are no points. If there are points, edit cameraconfig to fix that.
3. `cwipc_view --kinect` and `w` to save a pointcloud.
4. `cwipc_calibrate --kinect --reuse --nocoarse --nograb pointcloud_xxxxxxxxx.ply --corr 0.05`
5. The same again with `--corr 0.02`, which gave me a pretty good result (at least I think so).

I think we should start looking at opencv to help with calibration. For example, we could use an Aruco marker (see https://docs.opencv.org/4.x/d5/dae/tutorial_aruco_detection.html) to help with detection of (0, 0, 0) and do the coarse calibration automatically.
This also gives us camera positions, so we can then determine the camera ordering for the next step (which pairs of cameras we should align).
We could then capture a 3D object (maybe the boxes, but possibly a person is good enough). I have the feeling that we could use the histogram of point-to-point distances to determine how well the coarse calibration worked: I'm expecting a plateau on the left of the histogram, then a drop, then a very long tail. Somewhere along the drop would seem to be a good value for the `--corr` parameter.

We then use cumulative multi-scale ICP or something, and after that recompute the histogram. `corr` should now be lower.
Then I think we repeat with the other camera pairs.
Question is what to do after all pairs have been done, because possibly the last calibration has broken the first one.
Another option could be to first compute the `corr` for each pair, and use that to determine the order of doing the aligning, but I'm not sure whether "best first" or "worst first" is the best option. I think "best first" but I'm not sure.

But I do think that after we've finished calibration we should repeat the calculation of `corr`, because this will give us the optimal cell size for voxelization.
@fonskuijk I would really like your opinion on this idea...
Edit 20-Nov-2023: most of the information in this comment is now outdated. See below. Keeping it here for future historic reference.
Okay, let's use this issue to keep note of what we've done (and found out) so far.
- There are point clouds in `cwipc_test` that have been grabbed in vrbig and vrsmall: one of the stack of boxes, one of Jack facing the front camera and one of Jack facing between the front camera and the next camera.
- There is a script `sandbox/voxelize_curve.py` (now badly named) that voxelizes a point cloud with ever decreasing voxelsize. It creates a CSV file with an entry per voxelsize, showing how many points there are, how many points with 1 contributing camera, how many points with 2 contributing cameras, etc. (a rough sketch of the idea is shown after this list).
- The per-camera contributions can be inspected by running `cwipc_view` with `--filter 'colorize(1,"camera")'` on it.
- There is a spreadsheet `voxelize_curve_spreadsheet.numbers` to graph the results.
- (I had to fix `cwipc_view` and `cwipc_downsample()` to make this all work somewhat.)
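For reference, here is a minimal sketch of the voxelsize sweep, using Open3D and a placeholder filename. The real script also tracks the per-camera contribution counts, which plain Open3D downsampling cannot do, so this is only the skeleton of the idea:

```python
import csv
import numpy as np
import open3d as o3d

# Hypothetical sketch: sweep ever-decreasing voxel sizes and count surviving points.
pcd = o3d.io.read_point_cloud("capture.ply")   # placeholder filename
cellsize = 0.1
with open("voxelize_curve.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["cellsize", "npoints"])
    while cellsize > 0.001:
        downsampled = pcd.voxel_down_sample(voxel_size=cellsize)
        writer.writerow([cellsize, len(downsampled.points)])
        cellsize *= np.sqrt(0.5)   # the factor between cell sizes mentioned below
```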
The results are difficult to interpret. I was expecting to see some sort of a threshold effect between `cellsize > epsilon` and `cellsize < epsilon`, where `epsilon` is how good the current registration is (in other words: when the cellsize is so large that points from camera 1 and camera 2 would be merged, I would expect a super-linear drop in the point count).

This doesn't happen: except for very small and very large cell sizes I'm seeing a more-or-less linear decrease in the number of points (of approximately a factor of 2, as is to be expected since I'm using `sqrt(0.5)` as the factor between cell sizes).
Next I decided to look at the number of contributing cameras for each result point (after voxelizing) at each cellsize. Again, I was expecting to see something interesting at the boundary of `cellsize == epsilon`: something like a sudden increase in the number of points with two contributing cameras (at the expense of the number of points with one contributing camera).
But again no such luck: things look vaguely linear. Moreover, at small cellsizes there are already quite a few points that have three contributing cameras. This should not be possible with the boxes, I think: the alignment of the cameras and the boxes was such that each side could really only be seen by two cameras.
Either I am making a stupid mistake in my thinking, or there is another bug in the way `cwipc_downsample()` works. I will investigate the latter first.

Edit: there was indeed another stupid bug in `cwipc_downsample()` which led to sometimes too many contributing cameras being specified in a point.
Here is the correct version of the previous graph:
Came up with a wild plan (thanks @Silvia024 !).
What if we take a synthetic point cloud, and then create 4 point clouds from that (North, East, South, West), with each of the new point clouds having half the points of the original (basically the ones that can be seen from that direction).
Now we can slightly perturb each of the NESW point clouds, and recombine them into a single "4-camera" point cloud.
This we can now present to voxelize_curve.py
and see what it reports. And we know the ground truth, because we have done the perturbations.
So this should allow us to see what the magic numbers are that we are looking for.
Moreover, this scheme (or variations of it, for example introducing some noise) could also be used to do automatic testing of our point loud registration algorithms (once we have written them:-).
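A rough sketch of what such a generator could look like (the half-space split around the centroid, the offsets and the filenames are all assumptions, not the actual cwipc_test code):

```python
import numpy as np
import open3d as o3d

# Split a synthetic cloud into four "camera" views (N, E, S, W) and perturb two of them.
pcd = o3d.io.read_point_cloud("synthetic.ply")          # placeholder input
pts = np.asarray(pcd.points)
centered = pts - pts.mean(axis=0)

directions = {1: (0, 0, 1), 2: (1, 0, 0), 3: (0, 0, -1), 4: (-1, 0, 0)}  # N, E, S, W
offsets = {1: 0.0, 2: 0.003, 3: 0.0, 4: 0.003}          # ground-truth error per "camera"

tiles = []
for cam, direction in directions.items():
    d = np.asarray(direction, dtype=float)
    mask = centered @ d >= 0                            # roughly half the points
    tiles.append(pts[mask] + offsets[cam] * d)          # slightly perturb this "camera"

combined = o3d.geometry.PointCloud()
combined.points = o3d.utility.Vector3dVector(np.concatenate(tiles))
o3d.io.write_point_cloud("generated_4cam.ply", combined)
```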
That was a bit of a disappointment, so far. Or actually maybe it just showed our erroneous thinking:-)
Implemented the generation of registration-test pointclouds, with a settable offset for cameras 2 and 4 (1 and 3 will be correct).
Created pointclouds with an error of 0.03 and 0.003.
The results are in https://github.com/cwi-dis/cwipc_test/tree/master/pointcloud-registration-test but the quick message is that the curves still look pretty linear. There is a clear "minimum voxelsize" where points from multiple cameras start to be combined, but this "minimum voxelsize" is quite a bit smaller than the pre-determined error with which the point clouds were generated.
I could say something (based on the data for these pointclouds) like: "As soon as more than 15% of the points come from two cameras our voxelsize is larger than the error" but this would just be data-fitting.
@Silvia024 can you have a look at the data?
Another wild idea. Comments please.
For each voxelsize we voxelize the whole pointcloud, just as we do now. But we also take the 4 per-camera pointclouds and voxelize those individually. We add up the point counts of those clouds.

We now compare the two numbers. This comparison is a measure of how much camera-combining has happened.
Edit 20-Nov-2023: We decided not to go down this path, because we dropped the idea of using voxelization to compute alignment quality.
And yet another wild idea. We are now doing this on the whole pointcloud; maybe that is a bad idea because it confuses our numbers.
What if we construct the 6 pairwise point clouds (cam1+2, cam1+3, etc) and run this analysis separately on each pairwise pointcloud?
Edit: this did not work. I experimented with it by creating the generated point clouds (0.03 and 0.003) with only two cameras (dropping the other two). The curves are just as meaningless as for the full 4-camera cloud.
I am pretty sure that the same is true for the previous wild idea.
Maybe tomorrow we can have a better chat about this?

By the way, I just found this: https://github.com/IntelRealSense/librealsense/issues/10795 and it seems quite relevant and might be of interest. I will try to read it tonight.
I found two interesting papers...
- https://www.sciencedirect.com/science/article/abs/pii/S0143816618308273 : here I found it interesting that they use a cube with chessboard patterns on four sides as the calibration object, so they know all the measurements (e.g., the distance between the edges of the chessboard pattern). They put the cube such that each camera captures mainly one side of it. What is interesting to me is that, to verify the reconstructed point clouds, they check whether two adjacent sides of the cube are orthogonal;
- https://dl.acm.org/doi/abs/10.1145/3570361.3592530?casa_token=sNaOGgeRC5IAAAAA:4_WVQOtqLz9sVcqmvRoBOMKKBsBPZX6vtJjFrNd553VoJes5l49chIb8ytQejYnJUJWjCtdPMcKW : this one is very recent (just published at MobiCom!) and I think they are doing what we would also like to have in our system: "We propose Dynamic Camera Calibration (DCC) in MetaStream to address this limitation. The DCC module dynamically obtains and manages the camera pose in real-time and calibrates cameras without any pre-defined markers".
Updated two previous comments (the two wild ideas): they turned out not to work. I have also experimented with changing the "next cellsize factor" from `sqrt(0.5)` to `sqrt(sqrt(0.5))` but there is no appreciable bump around the known epsilon with which the point clouds are created.

I now think that the whole idea of using voxelization at successively smaller sizes and looking for a discontinuity simply doesn't work.

To come up with a theory: voxelization is essentially a stochastic procedure, because it depends on the origin of the grid, and the individual original points have no relationship to that grid. So whether or not two points are combined when they are less than `cellsize` apart is a matter of chance: if `cellsize > 2*distance` they are definitely combined, if `cellsize < distance/sqrt(3)` they are definitely not, but the area in between is grey. @Silvia024 does this sound reasonable?

@Silvia024 @troeggla please up-vote this comment if you agree that voxelization is a dead end. Then I can remove all the generated data and spreadsheets over in `cwipc_test/pointcloud-registration-test` (which is quite a lot).
My suggestion for the next try is to compute, for each point in `cam1cloud`, the minimum distance to any point in `cam2cloud`. First thing to look at is the histogram of these distances (April 9 comment). I hope we would see a very large peak just below `epsilon`, with a very long tail (all the cam1 points that have no corresponding cam2 point).
I think the KD-Tree is the data structure we want. And it is supported in SciPy; example code is in https://medium.com/@OttoYu/exploring-kd-tree-in-point-cloud-c9c767095923
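A minimal sketch of that suggestion, assuming `cam1cloud` and `cam2cloud` are Nx3 NumPy arrays of point coordinates (the names and filenames are illustrative), using SciPy's KD-tree:

```python
import numpy as np
import open3d as o3d
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt

# Placeholder inputs: one camera's tile and another camera's tile.
cam1cloud = np.asarray(o3d.io.read_point_cloud("cam1.ply").points)
cam2cloud = np.asarray(o3d.io.read_point_cloud("cam2.ply").points)

# Per-point minimum distance from cam1cloud to cam2cloud.
distances, _ = cKDTree(cam2cloud).query(cam1cloud, k=1)

plt.hist(distances, bins=200)
plt.xlabel("distance to nearest point in cam2cloud (m)")
plt.ylabel("number of cam1cloud points")
plt.show()
```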
The idea of using a KDTree and plotting a histogram of the minimum distances looks very promising.

There's a Jupyter notebook in cwipc_test; I'm going to convert it to a script.
To come up with a theory: voxelization is essentially a stochastic procedure [...] @Silvia024 does this sound reasonable?
The theory seems correct to me :) I agree that maybe it's not a good strategy.
I'm getting beautiful graphs, plotting (per camera pair) the cumulative distance-to-other-cloud.
All the graphs are now in cwipc_test. They look mostly usable. I'm going to change the Y axis from being counts to being fractions, that should make them even easier to interpret (I hope).
For reference in further discussion below: here is the current state of the graph for the boxes in vrsmall (with one camera about 1cm off):
And here is one of the Jack point clouds (the best one, the sideways one):
And here is the one for the generated point cloud, with a 0.03 m offset of two of the cameras:
First thing I would like to do is convert each of these lines into a number, `correspondence`. This number should be an upper bound for the registration error between the two cameras in that pair, for the given capture. If we are pretty sure the cameras have no significant overlap in field of view (pairs `(0, 3)` and `(1, 2)` in the graphs above) we return an invalid value (`NaN` or `None` or something like that).
Once we have this number we can do two things with it:

1. Use it as the `--corr` parameter to the various registration algorithms we have.
2. After running a registration algorithm, recompute the `correspondence` number and ascertain it is lower than the previous value.

Idea: we should look at the derivative of these lines (which is actually the distance histogram data) and the second derivative, possibly after smoothing the lines a little. The second derivative should start positive, then go negative, then go positive again. Maybe that last point is the number we're interested in.
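A small sketch of that derivative idea, assuming `distances` holds the per-point nearest-neighbour distances from the KD-tree step above (the bin count, smoothing width and the exact stopping rule are guesses):

```python
import numpy as np

counts, edges = np.histogram(distances, bins=200)
smoothed = np.convolve(counts, np.ones(5) / 5, mode="same")  # light smoothing
slope = np.gradient(smoothed)      # this is the second derivative of the cumulative curve
peak = int(np.argmax(smoothed))    # slope is positive before the peak, negative after it
# first bin after the peak where the slope turns non-negative again
candidates = np.where(slope[peak + 1:] >= 0)[0]
correspondence = edges[peak + 1 + candidates[0]] if len(candidates) else None
print("estimated correspondence:", correspondence)
```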
Unrelated idea (@Silvia024 comments please): we are currently looking at the symmetric combined difference between camera pairs. Would it be better to do a per-camera comparison to all other cameras combined? I.e. not compute and graph the distance per camera pair, but instead look at the distance of the points of camera `N` to the points of all other cameras together?

That should extend the "tail" of the histogram, probably making it more linear-ish. Also, the resulting `correspondence` number should be more useful, because it's a number that pertains to a single camera (instead of to a pair of cameras).
Would it be better to do a per-camera comparison to all other cameras combined?
Implemented, and it seems to produce much better results. So much better that we need to check that we are not looking at bullshit data.
Here are the graphs (for single-camera-to-all-others) that are for the same dataset as the pair graphs of yesterday:
And here are the histogram graphs of the same datasets:
The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41
@jackjansen, I can't access to this link. It says page not found. Maybe do I not have the right to access it?
That link is dead (the gitlab site disappeared when we stopped paying for it).
Getting started with a guess at the correspondence (when the histogram is "over the hill", and has descended about half way down). Looks good for the generated point clouds, and usable for the captured ones (in the right ballpark).
I just realised that the correspondence error alone is not good enough to determine in which order we should attempt to re-align the cameras. I previously thought it would be enough: just re-align the cameras starting with the worst one.
But I now think we also want to know how many points had a distance below the correspondence error: if camera 1 has `corr=0.01, n_corr_points=20` and camera 2 has `corr=0.005, n_corr_points=20000` it's probably a much better idea to start with camera 2.
Some points from our meeting this morning.
Yeah! Implemented the first registration algorithm (ICP point-to-point) and it's already a great success!
Here are the graphs for the worst dataset (Jack frontal in vrsmall):
Algorithm used:

```
find worst-aligned camera and its correspondence
repeat {
    re-align that camera using ICP-point2point
    re-create the pointcloud with the new alignment for that camera
    find worst-aligned camera and its correspondence
} until (no improvement since last step)
```
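For reference, a single re-alignment step could look roughly like this with Open3D's point-to-point ICP (the threshold, file names and surrounding loop are assumptions; the actual implementation lives in cwipc_util):

```python
import numpy as np
import open3d as o3d

# One re-alignment step: move the worst-aligned camera towards all the others.
source = o3d.io.read_point_cloud("worst_camera.ply")        # placeholder inputs
target = o3d.io.read_point_cloud("all_other_cameras.ply")
threshold = 0.05                                            # the correspondence from the analysis
result = o3d.pipelines.registration.registration_icp(
    source, target, threshold, np.identity(4),
    o3d.pipelines.registration.TransformationEstimationPointToPoint())
source.transform(result.transformation)   # this transformation updates the camera's extrinsics
```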
Observations:
- All captured datasets seem to end with alignment errors of around 3.5 mm. I wouldn't be surprised if this turns out to be the resolution of the Kinects at the given distance between camera and subject (around 2 m I think). Anyone willing to compute this number?
- I just realised that we should always use map color to depth for our use case (inwards looking at people), and never map depth to color. Because the latter will invent depth values through interpolation, which is not only harmful at the boundaries of the person (interpolation to the depth of the wall behind), but it also really won't help anything: it will really only work for predominantly flat surfaces with a lot of details in RGB (because of the higher resolution of the RGB camera).
- A result of the previous point is that I don't think the ICP variants that take color into account are going to provide any benefit for us. @Silvia024 @ashutosh3308 what do the scientists think?
- I think `ICP-point2plane` may be able to do better than the 3 mm we are getting now, because the "plane" will be determined using some form of interpolation, and therefore the camera resolution at the given distance should play less of a role. @Silvia024 @ashutosh3308 what do the scientists think? And more importantly: is there anything about this in the literature?
- "Worst aligned camera" is currently determined by multiplying the measured mis-alignment by the number of points in the alignment set. As you can see in the bottom graph above this is a bit too aggressive: camera 2 has a mis-alignment that is worse than camera 1 but it wasn't selected. We should try `sqrt(nAlignmentPoints)` or `log(nAlignmentPoints)` as the factor to see whether it makes a difference.

Yet another wild idea, this time on the "ground truth" dataset we use (both for testing and for the eventual writing of the paper).
If we start with one of the `loot` pointclouds and split it into two tiles by randomly assigning each of the points to a tile, we know exactly what the measured correspondence should be: it is the cell size of the original `1024*1024*1024` ply-file that we created the pointcloud from (the 8i files use integer coordinates, so we know the cellsize exactly).
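A tiny sketch of that split (the filename is a placeholder for one of the 8i loot frames):

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("loot.ply")            # placeholder 8i frame
pts = np.asarray(pcd.points)
mask = np.random.rand(len(pts)) < 0.5                # randomly assign each point to a tile
tile1, tile2 = pts[mask], pts[~mask]
# With integer source coordinates, the measured correspondence between the two
# tiles should come out at roughly the original grid cell size.
```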
"worst aligned camera" is currently determined by multiplying measured mis-alignment by the number of points in the alignment set. As you can see in the bottom graph above this is a bit too aggressive: camera 2 has a mis-alignment that is worse than camera 1 but it wasn't selected. We should try sqrt(nAlignmentPoints) or log(nAlignmentPoints) as the factor to see whether it makes a difference.
That seems to be a good idea, with `log()` working better than `sqrt()`. Here are the results for jack-sideways, first using `nAlignmentPoints` in the weight, then using `log(nAlignmentPoints)`. We not only did an extra step in the latter case, but also aligned different cameras at the previous steps. You can see that the latter got to a better result eventually.
But: we may be able to get even better results through completely different means. For example: if our previous algorithm didn't get any more improvement (the end condition of the loop above) we just re-align all cameras once, in no particular order. And if that resulted in any improvement we continue with our loop.
Note for the future: at the end of the re-registration run we have a correspondence. We should voxelize the pointcloud to this number (or possibly some function `f()` of this number). The expectation is that this voxelized pointcloud will nicely have all the "synthesised cameras" for the tile overlaps. We can of course visually inspect this resulting point cloud, but we can probably also compute how well the whole process worked by running the registration analyser on it.

This pointcloud will have 15 virtual cameras (with 4 real cameras the per-point set of contributing cameras can take 2^4 - 1 = 15 values), and the correspondence between those cameras should be very bad. Actually, the number to look at now is not so much the correspondence error but the number of points in the correspondence set: we are expecting only the points that are "on the edge" of the tile to have a low distance to the other tile clouds.
But: we may be able to get even better results through completely different means. For example: if our previous algorithm didn't get any more improvement (the end condition of the loop above) we just re-align all cameras once, in no particular order. And if that resulted in any improvement we continue with our loop.
Implemented this (looping over the other cameras after the worst camera doesn't show any more improvement), and it turns out not to be a good idea.
In essence the only thing it does is to move some "badness" from one camera to another. So, each time through the loop the camera that's being re-aligned will improve a bit, but other cameras will get a bit worse. I'm not sure I understand why this happens...
And there's a problem about determining when to terminate.
Observations:
- All captured datasets seem to end with alignment errors of around 3.5 mm. I wouldn't be surprised if this turns out to be the resolution of the Kinects at the given distance between camera and subject (around 2 m I think). Anyone willing to compute this number?
I think so, I found that it should be: "~ 3 mm per pixel x by y at 2 m"
A result of the previous point is that I don't think the ICP variants that take color into account are going to provide any benefit for us. @Silvia024 @ashutosh3308 what do the scientists think?

I need to check this; I don't have an answer right now.
I think `ICP-point2plane` may be able to do better than the 3 mm we are getting now, because the "plane" will be determined using some form of interpolation, and therefore the camera resolution at the given distance should play less of a role. @Silvia024 @ashutosh3308 what do the scientists think? And more importantly: is there anything about this in the literature?
From a quick look at the literature, I see that `ICP-point2point` generally has worse performance than `ICP-point2plane`, but is faster to compute.

I will check the literature more thoroughly.
Ideas:

- First use geometry-only ICP, and then try with color to see if it gives any improvement @jackjansen
- Try to check temporal consistency to improve alignment dynamically
- Publication idea: testing that the alignment metric works + benchmarking a bunch of registration algorithms
- With lots of data, see if we can bring the precision error to a minimum by optimizing over the matrices
This is the path for the calibration algorithms: `python/cwipc/scripts/cwipc_calibrate/calibrator.py` (https://github.com/cwi-dis/cwipc_util/blob/18-alignment/python/cwipc/scripts/cwipc_calibrate/calibrator.py, on the 18-alignment branch).
Implemented ICP_Point2Plane, but it seems to be performing slightly worse than ICP_Point2Point.
I can see two possible reasons for this, one of them being the choice of the parameters `distance` and `nNeighbour`: I have no idea where the values of `0.02` and `30` that we currently use come from. They may be wildly incorrect for our use case (human bodies).

I think the best way forward is to first create the loot-based test data set, described above.
Created the "simple" loot-based dataset (i.e. loot split into 4 halves). This showed enough problems with the algorithms that it kept me busy for a day:-)
One of the problems is that the "outer algorithm", the one that selects the camera to re-align, can start oscillating: fixing camera 2 makes camera 4 worse, then fixing camera 4 makes camera 2 worse, etc. Currently the stop condition is based on the error of the worst camera (taking the weight into account); maybe the stop condition should be based (or also based?) on the overall error?
And there is a problem of ever-diminishing returns for an extra run (which showed up with loot). This has been fixed for now by stopping if the error is less than `0.001`, but this number should really be based on the precision of the source point cloud.
And from a practical point of view: maybe now is the time to turn `cwipc_calibrate` and the `sandbox/registration_error_*` code into a few classes, and create a new script `cwipc_register` that uses those, to replace `cwipc_calibrate`.

Initially the classes would use the current method of coarse calibration (selecting the points), but the work on automating that could then happen in parallel.
Implemented part of
And from a practical point of view: maybe now is the time to turn cwipc_calibrate and the sandbox/registration_error_* code into a few classes,
and also
Note for the future: at the end of the re-registration run we have a correspondence. We should voxelize the pointcloud to this number (or possibly some function f() of this number). The expectation is that this voxelized pointcloud will nicely have all the "synthesised cameras" for the tile overlaps.
The results are somewhat disappointing: for reasonable choices of `f()` (I tried `factor*correspondence`, for factors `sqrt(0.5)`, `1`, `sqrt(2)`, `2`) there are not many points in the "intermediate" tiles. About 10% or so when using `1*correspondence`. I have to go up to a factor of 4 or so before a significant number of points get into the new "synthesised intermediate camera" tiles. Even for the boxes dataset.
I have no idea why this happens, nor a good idea of how we could investigate it.
Got started on playing with Aruco markers. Created a dataset `vrsmall-noreg` that has a capture of aruco markers.

Looks like it's absolutely feasible to find the markers in the RGB images, but the `(u, v)` indices in RGB-image-space cannot easily be converted to `(x, y, z)` coordinates in PLY-space.
Three possible paths forward:
Also found a message from someone trying to do similar things to use: https://forum.openframeworks.cc/t/align-multiple-kinect-point-clouds-with-aruco-markers/41182
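For the RGB-side detection mentioned above, a minimal sketch with OpenCV could look like this (this uses the `cv2.aruco.ArucoDetector` API of OpenCV 4.7+, older versions use `cv2.aruco.detectMarkers` directly, and the dictionary choice and filename are guesses):

```python
import cv2

# Detect ArUco markers in a captured RGB image (filename is a placeholder).
image = cv2.imread("rgb_capture.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
corners, ids, rejected = detector.detectMarkers(gray)
# corners are (u, v) pixel coordinates; mapping these back to (x, y, z)
# in the point cloud is the hard part discussed above.
print(ids, corners)
```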
Interesting results also on fine-calibration.
Created a dataset `vrsmall-worse` after tilting three cameras in three different planes (x, y, z). The resulting point clouds are clearly off.

The fixer managed to make them all upright, but that's where the good news stops. They're still off by quite a bit, and moreover (and worse): the analysis algorithm produces way too optimistic values.

That is the first problem to fix. It is clearest from examining the graphs in `vrsmall-worse/boxes` and comparing the computed correspondence values to what a human can see from the plots: the values are way too optimistic.

The second thing to think about is the floor: whether we should include the floor in the capture. vrsmall-worse/boxes shows a reason to do so: it would probably have prevented the camera 4 point cloud (the green one) from being positioned "too high". But on the other hand: having a lot of "floor points" matched may lead the overall algorithm to be happy too soon ("I've matched up all these floor points, who cares about a few person points").
the analysis algorithm produces way too optimistic values.
Let's start by analysing this problem. Here's the initial distances histogram for `vrsmall-worse/boxes`:

Visually it is clear that both camera 1 (blue) and camera 2 (orange) have a correspondence error of approximately 0.1, but the analysis algorithm gives a value of around 0.04.

Also, camera 8 (red) seems to have precious few points in the correspondence set, making that result not very trustworthy. The same is true, to a lesser extent, for camera 4 (green).
Wild idea: part of the problem is that we are computing correspondence on the whole point clouds S (source, the points from the camera that we are interested in) and D (destination, all the other camera point clouds combined).

If we could instead compute the subset point clouds Soverlap and Doverlap, consisting of the points that could reasonably be expected to be in the overlap between S and D, this should help the computation of correspondence a lot: for each point in Soverlap we compute the distance to the nearest point in Doverlap, then compute mean and stddev of those distances. Those two numbers should allow us to compute a good candidate for the correspondence (probably something like `mean + stddev/2` or so, but I'll leave that to the mathematicians:-).

Question to the scientists: does this make sense so far?
If the answer is yes, then here is another wild idea on how to compute Doverlap (and the same would work for Soverlap).
```
nNeighbour = 4   # or maybe 3, or 8, but some reasonably small value
for each point in D: add a field usage_count (initially 0)
for each point in S:
    compute the nNeighbour nearest neighbours in D
    for each of those nNeighbour neighbours:
        increment its usage_count
remove all points from D that have usage_count == 0
remove all points from D that have usage_count > nNeighbour
```
Rationale: the points in `D` with `usage_count == 0` are apparently very far away from any point in `S` (no point in `S` selected them as a close neighbour), so we can get rid of them. On the other hand, the points with `usage_count > some threshold` are apparently very close to `S`, because many points in `S` that do not have any "natural neighbours" in `D` have selected them as one of the closest. So we should get rid of these points as well, because they will skew the mean and stddev later. Using `nNeighbour` as `some threshold` is a wild guess.
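A rough sketch of the pseudocode above with SciPy's KD-tree, assuming `S` and `D` are Nx3 NumPy arrays (the function name is illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap_subset(S, D, nNeighbour=4):
    # For each point in S, find its nNeighbour nearest neighbours in D.
    _, idx = cKDTree(D).query(S, k=nNeighbour)
    # Count how often each point of D was selected.
    usage_count = np.bincount(idx.ravel(), minlength=len(D))
    # Drop D points that were never selected, or selected "too often".
    keep = (usage_count > 0) & (usage_count <= nNeighbour)
    return D[keep]
```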
Scientists: does this sound reasonable?
I guess today is a wild-idea-day:-)
Got started on playing with Aruco markers. Created a dataset vrsmall-noreg that has a capture of aruco-markers.
I was thinking about how we could create images to search from the point clouds (as opposed to using the captured images from the "aux data", because we don't always have these, and also because, as stated above, it is difficult to convert the image `(u, v)` coordinates back to spatial `(x, y, z)` coordinates).

What if we create three planes, based on the camera coordinate system: `x=0`, `y=0` and `z=0`. On each of these planes we make two distinct projections, for example for the `x=0` plane we make one image where we project all points with `x>=0` and ignore points with `x<0`, etc. (Actually we need only 5 projections, because for the `z=0` plane one of the two would be empty, as the camera can't see behind itself.)

I guess we can initialise each of the planes to something funny (green?) and hope the aruco-finding algorithm will work around that.
And if it turns out that the previous idea doesn't work: we could try to find the floor, and then project just the points that are "close to" the floor onto an image that is parallel to the floor.
Finding the floor could be done using `open3d.segment_plane`. In case there are more planes in the point cloud (for example a wall or ceiling) we could find a couple of planes. There is sample code for iteratively finding planes in https://github.com/yuecideng/Multiple_Planes_Detection
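A short sketch of that, using Open3D's RANSAC plane segmentation (the thresholds are untuned guesses, the filename a placeholder):

```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("capture.ply")     # placeholder
plane_model, inlier_idx = pcd.segment_plane(
    distance_threshold=0.01, ransac_n=3, num_iterations=1000)
a, b, c, d = plane_model                         # floor plane: ax + by + cz + d = 0
floor = pcd.select_by_index(inlier_idx)
rest = pcd.select_by_index(inlier_idx, invert=True)
```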
Turning the pointclouds into images turns out to be complicated. Part of the problem is the different coordinate systems of open3d, opencv and cwipc (left-handed, right-handed, etc). Part of the problem is that we're using map-color-to-depth, so there is not a lot of resolution left in the pointcloud (basically the D-camera resolution, which is 640x576 in the `vrsmall-noreg` capture). This isn't good enough to detect the aruco markers, at least not when projecting to the "camera plane".

The aruco-detector had no problem with the RGB images (1280x720), so maybe we should capture the coarse calibration pointclouds with map-depth-to-color?
We know that we're looking at planes anyway, when we're looking for the markers.
And, actually, this may be true also for the fine calibration: having more intermediate points on the "surface" of the subject may be a benefit that outweighs the negative effect of having "invented" points, as mentioned above in:
I just realised that we should always use map color to depth for our use case
That point still holds for production, I think, but maybe not for registration.
Paused working on the aruco markers for now.
Implemented a manual coarse calibration algorithm (basically the same as what we have: select the four coloured corners of the calibration target).
This works; the resulting pointcloud and the correspondence graph are in `pointcloud-registration-test/vrsmall-noreg/capture`.

Here is the graph:

The correspondence errors are too optimistic, but visual inspection of the graph shows that something like 0.05 is probably the correct number.

But: these pointclouds are so big (500K points per camera) that the analysis is very expensive: it takes about a minute on my M1 Mac, and I didn't wait for it on the Intel Mac. That's a shame, because for our common use case (capturing in a room where the walls are close enough to be captured by the cameras) this would actually be a very useful test capture...
In case we decide to go for detecting the Aruco markers in the RGB image and then finding the corresponding points in the point cloud: here are some interesting links about how the conversion of image pixel coordinates to 3D points can be done using the librealsense APIs:
https://github.com/IntelRealSense/librealsense/issues/8221 https://medium.com/@yasuhirachiba/converting-2d-image-coordinates-to-3d-coordinates-using-ros-intel-realsense-d435-kinect-88621e8e733a https://github.com/IntelRealSense/librealsense/issues/11031
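For reference, the deprojection route those links describe looks roughly like this with pyrealsense2 (so it is RealSense-specific; it assumes the depth frame is aligned to the colour frame, so that a marker pixel `(u, v)` indexes both, and the pixel values are illustrative):

```python
import pyrealsense2 as rs

pipeline = rs.pipeline()
pipeline.start()
align = rs.align(rs.stream.color)                 # align depth to the colour frame
frames = align.process(pipeline.wait_for_frames())
depth_frame = frames.get_depth_frame()
intrinsics = depth_frame.profile.as_video_stream_profile().intrinsics

u, v = 640, 360                                   # e.g. a detected marker corner
depth = depth_frame.get_distance(u, v)            # metres
x, y, z = rs.rs2_deproject_pixel_to_point(intrinsics, [u, v], depth)
pipeline.stop()
```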
Back to fine registration.
There is a thinking error in the idea that only looking at the correspondence between each `(camera-N, all-cameras-except-N)` pair is good enough.
This is shown clearly in a new dataset `offline-boxes`. See there for the full details, but here is a screenshot of the calibration it managed to come up with:
Visually, it is clear what problem 1 is: the green and blue camera together (or the red and brown camera together) need to be moved to make things better. But looking at each camera in isolation is never going to work: if we would move the green camera to better align with the red one, we would completely lose the alignment with the blue camera.
There is also a problem 2 with the analysis. Here is what our code thinks of the alignment it came up with:
Those numbers (14 mm to 18 mm) are far too optimistic.
For problem 1 a potential solution is to not only look at correspondences between `(camera-N, all-cameras-except-N)` but also `(camera-N-and-M, all-cameras-except-N-and-M)` for every combination `(N, M)` where those two cameras are "closest to each other". If an `N-and-M` camera combination would end up as the candidate-to-fix, we would try fixing that "camera" and apply the resulting transformation to both cameras.

But this feels like a hacker-jack solution: it seems like it might fix this problem, but I'm not really sure it's really a solution and not a quick hack.
I have the feeling that tackling problem 2 first may be better.
Ideas, anyone?
Some progress with coarse calibration.
I'm manually positioning the pointcloud in the open3d viewer so that the virtual camera is approximately where the physical camera was. I then grab the RGB and D images.
I can now detect the aruco markers in the RGB image.
But this is where the good news stops: the next step is converting the `(u, v)` coordinates that the aruco detector returns (coordinates in the RGB image) back to `(x, y, z)` coordinates in the point cloud.

I think I'm doing all the right things: I'm taking the depth from `D[v, u]`, using the camera intrinsics to de-project, and then transforming with the extrinsic matrix.

But I always end up with weird origins. I've tried transposing the extrinsic matrix, I've tried mirroring `x`, or mirroring `y` and `z` (which was suggested somewhere), but nothing seems to work.
See https://github.com/isl-org/Open3D/issues/6508 for a decent description of the open3d coordinate system (RH y-UP) and its idiosyncrasies with respect to lookat and forward.
We need to fix the alignment once and for all.
The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41 but this link is dead now. There is an old repo for experiments at https://github.com/cwi-dis/pointclouds-alignment
New experiment data is at https://github.com/cwi-dis/cwipc_test/tree/master/pointcloud-registration-test
Current plan of attack:
- Save the resulting registration in `cameraconfig.json`. Also record there the current misalignment, because it is a good value for voxelization (later, during production).