cwi-dis / cwipc


Alignment of multi-camera pointclouds needs work #18

Open · jackjansen opened this issue 2 years ago

jackjansen commented 2 years ago

We need to fix the alignment once and for all.

The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41 but this link is dead now.

There is an old repo for experiments at https://github.com/cwi-dis/pointclouds-alignment

New experiment data is at https://github.com/cwi-dis/cwipc_test/tree/master/pointcloud-registration-test

Edit 20-Nov-2023: most of the text above is outdated. Kept for future reference.

Current plan of attack:

jackjansen commented 1 year ago

Current best procedure (copied from Slack):

I think I've just done the best calibration we've ever done. Keeping note here so we can try again (in other words: @shirsub can try this procedure in vrsmall).

  1. Use 4 cameras, not 3.
  2. First do the coarse calibration cwipc_calibrate --kinect --nofine with the A4 sheet on the floor.
  3. Edit cameraconfig, set correct high/low, near/far, radius, erosion=2
  4. Do cwipc_view --kinect with nothing in view, and ensure there are no points. If there are points edit cameraconfig to fix.
  5. At the centerpoint stack 3 IKEA boxes so each camera sees 2 sides (and each side is seen by two cameras)
  6. Run cwipc_view --kinect and press w to save a pointcloud.
  7. cwipc_calibrate --kinect --reuse --nocoarse --nograb pointcloud_xxxxxxxxx.ply --corr 0.05
  8. (The 0.05 is how far the per-camera pointclouds are apart in the grabbed image. For many years I kept picking numbers here that were too low; the trick is to pick a higher number.)
  9. Pick algorithm 6 (cumulative multiscale ICP), variant 2 (point2plane)
  10. Inspect the result, after pressing - a few times to make the points smaller. If you look at the stack of boxes from above you get a very good idea of the quality.
  11. The way the algorithm currently works means (I think) that it will be happy after doing 3 steps, so with the new calibration I did another grab and calibration with --corr 0.02, which gave me a pretty good result (at least I think so).

jackjansen commented 1 year ago

I think we should start looking at opencv to help with calibration. For example, we could use an Aruco marker (see https://docs.opencv.org/4.x/d5/dae/tutorial_aruco_detection.html) to help with detection of (0, 0, 0) and do the coarse calibration automatically.

This also gives us camera positions, so we can then determine the camera ordering for the next step (which pairs of cameras we should align).

We could then capture a 3D object (maybe the boxes, but possibly a person is good enough). I have the feeling that we could use the histogram of point-to-point distances to determine how well the coarse calibration worked: I'm expecting a plateau on the left of the histogram, then a drop, then a very long tail. Somewhere along the drop would seem to be a good value for the --corr parameter.

We then use cumulative multiscale ICP or something similar, and after that recompute the histogram. The corr value should now be lower.

Then I think we repeat with the other camera pairs.

The question is what to do after all pairs have been done, because the last alignment may well have broken the first one.

Another option could be to first compute the corr for each pair and use that to determine the order in which to do the aligning, but I'm not sure whether "best first" or "worst first" is the better choice. My guess is "best first".

But I do think that after we've finished calibration we should repeat the calculation of corr, because this will give us the optimal cell size for voxelization.

@fonskuijk I would really like your opinion on this idea...

jackjansen commented 10 months ago

Edit 20-Nov-2023: most of the information in this comment is now outdated. See below. Keeping it here for future historic reference.

Okay, let's use this issue to keep note of what we've done (and found out) so far.

The results are difficult to interpret. I was expecting to see some sort of threshold effect between cellsize > epsilon and cellsize < epsilon, where epsilon is how good the current registration is (in other words: when the cellsize becomes so large that points from camera 1 and camera 2 get merged, I would expect a super-linear drop in the point count).

This doesn't happen. Except for very small and very large cell sizes I'm seeing a more-or-less linear decrease in the number of points (of approximately a factor of 2, as is to be expected since I'm using sqrt(0.5) as the factor between successive cell sizes).

point-count-factors-per-cellsize

Next I decided to look at the number of contributing cameras for each result point (after voxelizing) at each cellsize. Again, I was expecting to see something interesting at the boundary of cellsize == epsilon: something like a sudden increase in the number of points with two contributing cameras (at the expense of the number of points with one contributing camera).

But again no such luck: things look vaguely linear. Moreover, even at small cellsizes there are quite a few points that have three contributing cameras. That should not be possible with the boxes: the placement of the cameras and the boxes was such that each side could really only be seen by two cameras.

n-camera-contributions-cumulative

Either I am making a stupid mistake in my thinking, or there is another bug in the way cwipc_downsample() works. I will investigate the latter first.

Edit: there was indeed another stupid bug in cwipc_downsample() which led to sometimes too many contributing cameras being specified in a point.

Here is the correct version of the previous graph:

Screenshot 2023-11-15 at 17 01 50

jackjansen commented 10 months ago

Came up with a wild plan (thanks @Silvia024 !).

What if we take a synthetic point cloud and create 4 point clouds from it (North, East, South, West), with each of the new point clouds having half the points of the original (basically the ones that can be seen from that direction)?

Now we can slightly perturb each of the NESW point clouds, and recombine them into a single "4-camera" point cloud.

This we can now present to voxelize_curve.py and see what it reports. And we know the ground truth, because we have done the perturbations.

So this should allow us to see what the magic numbers are that we are looking for.

Moreover, this scheme (or variations of it, for example introducing some noise) could also be used for automatic testing of our point cloud registration algorithms (once we have written them :-).
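
To make the idea concrete, here is a minimal NumPy sketch of the split-and-perturb step (this is not the actual generator in cwipc_test; the array layout, the Y-up assumption and the offset magnitude are all just illustrative):

```python
import numpy as np

def split_and_perturb(points, max_offset=0.03, seed=42):
    """Split one synthetic cloud into 4 'camera' clouds (N/E/S/W), each slightly offset.

    points is an (N, 3) array with Y up and the subject centred on the origin.
    Each synthetic camera keeps roughly the half of the points facing it, and
    gets its own small rigid offset: the known ground-truth error.
    """
    rng = np.random.default_rng(seed)
    centred = points - points.mean(axis=0)
    x, z = centred[:, 0], centred[:, 2]
    views = {"north": z >= 0, "east": x >= 0, "south": z < 0, "west": x < 0}
    per_camera = {}
    for name, mask in views.items():
        offset = rng.uniform(-max_offset, max_offset, size=3)
        per_camera[name] = points[mask] + offset
    return per_camera

# Re-combining per_camera.values() gives a "4-camera" cloud with known misalignment.
```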

jackjansen commented 10 months ago

That was a bit of a disappointment, so far. Or actually maybe it just exposed our erroneous thinking :-)

Implemented the generation of registration-test pointclouds, with a settable offset for cameras 2 and 4 (cameras 1 and 3 remain correct).

Created pointclouds with an error of 0.03 and 0.003.

The results are in https://github.com/cwi-dis/cwipc_test/tree/master/pointcloud-registration-test but the quick message is that the curves still look pretty linear. There is a clear "minimum voxelsize" at which points from multiple cameras start to be combined, but this "minimum voxelsize" is quite a bit smaller than the predetermined error with which the point clouds were generated.

I could say something (based on the data for these pointclouds) like: "As soon as more than 15% of the points come from two cameras our voxelsize is larger than the error" but this would just be data-fitting.

@Silvia024 can you have a look at the data?

jackjansen commented 10 months ago

Another wild idea. Comments please.

For each voxelsize we voxelize the whole pointcloud, just as we do now. But we also take the 4 per-camera pointclouds and voxelize those individually, and add up the point counts of those clouds.

We now compare the two numbers. This comparison is a measure of how much camera-combining has happened.

Edit 20-Nov-2023: We decided not to go down this path, because we dropped the idea of using voxelization to compute alignment quality.

jackjansen commented 10 months ago

And yet another wild idea. We are now doing this on the whole pointcloud, maybe that is a bad idea because it confuses our numbers.

What if we construct the 6 pairwise point clouds (cam1+2, cam1+3, etc) and run this analysis separately on each pairwise pointcloud?

Edit: this did not work. I experimented with it by creating the generated point clouds (0.03 and 0.003) with only two cameras (dropping the other two). The curves are just as meaningless as for the full 4-camera cloud.

I am pretty sure that the same is true for the previous wild idea.

Silvia024 commented 10 months ago

Maybe tomorrow we can have a proper chat about this?

btw I just found this: https://github.com/IntelRealSense/librealsense/issues/10795 and it seems quite relevant and might be of interest. I will try to read it tonight.

Silvia024 commented 10 months ago

I found two interesting papers...

jackjansen commented 10 months ago

Updated two previous comments (the two wild ideas): they turned out not to work. I have also experimented with changing the "next cellsize factor" from sqrt(0.5) to sqrt(sqrt(0.5)) but there is no appreciable bump around the known epsilon with which the point clouds are created.

I now think that the whole idea of using voxelization at successively smaller sizes and looking for a discontinuity simply doesn't work.

To come up with a theory: voxelization is essentially a stochastic procedure, because it depends on the origin of the grid, and the individual original points have no relationship to that grid. So whether or not two points that are less than cellsize apart get combined is a matter of chance: if cellsize > 2*distance they are definitely combined, if cellsize < distance/sqrt(3) they are definitely not, but the area in between is grey. @Silvia024 does this sound reasonable?

@Silvia024 @troeggla please up-vote this comment if you agree that voxelization is a dead end. Then I can remove all the generated data and spreadsheets over in cwipc_test/pointcloud-registration-test (which is quite a lot).

jackjansen commented 10 months ago

My suggestion for the next try is to compute, for each point in cam1cloud, the minimum distance to any point in cam2cloud. The first thing to look at is the histogram of these distances (April 9 comment). I hope we will see a very large peak just below epsilon, with a very long tail (all the cam1 points that have no corresponding cam2 point).

I think the KD-tree is the data structure we want. It is available in SciPy (scipy.spatial.KDTree); example code is in https://medium.com/@OttoYu/exploring-kd-tree-in-point-cloud-c9c767095923
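
A minimal sketch of that computation plus the histogram, using scipy.spatial.cKDTree (the two clouds here are random placeholders standing in for the real per-camera point clouds):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree

# Placeholders for the real per-camera clouds, as (N, 3) arrays of XYZ coordinates.
rng = np.random.default_rng(0)
cam1_points = rng.random((5000, 3))
cam2_points = rng.random((5000, 3)) + 0.003  # shifted by a small known "registration error"

# For every cam1 point: distance to its nearest neighbour in cam2.
distances, _ = cKDTree(cam2_points).query(cam1_points, k=1)

plt.hist(distances, bins=200)
plt.xlabel("distance to nearest cam2 point (m)")
plt.ylabel("number of cam1 points")
plt.show()
```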

jackjansen commented 10 months ago

The idea of using a KD-tree and plotting a histogram of the minimum distances looks very promising.

There's a Jupyter notebook in cwipc_test, I'm going to convert it to a script.

Silvia024 commented 10 months ago

I now think that the whole idea of using voxelization at successively smaller sizes and looking for a discontinuity simply doesn't work. [...] @Silvia024 does this sound reasonable?

the theory seems correct to me :) agree that maybe it's not a good strategy

jackjansen commented 10 months ago

I'm getting beautiful graphs, plotting (per camera pair) the cumulative distance-to-other-cloud.

cumdist

jackjansen commented 10 months ago

All the graphs are now in cwipc_test. They look mostly usable. I'm going to change the Y axis from being counts to being fractions, that should make them even easier to interpret (I hope).

jackjansen commented 9 months ago

For reference in further discussion below: here is the current state of the graph for the boxes in vrsmall (with one camera about 1cm off):

boxes_cumdist

And here is one of the Jack point clouds (the best one, the sideways one):

jack-sideways_cumdist

And here is the one for the generated point cloud, with a 0.03 m offset of two of the cameras:

genregtest03_cumdist

jackjansen commented 9 months ago

First thing I would like to do is convert each of these lines into a single number, the correspondence. This number should be an upper bound for the registration error between the two cameras in that pair, for the given capture. If we are pretty sure the cameras have no significant overlap in field of view (pairs (0, 3) and (1, 2) in the graphs above) we return an invalid value (NaN or None or something like that).

Once we have this number we can do two things with it:

jackjansen commented 9 months ago

Idea: we should look at the derivative of these lines (which is actually the distance histogram data) and the second derivative. Possibly after smoothing the lines a little.

The second derivative should start positive, then go negative, then go positive again. Maybe that last point is the number we're interested in.
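
Something like the following sketch, perhaps (the smoothing window and the "turns positive again" criterion are guesses; cum_counts and distances stand for the data behind the cumulative plots):

```python
import numpy as np

def correspondence_from_cumulative(cum_counts, distances, smooth=5):
    """Guess a correspondence value from a cumulative distance curve.

    cum_counts[i] is the number of points with nearest-neighbour distance <= distances[i].
    The first derivative of the smoothed curve is (roughly) the distance histogram and
    the second derivative its slope; we return the distance where the second derivative
    turns positive again after having gone negative.
    """
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(cum_counts, kernel, mode="same")
    d1 = np.gradient(smoothed, distances)  # ~ the histogram
    d2 = np.gradient(d1, distances)        # ~ the histogram slope
    negative = np.where(d2 < 0)[0]
    if len(negative) == 0:
        return None
    positive_again = np.where(d2[negative[0]:] > 0)[0]
    return distances[negative[0] + positive_again[0]] if len(positive_again) else None
```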

jackjansen commented 9 months ago

Unrelated idea (@Silvia024 comments please): we are currently looking at the symmetric combined difference between camera pairs.

Would it be better to do a per-camera comparison to all other cameras combined? I.e. not compute and graph the distance per camera pair, but instead look at the distance of the points of camera N to the points of all other cameras together?

That should extend the "tail" of the histogram, probably making it more linear-ish. Also, the resulting correspondence number should be more useful, because it pertains to a single camera (instead of to a pair of cameras).

jackjansen commented 9 months ago

Would it be better to do a per-camera comparison to all other cameras combined?

Implemented, and it seems to produce much better results. So much better that we need to check that we are not looking at bullshit data.

Here are the graphs (for single-camera-to-all-others) that are for the same dataset as the pair graphs of yesterday:

boxes_cumdist_one2all

jack-sideways_cumdist_one2all

genregtest03_cumdist_one2all

jackjansen commented 9 months ago

And here are the histogram graphs of the same datasets:

boxes_histogram_one2all

jack-sideways_histogram_one2all

genregtest03_histogram_one2all

Silvia024 commented 9 months ago

The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41

@jackjansen, I can't access to this link. It says page not found. Maybe do I not have the right to access it?

jackjansen commented 9 months ago

The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41

@jackjansen, I can't access to this link. It says page not found. Maybe do I not have the right to access it?

That link is dead (the gitlab site disappeared when we stopped paying for it).

jackjansen commented 9 months ago

Getting started with a guess at the correspondence (the point where the histogram is "over the hill" and has descended about halfway down). Looks good for the generated point clouds, and usable for the captured ones (in the right ballpark).

boxes_histogram_one2all

jack-sideways_histogram_one2all

genregtest03_histogram_one2all
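
For reference, a minimal sketch of that "over the hill and halfway down" guess on histogram data (bin counts and bin centres are assumed to be available as arrays):

```python
import numpy as np

def correspondence_guess(counts, bin_centers):
    """First bin past the histogram peak where the count has dropped to half the peak value."""
    peak = int(np.argmax(counts))
    below_half = np.where(counts[peak:] <= counts[peak] / 2)[0]
    return bin_centers[peak + below_half[0]] if len(below_half) else None
```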

jackjansen commented 9 months ago

I just realised that the correspondence error alone is not good enough to determine in which order we should attempt to re-align the cameras. I previously thought it would be enough: just re-align the cameras starting with the worst one.

But I now think we also want to know how many points had a distance below the correspondence error: if camera 1 has corr=0.01, n_corr_points=20 and camera 2 has corr=0.005, n_corr_points=20000 it's probably a much better idea to start with camera 2.

Silvia024 commented 9 months ago

Some points from our meeting this morning.

jackjansen commented 9 months ago

Yeah! Implemented the first registration algorithm (ICP point-to-point) and it's already a great success!

Here are the graphs for the worst dataset (Jack frontal in vrsmall):

jack-forward_histogram_one2all

jack-forward_histogram_after_step_5

Algorithm used:

find worst-aligned camera and its correspondence
repeat {
  re-align that camera using ICP-point2point
  re-create the pointcloud with the new alignment for that camera
  find worst-aligned camera and its correspondence
} until (no improvement since last step)

Observations:

jackjansen commented 9 months ago

Yet another wild idea, this time on the "ground truth" dataset we use (both for testing and for the eventual writing of the paper).

If we start with one of the loot pointclouds and split it into two tiles by randomly assigning each of the points to a tile, we know exactly what the measured correspondence should be: it is the cell size of the original 1024*1024*1024 ply file that we created the pointcloud from (the 8i files use integer coordinates, so we know the cellsize exactly).
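
A minimal sketch of that split, assuming the loot cloud has been loaded as an (N, 3) array of its original integer grid coordinates:

```python
import numpy as np

def split_into_two_tiles(points, seed=0):
    """Randomly assign every point to one of two tiles.

    Because the 8i "loot" files use integer grid coordinates, the measured
    correspondence between the two resulting tiles should be exactly the
    grid cell size of the source file.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(len(points)) < 0.5
    return points[mask], points[~mask]
```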

jackjansen commented 9 months ago

"worst aligned camera" is currently determined by multiplying measured mis-alignment by the number of points in the alignment set. As you can see in the bottom graph above this is a bit too aggressive: camera 2 has a mis-alignment that is worse than camera 1 but it wasn't selected. We should try sqrt(nAlignmentPoints) or log(nAlignmentPoints) as the factor to see whether it makes a difference.

That seems to be a good idea, with log() working better than sqrt(). Here are the results for jack-sideways, first with using nAlignmentPoints in the weight, then with using log(nAlignmentPoints). We not only did an extra step in the latter case, but also aligned different cameras at the previous steps. You can see that the latter got to a better result eventually.

But: we may be able to get even better results through completely different means. For example: if our previous algorithm didn't get any more improvement (the end condition of the loop above) we just re-align all cameras once, in no particular order. And if that resulted in any improvement we continue with our loop.

jack-sideways_histogram_after_step_6_nPoints

jack-sideways_histogram_after_step_7
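
To be explicit about the weighting that is being compared here (the numbers below are made up and the real code may compute this differently; this is only a sketch of the selection rule):

```python
import math

# (camera id, measured mis-alignment in m, number of points in the alignment set)
analysis = [(1, 0.004, 1200), (2, 0.006, 300), (4, 0.0035, 20000), (8, 0.005, 5000)]

def worst_camera(analysis, weight=math.log):
    """Camera to re-align next: mis-alignment multiplied by a (damped) point-count weight."""
    return max(analysis, key=lambda cam: cam[1] * weight(cam[2]))

print(worst_camera(analysis))               # log(nAlignmentPoints) weighting
print(worst_camera(analysis, math.sqrt))    # sqrt(nAlignmentPoints) weighting
print(worst_camera(analysis, lambda n: n))  # plain nAlignmentPoints weighting
```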

jackjansen commented 9 months ago

Note for the future: at the end of the re-registration run we have a correspondence. We should voxelize the pointcloud to this number (or possibly some function f() of this number). The expectation is that this voxelized pointcloud will nicely have all the "synthesised cameras" for the tile overlaps. We can of course visually inspect this resulting point cloud, but we can probably also compute how well the whole process worked by running the registration analyser on it.

This pointcloud will have 15 virtual cameras, and the correspondence between those cameras should be very bad. Actually, the number to look at now is not so much the correspondence error but the number of points in the correspondence set: we are expecting only the points that are "on the edge" of the tile to have a low distance to the other tile clouds.

jackjansen commented 9 months ago

But: we may be able to get even better results through completely different means. For example: if our previous algorithm didn't get any more improvement (the end condition of the loop above) we just re-align all cameras once, in no particular order. And if that resulted in any improvement we continue with our loop.

Implemented this (looping over the other cameras after the worst camera doesn't show anymore improvement), and it turns out not to be a good idea.

In essence the only thing it does is to move some "badness" from one camera to another. So, each time through the loop the camera that's being re-aligned will improve a bit, but other cameras will get a bit worse. I'm not sure I understand why this happens...

And there's a problem with determining when to terminate.

Silvia024 commented 9 months ago

Observations:

  • All captured datasets seem to end with alignment errors of around 3.5 mm. I wouldn't be surprised if this turns out to be the resolution of the Kinects at the given distance between camera and subject (around 2 m I think). Anyone willing to compute this number?

I think so, I found that it should be: "~ 3 mm per pixel x by y at 2 m"

  • A result of the previous point is that I don't think the ICP variants that take color into account are going to provide any benefit for us. @Silvia024 @ashutosh3308 what do the scientists think?

Need to check this. I don't have an answer now.

  • I think ICP-point2plane may be able to do better than the 3 mm we are getting now, because the "plane" will be determined using some form of interpolation, and therefore the camera resolution at the given distance should play less of a role. @Silvia024 @ashutosh3308 what do the scientists think? And more importantly: is there anything about this in the literature?

From a quick search, I see that ICP-point2point generally has worse performance than ICP-point2plane but is faster to compute. I will check the literature more thoroughly.

ireneviola commented 9 months ago

Ideas:

  • first use geometry-only ICP, and then try with color to see if it gives any improvement @jackjansen
  • try to check temporal consistency to improve alignment dynamically
  • Publication idea: testing that the alignment metric works + benchmarking a bunch of registration algorithms
  • With lots of data, see if we can bring the precision error to a minimum by optimizing over the matrices

Silvia024 commented 9 months ago

This is the path for the calibration algorithms: python/cwipc/scripts/cwipc_calibrate/calibrator.py (https://github.com/cwi-dis/cwipc_util/blob/18-alignment/python/cwipc/scripts/cwipc_calibrate/calibrator.py) on the 18-alignment branch.

jackjansen commented 9 months ago

Implemented ICP_Point2Plane, but it seems to be performing slightly worse than ICP_Point2Point.

I can see two possible reasons for this:

  1. Our analysis (the correspondence computing) is not good enough (or actually: its precision is not good enough). The analysis basically uses Point2Point distance, so I can imagine that the values after ICP_Point2Point and ICP_Point2Plane both fall within the precision of the analysis.
  2. The Point2Plane realignment estimates normals for each point, and this estimation is done using two magic values that determine how many neighbouring points are used to estimate the normals. The two magic values are distance and nNeighbour. I have no idea where the values of 0.02 and 30 that we currently use come from; they may be wildly incorrect for our use case (human bodies). (See the sketch below for where these values end up.)
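
For reference, this is roughly where those two magic values end up if the normals are estimated with Open3D (a sketch, assuming the per-camera clouds are open3d.geometry.PointCloud objects; the actual cwipc code may differ):

```python
import open3d as o3d

def estimate_normals_for_point2plane(pcd, radius=0.02, max_nn=30):
    """Estimate the normals that point-to-plane ICP will use.

    radius and max_nn are the two magic values mentioned above: neighbours are
    gathered within `radius` metres, capped at `max_nn` points. For a human body
    captured at ~2 m these defaults may well be wrong and need tuning.
    """
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=radius, max_nn=max_nn)
    )
    return pcd
```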

I think the best way forward is to first create the loot-based test data set, described above.

jackjansen commented 9 months ago

Created the "simple" loot-based dataset (i.e. loot split into 4 halves). This showed enough problems with the algorithms that it kept me busy for a day:-)

One of the problems is that the "outer algorithm", the one that selects the camera to re-align, can start oscillating: fixing camera 2 makes camera 4 worse, then fixing camera 4 makes camera 2 worse, etc. Currently the stop condition is based on the error of the worst camera (taking the weight into account); maybe the stop condition should be based (or also based?) on the overall error?

And there is a problem of ever-diminishing returns for an extra run (which showed up with loot). This has been fixed for now by stopping if the error is less than 0.001, but this number should really be based on the precision of the source point cloud.

jackjansen commented 9 months ago

And from a practical point of view: Maybe now is the time to turn cwipc_calibrate and the sandbox/registration_error_* code into a few classes, and create a new script cwipc_register that uses those, to replace cwipc_calibrate.

Initially the classes would use the current method of coarse calibration (selecting of the points), but the work on automating that could then happen in parallel.

jackjansen commented 9 months ago

Implemented part of

And from a practical point of view: Maybe now is the time to turn cwipc_calibrate and the sandbox/registration_error_* code into a few classes,

and also

Note for the future: at the end of the re-registration run we have a correspondence. We should voxelize the pointcloud to this number (or possibly some function f() of this number). The expectation is that this voxelized pointcloud will nicely have all the "synthesised cameras" for the tile overlaps.

The results are somewhat disappointing: for reasonable choices of f() (I tried factor*correspondence, for factors sqrt(0.5), 1, sqrt(2), 2) there are not many points in the "intermediate" tiles. About 10% or so when using 1*correspondence. I have to go up to a factor of 4 or so before a significant number of points get into the new "synthesised intermediate camera" tiles. Even for the boxes dataset.

I have no idea why this happens, nor a good idea of how we could investigate it.

jackjansen commented 9 months ago

Got started on playing with Aruco markers. Created a dataset vrsmall-noreg that has a capture of aruco-markers.

Looks like it's absolutely feasible to find the markers in the RGB images, but the (u,v) indices in RGB-image-space cannot be converted to (x, y, z) coordinates in PLY-space easily.

Three possible paths forward:

Also found a message from someone trying to do similar things to use: https://forum.openframeworks.cc/t/align-multiple-kinect-point-clouds-with-aruco-markers/41182

jackjansen commented 9 months ago

Interesting results also on fine-calibration.

Created a dataset vrsmall-worse after tilting three cameras in three different planes (x, y, z). The resulting point clouds are clearly off.

The fixer managed to make them all upright, but that's where the good news stops. They're still off by quite a bit, and moreover (and worse): the analysis algorithm produces way too optimistic values.

That is the first problem to fix. It is clearest from examining the graphs in vrsmall-worse/boxes and comparing the computed correspondence values to what a human can see from the plots: the values are way too optimistic.

The second thing to think about is the floor: whether we should include the floor in the capture. vrsmall-worse/boxes shows a reason to do so: it would probably have prevented the camera 4 point cloud (the green one) from being positioned "too high". But on the other hand: having a lot of "floor points" matched may lead the overall algorithm to be happy too soon ("I've matched up all these floor points, who cares about a few person points").

jackjansen commented 9 months ago

the analysis algorithm produces way too optimistic values.

Let's start by analysing this problem. Here's the initial distances histogram for vrsmall-worse/boxes:

boxes_histogram_one2all

Visually it is clear that both camera 1 (blue) and 2 (orange) have a correspondence error of approximately 0.1, but the analysis algorithm gives a value of around 0.04.

Also, camera 8 (red) seems to have precious few points in the correspondence set, making that result not very trustworthy. The same holds, to a lesser extent, for camera 4 (green).

jackjansen commented 9 months ago

Wild idea: part of the problem is that we are computing correspondence on the whole point clouds S (source, the points from the camera that we are interested in) and D (destination, all the other camera point clouds combined).

If we could instead compute the subset point clouds Soverlap and Doverlap, consisting of the points that can reasonably be expected to be in the overlap between S and D, this should help the computation of correspondence a lot: for each point in Soverlap we compute the distance to the nearest point in Doverlap, then compute the mean and stddev of those distances. Those two numbers should allow us to compute a good candidate for the correspondence (probably something like mean + stddev/2 or so, but I'll leave that to the mathematicians :-).

Question to the scientists: does this make sense so far?

If the answer is yes, then here is another wild idea on how to compute Doverlap (and the same would work for Soverlap).

nNeighbour = 4 (or maybe 3, or 8, but some reasonably small value)
for each point in D: add a field usage_count, initially 0
for each point in S:
    compute the nNeighbour nearest neighbours in D
    for each of those nNeighbour neighbours:
        increment its usage_count
remove all points from D that have usage_count == 0
remove all points from D that have usage_count > nNeighbour

Rationale: the points in D with usage_count == 0 are apparently very far away from any point in S (no point in S selected them as a close neighbour), so we can get rid of them. On the other hand, the points with usage_count > some threshold are apparently very close to S, because many points in S that do not have any "natural neighbours" in D have selected them as one of the closest. So we should get rid of these points as well, because they will skew the mean and stddev later. Using nNeighbour as some threshold is a wild guess.

Scientists: does this sound reasonable?
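
For concreteness, a NumPy/SciPy sketch of this filtering plus the mean + stddev/2 estimate (S and D are (N, 3) arrays; the thresholds are the same wild guesses as above):

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap_subset(S, D, n_neighbour=4):
    """Keep only the points of D that plausibly lie in the overlap with S."""
    _, idx = cKDTree(D).query(S, k=n_neighbour)  # n_neighbour nearest D points per S point
    usage_count = np.bincount(idx.ravel(), minlength=len(D))
    keep = (usage_count > 0) & (usage_count <= n_neighbour)
    return D[keep]

def correspondence_estimate(S_overlap, D_overlap):
    """Mean + stddev/2 of nearest-neighbour distances, as a candidate correspondence."""
    distances, _ = cKDTree(D_overlap).query(S_overlap, k=1)
    return distances.mean() + distances.std() / 2
```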

jackjansen commented 9 months ago

I guess today is a wild-idea-day:-)

Got started on playing with Aruco markers. Created a dataset vrsmall-noreg that has a capture of aruco-markers.

I was thinking about how we could create images to search in from the point clouds (as opposed to using the captured images from the "aux data", because we don't always have these, and also because, as stated above, it is difficult to convert the image (u, v) coordinates back to spatial (x, y, z) coordinates).

What if we create three planes, based on the camera coordinate system: x=0, y=0 and z=0. On each of these planes we make two distinct projections, for example for the x=0 plane we make one image where we project all points with x>=0 and ignore points with x<0, etc. (Actually: we need only 5 planes, because for z=0 one of the planes would be empty as the camera can't see behind itself).

I guess we can initialise each of the planes to something funny (green?) and hope the aruco-finding algorithm will work around that.

jackjansen commented 9 months ago

And if it turns out that the previous idea doesn't work: we could try to find the floor, and then project just the points that are "close to" the floor onto an image that is parallel to the floor.

Finding the floor could be done using open3d.segment_plane. In case there are more planes in the point cloud (for example wall or ceiling) we could find a couple of planes. There is sample code for iteratively finding planes in https://github.com/yuecideng/Multiple_Planes_Detection
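
A minimal sketch of that floor-finding step with Open3D's segment_plane (the thresholds are guesses and would need tuning for our captures):

```python
import open3d as o3d

def find_floor(pcd, distance_threshold=0.01, ransac_n=3, num_iterations=1000):
    """RANSAC-fit the dominant plane; returns (plane coefficients, floor cloud, remainder).

    plane_model is (a, b, c, d) with ax + by + cz + d = 0; for the floor we expect the
    normal (a, b, c) to point roughly along the up axis. To also find walls or ceiling,
    call this again on the remainder (as in the repo linked above).
    """
    plane_model, inliers = pcd.segment_plane(
        distance_threshold=distance_threshold,
        ransac_n=ransac_n,
        num_iterations=num_iterations,
    )
    floor = pcd.select_by_index(inliers)
    rest = pcd.select_by_index(inliers, invert=True)
    return plane_model, floor, rest
```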

jackjansen commented 9 months ago

Turning the pointclouds into images turns out to be complicated. Part of the problem is the different coordinate systems of open3d, opencv and cwipc (left-handed, right-handed, etc). Part of the problem is that we're using map-color-to-depth, so there is not a lot of resolution left in the pointcloud (basically the D-camera resolution, which is 640x576 in the vrsmall-noreg capture). This isn't good enough to detect the aruco markers, at least not when projecting to the "camera plane".

The aruco detector had no problem with the RGB images (1280x720), so maybe we should capture the coarse calibration pointclouds with map-depth-to-color?

We know that we're looking at planes anyway, when we're looking for the markers.

And, actually, this may be true also for the fine calibration: having more intermediate points on the "surface" of the subject may be a benefit that outweighs the negative effect of having "invented" points, as mentioned above in:

I just realised that we should always use map color to depth for our use case

That point still holds for production, I think, but maybe not for registration.

jackjansen commented 8 months ago

Paused working on the aruco markers for now.

Implemented a manual coarse calibration algorithm (basically the same as what we have: select the four coloured corners of the calibration target).

This works, the resulting pointcloud and the correspondence graph are in pointcloud-registration-test/vrsmall-noreg/capture.

Here is the graph:

pointcloud-0001

The correspondence errors are too optimistic, but visual inspection of the graph shows that something like 0.05 is probably the correct number.

But: these pointclouds are so big (500K points per camera) that the analysis is very expensive: it takes about a minute on my M1 Mac, and I didn't wait for it to finish on the Intel Mac. That's a shame, because for our common use case (capturing in a room where the walls are close enough to be captured by the cameras) this would actually be a very useful test capture...

jackjansen commented 8 months ago

In case we decide to go for detecting the Aruco markers in the RGB image and then finding the corresponding points in the point cloud: here are some interesting links about how the conversion of image pixel coordinates to 3D points can be done using librealsense API's:

https://github.com/IntelRealSense/librealsense/issues/8221
https://medium.com/@yasuhirachiba/converting-2d-image-coordinates-to-3d-coordinates-using-ros-intel-realsense-d435-kinect-88621e8e733a
https://github.com/IntelRealSense/librealsense/issues/11031
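
If we go down that route, the librealsense side would look roughly like the sketch below, based on those links (pyrealsense2; pixel ordering and the colour-to-depth mapping would still need checking against our capture code):

```python
import pyrealsense2 as rs

def pixel_to_point(depth_frame, u, v):
    """Convert pixel (u, v) in the depth frame to an (x, y, z) point in metres.

    u is the column, v the row. The result is in the depth camera's own coordinate
    system, so it still needs the camera's extrinsic transform to end up in cwipc
    world coordinates (and a marker detected in the RGB image must first be mapped
    from the colour frame to the depth frame).
    """
    intrinsics = depth_frame.profile.as_video_stream_profile().intrinsics
    depth = depth_frame.get_distance(u, v)
    return rs.rs2_deproject_pixel_to_point(intrinsics, [u, v], depth)
```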

jackjansen commented 8 months ago

Back to fine registration.

There is a thinking error in the idea that only looking at the correspondence between each (camera-N, all-cameras-except-N) is good enough.

This is shown clearly in a new dataset offline-boxes. See there for the full details, but here is a screenshot of the calibration it managed to come up with:

Screenshot 2024-01-01 at 23 55 34

Visually, it is clear what problem 1 is: the green and blue cameras together (or the red and brown cameras together) need to be moved to make things better. But looking at each camera in isolation is never going to work: if we were to move the green camera to better align with the red one, we would completely lose the alignment with the blue camera.

There is also a problem 2 with the analysis. Here is what our code thinks of the alignment it came up with:

captured-boxes-3_histogram_after_step_6

Those numbers (14 mm to 18 mm) are far too optimistic.

For problem 1 a potential solution is to not only look at correspondences between (camera-N, all-cameras-except-N) but also (camera-N-and-M, all-cameras-except-N-and-M) for every combination (N, M) where those two cameras are "closest to each other". If an N-and-M camera combination ends up as the candidate-to-fix we would try fixing that combined "camera" and apply the resulting transformation to both cameras.

But this feels like a hacker-jack solution: it might fix this particular problem, but I'm not really sure it is a real solution rather than a quick hack.

I have the feeling that tackling problem 2 first may be better.

Ideas, anyone?

jackjansen commented 8 months ago

Some progress with coarse calibration.

I'm manually positioning the pointcloud in the open3d viewer so that the virtual camera is approximately where the physical camera was. I then grab the RGB and D images.

I can now detect the aruco markers in the RGB image.

But this is where the good news stops: the next step is converting the u, v coordinates that the aruco detector returns (coordinates in the RGB image) back to x, y, z coordinates in the point cloud.

I think I'm doing all the right things: I'm taking the depth from D[v, u] and using the camera intrinsics to de-project, and then I'm transforming with the extrinsic matrix.

But I always end up with weird origins. I've tried transposing the extrinsic matrix, I've tried mirroring x, or mirroring y and z (which was suggested somewhere) but nothing seems to work.
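
For the record, this is the computation I believe should be happening (a sketch only; fx, fy, cx, cy are the depth intrinsics and extrinsic is assumed to be the 4x4 camera-to-world matrix, which is exactly one of the things that may be wrong here):

```python
import numpy as np

def deproject(u, v, depth, fx, fy, cx, cy, extrinsic):
    """Pinhole de-projection of pixel (u, v) with depth (in metres) into world coordinates.

    extrinsic is assumed to map camera coordinates to world coordinates; if it is actually
    the inverse it must be inverted first, and sign flips on y and/or z may be needed
    depending on the handedness convention (open3d vs opencv vs cwipc).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    p_camera = np.array([x, y, depth, 1.0])
    return (extrinsic @ p_camera)[:3]
```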

jackjansen commented 8 months ago

See https://github.com/isl-org/Open3D/issues/6508 for a decent description of the open3d coordinate system (RH y-UP) and its idiosyncrasies with respect to lookat and forward.