jackjansen opened this issue 2 years ago
Current best procedure (copied from Slack):
I think I've just done the best calibration we've ever done. Keeping note here so we can try again (in other words: @shirsub can try this procedure in vrsmall).
1. `cwipc_calibrate --kinect --nofine` with the A4 sheet on the floor.
2. `cwipc_view --kinect` with nothing in view, and ensure there are no points. If there are points, edit cameraconfig to fix that.
3. `cwipc_view --kinect` and `w` to save a pointcloud.
4. `cwipc_calibrate --kinect --reuse --nocoarse --nograb pointcloud_xxxxxxxxx.ply --corr 0.05`
5. The same again with `--corr 0.02`, which gave me a pretty good result (at least I think so).

I think we should start looking at opencv to help with calibration. For example, we could use an Aruco marker (see https://docs.opencv.org/4.x/d5/dae/tutorial_aruco_detection.html) to help with detection of (0, 0, 0) and do the coarse calibration automatically.
This also gives us camera positions, so we can then determine the camera ordering for the next step (which pairs of cameras we should align).
We could then capture a 3D object (maybe the boxes, but possibly a person is good enough). I have the feeling that we could use the histogram of point-to-point distances to determine how well the coarse calibration worked: I'm expecting a plateau on the left of the histogram, then a drop, then a very long tail. Somewhere along the drop would seem to be a good value for the `--corr` parameter.

We then use cumulative multi-scale ICP or something, and after that recompute the histogram. `corr` should now be lower.
Then I think we repeat with the other camera pairs.
Question is what to do after all pairs have been done, because possibly the last calibration has broken the first one.
Another option could be to first compute the `corr` for each pair, and use that to determine the order of doing the aligning, but I'm not sure whether "best first" or "worst first" is the best option. I think "best first" but I'm not sure.

But I do think that after we've finished calibration we should repeat the calculation of `corr`, because this will give us the optimal cell size for voxelization.
@fonskuijk I would really like your opinion on this idea...
Edit 20-Nov-2023: most of the information in this comment is now outdated. See below. Keeping it here for future historic reference.
Okay, let's use this issue to keep note of what we've done (and found out) so far.
- There are point clouds in `cwipc_test` that have been grabbed in vrbig and vrsmall: one of the stack of boxes, one of Jack facing the front camera and one of Jack facing between the front camera and the next camera.
- There is a script `sandbox/voxelize_curve.py` (now badly named) that voxelizes a point cloud with ever decreasing voxelsize. It creates a CSV file with an entry per voxelsize, showing how many points there are, how many points with 1 contributing camera, how many points with 2 contributing cameras, etc. (a rough sketch of the idea is shown after this list).
- The per-camera contributions can be inspected by running `cwipc_view` with `--filter 'colorize(1,"camera")'` on it.
- There is a spreadsheet `voxelize_curve_spreadsheet.numbers` to graph the results.
- (I had to fix `cwipc_view` and `cwipc_downsample()` to make this all work somewhat.)
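For reference, here is a minimal sketch of the voxelsize sweep, using Open3D and a placeholder filename. The real script also tracks the per-camera contribution counts, which plain Open3D downsampling cannot do, so this is only the skeleton of the idea:

```python
import csv
import numpy as np
import open3d as o3d

# Hypothetical sketch: sweep ever-decreasing voxel sizes and count surviving points.
pcd = o3d.io.read_point_cloud("capture.ply")   # placeholder filename
cellsize = 0.1
with open("voxelize_curve.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["cellsize", "npoints"])
    while cellsize > 0.001:
        downsampled = pcd.voxel_down_sample(voxel_size=cellsize)
        writer.writerow([cellsize, len(downsampled.points)])
        cellsize *= np.sqrt(0.5)   # the factor between cell sizes mentioned below
```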
The results are difficult to interpret. I was expecting to see some sort of a threshold effect between `cellsize > epsilon` and `cellsize < epsilon`, where `epsilon` is how good the current registration is (in other words: when the cellsize is so large that points from camera 1 and camera 2 would be merged, I would expect a super-linear drop in the point count).

This doesn't happen: except for very small and very large cell sizes I'm seeing a more-or-less linear decrease in the number of points (of approximately a factor of 2, as is to be expected since I'm using `sqrt(0.5)` as the factor between cell sizes).
Next I decided to look at the number of contributing cameras for each result point (after voxelizing) at each cellsize. Again, I was expecting to see something interesting at the boundary of `cellsize == epsilon`: something like a sudden increase in the number of points with two contributing cameras (at the expense of the number of points with one contributing camera).
But again no such luck: things look vaguely linear. Moreover, at small cellsizes there are already quite a few points that have three contributing cameras. This should not be possible with the boxes, I think: the alignment of the cameras and the boxes was such that each side could really only be seen by two cameras.
Either I am making a stupid mistake in my thinking, or there is another bug in the way `cwipc_downsample()` works. I will investigate the latter first.

Edit: there was indeed another stupid bug in `cwipc_downsample()` which led to sometimes too many contributing cameras being specified in a point.
Here is the correct version of the previous graph:
Came up with a wild plan (thanks @Silvia024 !).
What if we take a synthetic point cloud, and then create 4 point clouds from that (North, East, South, West), with each of the new point clouds having half the points of the original (basically the ones that can be seen from that direction).
Now we can slightly perturb each of the NESW point clouds, and recombine them into a single "4-camera" point cloud.
This we can now present to voxelize_curve.py
and see what it reports. And we know the ground truth, because we have done the perturbations.
So this should allow us to see what the magic numbers are that we are looking for.
Moreover, this scheme (or variations of it, for example introducing some noise) could also be used to do automatic testing of our point loud registration algorithms (once we have written them:-).
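A rough sketch of what such a generator could look like (the half-space split around the centroid, the offsets and the filenames are all assumptions, not the actual cwipc_test code):

```python
import numpy as np
import open3d as o3d

# Split a synthetic cloud into four "camera" views (N, E, S, W) and perturb two of them.
pcd = o3d.io.read_point_cloud("synthetic.ply")          # placeholder input
pts = np.asarray(pcd.points)
centered = pts - pts.mean(axis=0)

directions = {1: (0, 0, 1), 2: (1, 0, 0), 3: (0, 0, -1), 4: (-1, 0, 0)}  # N, E, S, W
offsets = {1: 0.0, 2: 0.003, 3: 0.0, 4: 0.003}          # ground-truth error per "camera"

tiles = []
for cam, direction in directions.items():
    d = np.asarray(direction, dtype=float)
    mask = centered @ d >= 0                            # roughly half the points
    tiles.append(pts[mask] + offsets[cam] * d)          # slightly perturb this "camera"

combined = o3d.geometry.PointCloud()
combined.points = o3d.utility.Vector3dVector(np.concatenate(tiles))
o3d.io.write_point_cloud("generated_4cam.ply", combined)
```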
That was a bit of a disappointment, so far. Or actually maybe it just showed our erroneous thinking:-)
Implemented the generation of registration-test pointclouds, with a settable offset for cameras 2 and 4 (1 and 3 will be correct).
Created pointclouds with an error of 0.03 and 0.003.
The results are in https://github.com/cwi-dis/cwipc_test/tree/master/pointcloud-registration-test but the quick message is that the curves still look pretty linear. There is a clear "minimum voxelsize" where points from multiple cameras start to be combined, but this "minimum voxelsize" is quite a bit smaller than the pre-determined error with which the point clouds were generated.
I could say something (based on the data for these pointclouds) like: "As soon as more than 15% of the points come from two cameras our voxelsize is larger than the error" but this would just be data-fitting.
@Silvia024 can you have a look at the data?
Another wild idea. Comments please.
For each voxelsize we voxelize the whole pointcloud, just as we do now. But we also take the 4 per-camera pointclouds and voxelize those individually. We add up the point counts of those clouds.

We now compare the two numbers. This comparison is a measure of how much camera-combining has happened.
Edit 20-Nov-2023: We decided not to go down this path, because we dropped the idea of using voxelization to compute alignment quality.
And yet another wild idea. We are now doing this on the whole pointcloud; maybe that is a bad idea because it confuses our numbers.
What if we construct the 6 pairwise point clouds (cam1+2, cam1+3, etc) and run this analysis separately on each pairwise pointcloud?
Edit: this did not work. I experimented with it by creating the generated point clouds (0.03 and 0.003) with only two cameras (dropping the other two). The curves are just as meaningless as for the full 4-camera cloud.
I am pretty sure that the same is true for the previous wild idea.
Maybe tomorrow we can have a better chat about this?

By the way, I just found this: https://github.com/IntelRealSense/librealsense/issues/10795 and it seems quite relevant and might be of interest. I will try to read it tonight.
I found two interesting papers...
- https://www.sciencedirect.com/science/article/abs/pii/S0143816618308273 : here I found it interesting that they use a cube with chessboard patterns on four sides as the calibration object, so they know all the measurements (e.g., the distance between the edges of the chessboard pattern). They put the cube such that each camera captures mainly one side of it. What is interesting to me is that, to verify the reconstructed point clouds, they check whether two adjacent sides of the cube are orthogonal;
- https://dl.acm.org/doi/abs/10.1145/3570361.3592530?casa_token=sNaOGgeRC5IAAAAA:4_WVQOtqLz9sVcqmvRoBOMKKBsBPZX6vtJjFrNd553VoJes5l49chIb8ytQejYnJUJWjCtdPMcKW : this one is very recent (just published at MobiCom!) and I think they are doing what we would also like to have in our system: "We propose Dynamic Camera Calibration (DCC) in MetaStream to address this limitation. The DCC module dynamically obtains and manages the camera pose in real-time and calibrates cameras without any pre-defined markers".
Updated two previous comments (the two wild ideas): they turned out not to work. I have also experimented with changing the "next cellsize factor" from `sqrt(0.5)` to `sqrt(sqrt(0.5))` but there is no appreciable bump around the known epsilon with which the point clouds are created.

I now think that the whole idea of using voxelization at successively smaller sizes and looking for a discontinuity simply doesn't work.

To come up with a theory: voxelization is essentially a stochastic procedure, because it depends on the origin of the grid, and the individual original points have no relationship to that grid. So whether or not two points are combined when they are less than `cellsize` apart is a matter of chance: if `cellsize > 2*distance` they are definitely combined, if `cellsize < distance/sqrt(3)` they are definitely not, but the area in between is grey. @Silvia024 does this sound reasonable?

@Silvia024 @troeggla please up-vote this comment if you agree that voxelization is a dead end. Then I can remove all the generated data and spreadsheets over in `cwipc_test/pointcloud-registration-test` (which is quite a lot).
My suggestion for the next try is to compute, for each point in `cam1cloud`, the minimum distance to any point in `cam2cloud`. First thing to look at is the histogram of these distances (April 9 comment). I hope we would see a very large peak just below `epsilon`, with a very long tail (all the cam1 points that have no corresponding cam2 point).
I think the KD-Tree is the data structure we want. And it is supported in SciPy; example code is in https://medium.com/@OttoYu/exploring-kd-tree-in-point-cloud-c9c767095923
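A minimal sketch of that suggestion, assuming `cam1cloud` and `cam2cloud` are Nx3 NumPy arrays of point coordinates (the names and filenames are illustrative), using SciPy's KD-tree:

```python
import numpy as np
import open3d as o3d
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt

# Placeholder inputs: one camera's tile and another camera's tile.
cam1cloud = np.asarray(o3d.io.read_point_cloud("cam1.ply").points)
cam2cloud = np.asarray(o3d.io.read_point_cloud("cam2.ply").points)

# Per-point minimum distance from cam1cloud to cam2cloud.
distances, _ = cKDTree(cam2cloud).query(cam1cloud, k=1)

plt.hist(distances, bins=200)
plt.xlabel("distance to nearest point in cam2cloud (m)")
plt.ylabel("number of cam1cloud points")
plt.show()
```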
The idea of using a KDTree and plotting a histogram of the minimum distances looks very promising.

There's a Jupyter notebook in cwipc_test; I'm going to convert it to a script.
To come up with a theory: voxelization is essentially a stochastic procedure [...] @Silvia024 does this sound reasonable?
The theory seems correct to me :) I agree that maybe it's not a good strategy.
I'm getting beautiful graphs, plotting (per camera pair) the cumulative distance-to-other-cloud.
All the graphs are now in cwipc_test. They look mostly usable. I'm going to change the Y axis from being counts to being fractions, that should make them even easier to interpret (I hope).
For reference in further discussion below: here is the current state of the graph for the boxes in vrsmall (with one camera about 1cm off):
And here is one of the Jack point clouds (the best one, the sideways one):
And here is the one for the generated point cloud, with a 0.03 m offset of two of the cameras:
First thing I would like to do is convert each of these lines into a number, `correspondence`. This number should be an upper bound for the registration error between the two cameras in that pair, for the given capture. If we are pretty sure the cameras have no significant overlap in field of view (pairs `(0, 3)` and `(1, 2)` in the graphs above) we return an invalid value (`NaN` or `None` or something like that).
Once we have this number we can do two things with it:

1. Use it as the `--corr` parameter to the various registration algorithms we have.
2. After running a registration algorithm, recompute the `correspondence` number and ascertain it is lower than the previous value.

Idea: we should look at the derivative of these lines (which is actually the distance histogram data) and the second derivative, possibly after smoothing the lines a little. The second derivative should start positive, then go negative, then go positive again. Maybe that last point is the number we're interested in.
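A small sketch of that derivative idea, assuming `distances` holds the per-point nearest-neighbour distances from the KD-tree step above (the bin count, smoothing width and the exact stopping rule are guesses):

```python
import numpy as np

counts, edges = np.histogram(distances, bins=200)
smoothed = np.convolve(counts, np.ones(5) / 5, mode="same")  # light smoothing
slope = np.gradient(smoothed)      # this is the second derivative of the cumulative curve
peak = int(np.argmax(smoothed))    # slope is positive before the peak, negative after it
# first bin after the peak where the slope turns non-negative again
candidates = np.where(slope[peak + 1:] >= 0)[0]
correspondence = edges[peak + 1 + candidates[0]] if len(candidates) else None
print("estimated correspondence:", correspondence)
```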
Unrelated idea (@Silvia024 comments please): we are currently looking at the symmetric combined difference between camera pairs. Would it be better to do a per-camera comparison to all other cameras combined? I.e. not compute and graph the distance per camera pair, but instead look at the distance of the points of camera `N` to the points of all other cameras together?

That should extend the "tail" of the histogram, probably making it more linear-ish. Also, the resulting `correspondence` number should be more useful, because it's a number that pertains to a single camera (instead of to a pair of cameras).
Would it be better to do a per-camera comparison to all other cameras combined?
Implemented, and it seems to produce much better results. So much better that we need to check that we are not looking at bullshit data.
Here are the graphs (for single-camera-to-all-others) that are for the same dataset as the pair graphs of yesterday:
And here are the histogram graphs of the same datasets:
The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41
@jackjansen, I can't access to this link. It says page not found. Maybe do I not have the right to access it?
That link is dead (the gitlab site disappeared when we stopped paying for it).
Getting started with a guess at the correspondence (when the histogram is "over the hill", and has descended about half way down). Looks good for the generated point clouds, and usable for the captured ones (in the right ballpark).
I just realised that the correspondence error alone is not good enough to determine in which order we should attempt to re-align the cameras. I previously thought it would be enough: just re-align the cameras starting with the worst one.
But I now think we also want to know how many points had a distance below the correspondence error: if camera 1 has `corr=0.01, n_corr_points=20` and camera 2 has `corr=0.005, n_corr_points=20000` it's probably a much better idea to start with camera 2.
Some points from our meeting this morning.
Yeah! Implemented the first registration algorithm (ICP point-to-point) and it's already a great success!
Here are the graphs for the worst dataset (Jack frontal in vrsmall):
Algorithm used:

```
find worst-aligned camera and its correspondence
repeat {
    re-align that camera using ICP-point2point
    re-create the pointcloud with the new alignment for that camera
    find worst-aligned camera and its correspondence
} until (no improvement since last step)
```
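For reference, a single re-alignment step could look roughly like this with Open3D's point-to-point ICP (the threshold, file names and surrounding loop are assumptions; the actual implementation lives in cwipc_util):

```python
import numpy as np
import open3d as o3d

# One re-alignment step: move the worst-aligned camera towards all the others.
source = o3d.io.read_point_cloud("worst_camera.ply")        # placeholder inputs
target = o3d.io.read_point_cloud("all_other_cameras.ply")
threshold = 0.05                                            # the correspondence from the analysis
result = o3d.pipelines.registration.registration_icp(
    source, target, threshold, np.identity(4),
    o3d.pipelines.registration.TransformationEstimationPointToPoint())
source.transform(result.transformation)   # this transformation updates the camera's extrinsics
```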
Observations:
- All captured datasets seem to end with alignment errors of around 3.5 mm. I wouldn't be surprised if this turns out to be the resolution of the Kinects at the given distance between camera and subject (around 2 m I think). Anyone willing to compute this number?
- I just realised that we should always use map color to depth for our use case (inwards looking at people), and never map depth to color. Because the latter will invent depth values through interpolation, which is not only harmful at the boundaries of the person (interpolation to the depth of the wall behind), but it also really won't help anything: it will really only work for predominantly flat surfaces with a lot of details in RGB (because of the higher resolution of the RGB camera).
- A result of the previous point is that I don't think the ICP variants that take color into account are going to provide any benefit for us. @Silvia024 @ashutosh3308 what do the scientists think?
- I think `ICP-point2plane` may be able to do better than the 3 mm we are getting now, because the "plane" will be determined using some form of interpolation, and therefore the camera resolution at the given distance should play less of a role. @Silvia024 @ashutosh3308 what do the scientists think? And more importantly: is there anything about this in the literature?
- "Worst aligned camera" is currently determined by multiplying the measured mis-alignment by the number of points in the alignment set. As you can see in the bottom graph above this is a bit too aggressive: camera 2 has a mis-alignment that is worse than camera 1 but it wasn't selected. We should try `sqrt(nAlignmentPoints)` or `log(nAlignmentPoints)` as the factor to see whether it makes a difference.

Yet another wild idea, this time on the "ground truth" dataset we use (both for testing and for the eventual writing of the paper).
If we start with one of the `loot` pointclouds and split it into two tiles by randomly assigning each of the points to a tile, we know exactly what the measured correspondence should be: it is the cell size of the original `1024*1024*1024` ply-file that we created the pointcloud from (the 8i files use integer coordinates, so we know the cellsize exactly).
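A tiny sketch of that split (the filename is a placeholder for one of the 8i loot frames):

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("loot.ply")            # placeholder 8i frame
pts = np.asarray(pcd.points)
mask = np.random.rand(len(pts)) < 0.5                # randomly assign each point to a tile
tile1, tile2 = pts[mask], pts[~mask]
# With integer source coordinates, the measured correspondence between the two
# tiles should come out at roughly the original grid cell size.
```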
"worst aligned camera" is currently determined by multiplying measured mis-alignment by the number of points in the alignment set. As you can see in the bottom graph above this is a bit too aggressive: camera 2 has a mis-alignment that is worse than camera 1 but it wasn't selected. We should try sqrt(nAlignmentPoints) or log(nAlignmentPoints) as the factor to see whether it makes a difference.
That seems to be a good idea, with `log()` working better than `sqrt()`. Here are the results for jack-sideways, first using `nAlignmentPoints` in the weight, then using `log(nAlignmentPoints)`. We not only did an extra step in the latter case, but also aligned different cameras at the previous steps. You can see that the latter got to a better result eventually.
But: we may be able to get even better results through completely different means. For example: if our previous algorithm didn't get any more improvement (the end condition of the loop above) we just re-align all cameras once, in no particular order. And if that resulted in any improvement we continue with our loop.
Note for the future: at the end of the re-registration run we have a correspondence. We should voxelize the pointcloud to this number (or possibly some function `f()` of this number). The expectation is that this voxelized pointcloud will nicely have all the "synthesised cameras" for the tile overlaps. We can of course visually inspect this resulting point cloud, but we can probably also compute how well the whole process worked by running the registration analyser on it.

This pointcloud will have 15 virtual cameras (with 4 real cameras the per-point set of contributing cameras can take 2^4 - 1 = 15 values), and the correspondence between those cameras should be very bad. Actually, the number to look at now is not so much the correspondence error but the number of points in the correspondence set: we are expecting only the points that are "on the edge" of the tile to have a low distance to the other tile clouds.
But: we may be able to get even better results through completely different means. For example: if our previous algorithm didn't get any more improvement (the end condition of the loop above) we just re-align all cameras once, in no particular order. And if that resulted in any improvement we continue with our loop.
Implemented this (looping over the other cameras after the worst camera doesn't show any more improvement), and it turns out not to be a good idea.
In essence the only thing it does is to move some "badness" from one camera to another. So, each time through the loop the camera that's being re-aligned will improve a bit, but other cameras will get a bit worse. I'm not sure I understand why this happens...
And there's a problem about determining when to terminate.
Observations:
- All captured datasets seem to end with alignment errors of around 3.5 mm. I wouldn't be surprised if this turns out to be the resolution of the Kinects at the given distance between camera and subject (around 2 m I think). Anyone willing to compute this number?
I think so, I found that it should be: "~ 3 mm per pixel x by y at 2 m"
A result of the previous point is that I don't think the ICP variants that take color into account are going to provide any benefit for us. @Silvia024 @ashutosh3308 what do the scientists think?

I need to check this; I don't have an answer right now.
I think `ICP-point2plane` may be able to do better than the 3 mm we are getting now, because the "plane" will be determined using some form of interpolation, and therefore the camera resolution at the given distance should play less of a role. @Silvia024 @ashutosh3308 what do the scientists think? And more importantly: is there anything about this in the literature?
From a quick look at the literature, I see that `ICP-point2point` generally has worse performance than `ICP-point2plane`, but is faster to compute.

I will check the literature more thoroughly.
Ideas:

- First use geometry-only ICP, and then try with color to see if it gives any improvement @jackjansen
- Try to check temporal consistency to improve alignment dynamically
- Publication idea: testing that the alignment metric works + benchmarking a bunch of registration algorithms
- With lots of data, see if we can bring the precision error to a minimum by optimizing over the matrices
This is the path for the calibration algorithms: `python/cwipc/scripts/cwipc_calibrate/calibrator.py` (https://github.com/cwi-dis/cwipc_util/blob/18-alignment/python/cwipc/scripts/cwipc_calibrate/calibrator.py, on the 18-alignment branch).
Implemented ICP_Point2Plane, but it seems to be performing slightly worse than ICP_Point2Point.
I can see two possible reasons for this, one of them being the choice of the parameters `distance` and `nNeighbour`: I have no idea where the values of `0.02` and `30` that we currently use come from. They may be wildly incorrect for our use case (human bodies).

I think the best way forward is to first create the loot-based test data set, described above.
Created the "simple" loot-based dataset (i.e. loot split into 4 halves). This showed enough problems with the algorithms that it kept me busy for a day:-)
One of the problems is that the "outer algorithm", the one that selects the camera to re-align, can start oscillating: fixing camera 2 makes camera 4 worse, then fixing camera 4 makes camera 2 worse, etc. Currently the stop condition is based on the error of the worst camera (taking the weight into account); maybe the stop condition should be based (or also based?) on the overall error?
And there is a problem of ever-diminishing returns for an extra run (which showed up with loot). This has been fixed for now by stopping if the error is less than `0.001`, but this number should really be based on the precision of the source point cloud.
And from a practical point of view: maybe now is the time to turn `cwipc_calibrate` and the `sandbox/registration_error_*` code into a few classes, and create a new script `cwipc_register` that uses those, to replace `cwipc_calibrate`.

Initially the classes would use the current method of coarse calibration (selecting the points), but the work on automating that could then happen in parallel.
Implemented part of
And from a practical point of view: maybe now is the time to turn cwipc_calibrate and the sandbox/registration_error_* code into a few classes,
and also
Note for the future: at the end of the re-registration run we have a correspondence. We should voxelize the pointcloud to this number (or possibly some function f() of this number). The expectation is that this voxelized pointcloud will nicely have all the "synthesised cameras" for the tile overlaps.
The results are somewhat disappointing: for reasonable choices of `f()` (I tried `factor*correspondence`, for factors `sqrt(0.5)`, `1`, `sqrt(2)`, `2`) there are not many points in the "intermediate" tiles. About 10% or so when using `1*correspondence`. I have to go up to a factor of 4 or so before a significant number of points get into the new "synthesised intermediate camera" tiles. Even for the boxes dataset.
I have no idea why this happens, nor a good idea of how we could investigate it.
Got started on playing with Aruco markers. Created a dataset `vrsmall-noreg` that has a capture of aruco markers.

Looks like it's absolutely feasible to find the markers in the RGB images, but the `(u, v)` indices in RGB-image-space cannot easily be converted to `(x, y, z)` coordinates in PLY-space.
Three possible paths forward:
Also found a message from someone trying to do similar things to use: https://forum.openframeworks.cc/t/align-multiple-kinect-point-clouds-with-aruco-markers/41182
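For the RGB-side detection mentioned above, a minimal sketch with OpenCV could look like this (this uses the `cv2.aruco.ArucoDetector` API of OpenCV 4.7+, older versions use `cv2.aruco.detectMarkers` directly, and the dictionary choice and filename are guesses):

```python
import cv2

# Detect ArUco markers in a captured RGB image (filename is a placeholder).
image = cv2.imread("rgb_capture.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
corners, ids, rejected = detector.detectMarkers(gray)
# corners are (u, v) pixel coordinates; mapping these back to (x, y, z)
# in the point cloud is the hard part discussed above.
print(ids, corners)
```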
Interesting results also on fine-calibration.
Created a dataset `vrsmall-worse` after tilting three cameras in three different planes (x, y, z). The resulting point clouds are clearly off.

The fixer managed to make them all upright, but that's where the good news stops. They're still off by quite a bit, and moreover (and worse): the analysis algorithm produces way too optimistic values.

That is the first problem to fix. It is clearest from examining the graphs in `vrsmall-worse/boxes` and comparing the computed correspondence values to what a human can see from the plots: the values are way too optimistic.

The second thing to think about is the floor: whether we should include the floor in the capture. vrsmall-worse/boxes shows a reason to do so: it would probably have prevented the camera 4 point cloud (the green one) from being positioned "too high". But on the other hand: having a lot of "floor points" matched may lead the overall algorithm to be happy too soon ("I've matched up all these floor points, who cares about a few person points").
the analysis algorithm produces way too optimistic values.
Let's start by analysing this problem. Here's the initial distances histogram for `vrsmall-worse/boxes`:

Visually it is clear that both camera 1 (blue) and camera 2 (orange) have a correspondence error of approximately 0.1, but the analysis algorithm gives a value of around 0.04.

Also, camera 8 (red) seems to have precious few points in the correspondence set, making that result not very trustworthy. The same is true, to a lesser extent, for camera 4 (green).
Wild idea: part of the problem is that we are computing correspondence on the whole point clouds S (source, the points from the camera that we are interested in) and D (destination, all the other camera point clouds combined).

If we could instead compute the subset point clouds Soverlap and Doverlap, consisting of the points that could reasonably be expected to be in the overlap between S and D, this should help the computation of correspondence a lot: for each point in Soverlap we compute the distance to the nearest point in Doverlap, then compute mean and stddev of those distances. Those two numbers should allow us to compute a good candidate for the correspondence (probably something like `mean + stddev/2` or so, but I'll leave that to the mathematicians:-).

Question to the scientists: does this make sense so far?
If the answer is yes, then here is another wild idea on how to compute Doverlap (and the same would work for Soverlap).
```
nNeighbour = 4   # or maybe 3, or 8, but some reasonably small value
for each point in D: add a field usage_count (initially 0)
for each point in S:
    compute the nNeighbour nearest neighbours in D
    for each of those nNeighbour neighbours:
        increment its usage_count
remove all points from D that have usage_count == 0
remove all points from D that have usage_count > nNeighbour
```
Rationale: the points in `D` with `usage_count == 0` are apparently very far away from any point in `S` (no point in `S` selected them as a close neighbour), so we can get rid of them. On the other hand, the points with `usage_count > some threshold` are apparently very close to `S`, because many points in `S` that do not have any "natural neighbours" in `D` have selected them as one of the closest. So we should get rid of these points as well, because they will skew the mean and stddev later. Using `nNeighbour` as `some threshold` is a wild guess.
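A rough sketch of the pseudocode above with SciPy's KD-tree, assuming `S` and `D` are Nx3 NumPy arrays (the function name is illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap_subset(S, D, nNeighbour=4):
    # For each point in S, find its nNeighbour nearest neighbours in D.
    _, idx = cKDTree(D).query(S, k=nNeighbour)
    # Count how often each point of D was selected.
    usage_count = np.bincount(idx.ravel(), minlength=len(D))
    # Drop D points that were never selected, or selected "too often".
    keep = (usage_count > 0) & (usage_count <= nNeighbour)
    return D[keep]
```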
Scientists: does this sound reasonable?
I guess today is a wild-idea-day:-)
Got started on playing with Aruco markers. Created a dataset vrsmall-noreg that has a capture of aruco-markers.
I was thinking about how we could create images to search from the point clouds (as opposed to using the captured images from the "aux data", because we don't always have these, and also because, as stated above, it is difficult to convert the image `(u, v)` coordinates back to spatial `(x, y, z)` coordinates).

What if we create three planes, based on the camera coordinate system: `x=0`, `y=0` and `z=0`. On each of these planes we make two distinct projections, for example for the `x=0` plane we make one image where we project all points with `x>=0` and ignore points with `x<0`, etc. (Actually we need only 5 projections, because for the `z=0` plane one of the two would be empty, as the camera can't see behind itself.)

I guess we can initialise each of the planes to something funny (green?) and hope the aruco-finding algorithm will work around that.
And if it turns out that the previous idea doesn't work: we could try to find the floor, and then project just the points that are "close to" the floor onto an image that is parallel to the floor.
Finding the floor could be done using `open3d.segment_plane`. In case there are more planes in the point cloud (for example a wall or ceiling) we could find a couple of planes. There is sample code for iteratively finding planes in https://github.com/yuecideng/Multiple_Planes_Detection
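A short sketch of that, using Open3D's RANSAC plane segmentation (the thresholds are untuned guesses, the filename a placeholder):

```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("capture.ply")     # placeholder
plane_model, inlier_idx = pcd.segment_plane(
    distance_threshold=0.01, ransac_n=3, num_iterations=1000)
a, b, c, d = plane_model                         # floor plane: ax + by + cz + d = 0
floor = pcd.select_by_index(inlier_idx)
rest = pcd.select_by_index(inlier_idx, invert=True)
```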
Turning the pointclouds into images turns out to be complicated. Part of the problem is the different coordinate systems of open3d, opencv and cwipc (left-handed, right-handed, etc). Part of the problem is that we're using map-color-to-depth, so there is not a lot of resolution left in the pointcloud (basically the D-camera resolution, which is 640x576 in the `vrsmall-noreg` capture). This isn't good enough to detect the aruco markers, at least not when projecting to the "camera plane".

The aruco-detector had no problem with the RGB images (1280x720), so maybe we should capture the coarse calibration pointclouds with map-depth-to-color?
We know that we're looking at planes anyway, when we're looking for the markers.
And, actually, this may be true also for the fine calibration: having more intermediate points on the "surface" of the subject may be a benefit that outweighs the negative effect of having "invented" points, as mentioned above in:
I just realised that we should always use map color to depth for our use case
That point still holds for production, I think, but maybe not for registration.
Paused working on the aruco markers for now.
Implemented a manual coarse calibration algorithm (basically the same as what we have: select the four coloured corners of the calibration target).
This works; the resulting pointcloud and the correspondence graph are in `pointcloud-registration-test/vrsmall-noreg/capture`.

Here is the graph:

The correspondence errors are too optimistic, but visual inspection of the graph shows that something like 0.05 is probably the correct number.

But: these pointclouds are so big (500K points per camera) that the analysis is very expensive: it takes about a minute on my M1 Mac, and I didn't wait for it on the Intel Mac. That's a shame, because for our common use case (capturing in a room where the walls are close enough to be captured by the cameras) this would actually be a very useful test capture...
In case we decide to go for detecting the Aruco markers in the RGB image and then finding the corresponding points in the point cloud: here are some interesting links about how the conversion of image pixel coordinates to 3D points can be done using the librealsense APIs:
https://github.com/IntelRealSense/librealsense/issues/8221 https://medium.com/@yasuhirachiba/converting-2d-image-coordinates-to-3d-coordinates-using-ros-intel-realsense-d435-kinect-88621e8e733a https://github.com/IntelRealSense/librealsense/issues/11031
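For reference, the deprojection route those links describe looks roughly like this with pyrealsense2 (so it is RealSense-specific; it assumes the depth frame is aligned to the colour frame, so that a marker pixel `(u, v)` indexes both, and the pixel values are illustrative):

```python
import pyrealsense2 as rs

pipeline = rs.pipeline()
pipeline.start()
align = rs.align(rs.stream.color)                 # align depth to the colour frame
frames = align.process(pipeline.wait_for_frames())
depth_frame = frames.get_depth_frame()
intrinsics = depth_frame.profile.as_video_stream_profile().intrinsics

u, v = 640, 360                                   # e.g. a detected marker corner
depth = depth_frame.get_distance(u, v)            # metres
x, y, z = rs.rs2_deproject_pixel_to_point(intrinsics, [u, v], depth)
pipeline.stop()
```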
Back to fine registration.
There is a thinking error in the idea that only looking at the correspondence between each `(camera-N, all-cameras-except-N)` pair is good enough.
This is shown clearly in a new dataset `offline-boxes`. See there for the full details, but here is a screenshot of the calibration it managed to come up with:
Visually, it is clear what problem 1 is: the green and blue camera together (or the red and brown camera together) need to be moved to make things better. But looking at each camera in isolation is never going to work: if we would move the green camera to better align with the red one, we would completely lose the alignment with the blue camera.
There is also a problem 2 with the analysis. Here is what our code thinks of the alignment it came up with:
Those numbers (14 mm to 18 mm) are far too optimistic.
For problem 1 a potential solution is to not only look at correspondences between `(camera-N, all-cameras-except-N)` but also `(camera-N-and-M, all-cameras-except-N-and-M)` for every combination `(N, M)` where those two cameras are "closest to each other". If an `N-and-M` camera combination would end up as the candidate-to-fix, we would try fixing that "camera" and apply the resulting transformation to both cameras.

But this feels like a hacker-jack solution: it seems like it might fix this problem, but I'm not really sure it's really a solution and not a quick hack.
I have the feeling that tackling problem 2 first may be better.
Ideas, anyone?
Some progress with coarse calibration.
I'm manually positioning the pointcloud in the open3d viewer so that the virtual camera is approximately where the physical camera was. I then grab the RGB and D images.
I can now detect the aruco markers in the RGB image.
But this is where the good news stops: the next step is converting the `(u, v)` coordinates that the aruco detector returns (coordinates in the RGB image) back to `(x, y, z)` coordinates in the point cloud.

I think I'm doing all the right things: I'm taking the depth from `D[v, u]`, using the camera intrinsics to de-project, and then transforming with the extrinsic matrix.

But I always end up with weird origins. I've tried transposing the extrinsic matrix, I've tried mirroring `x`, or mirroring `y` and `z` (which was suggested somewhere), but nothing seems to work.
See https://github.com/isl-org/Open3D/issues/6508 for a decent description of the open3d coordinate system (RH y-UP) and its idiosyncrasies with respect to lookat and forward.
We need to fix the alignment once and for all.
The original issue is https://gitlab.com/VRTogether_EU/cwipc/cwipc_util/-/issues/41 but this link is dead now. There is an old repo for experiments at https://github.com/cwi-dis/pointclouds-alignment
New experiment data is at https://github.com/cwi-dis/cwipc_test/tree/master/pointcloud-registration-test
Current plan of attack:
- Save the resulting registration in `cameraconfig.json`. Also record there the current misalignment, because it is a good value for voxelization (later, during production).