Long running time of M3C2 on entire point cloud compared to CloudCompare

chrise96 commented 2 years ago

With the help of this git issue I'm able to run the M3C2 algorithm with (I think) the same params used in CloudCompare. However, the py4dgeo implementation takes roughly 100 times longer than the CloudCompare implementation.

CloudCompare M3C2: ~2 seconds
py4dgeo M3C2: 271 seconds

I put all the files to recreate the experiment here.

Is the time difference caused by a param that I forgot to configure in the py4dgeo implementation? Here is the config file (default settings exported from CloudCompare): m3c2_params.txt

dokempf commented 2 years ago

Thanks for providing all the configuration and data. I will reproduce and investigate this next week.

dokempf commented 2 years ago

Hey @chrise96, thanks again for providing test data and configuration, this has been really helpful. I found a few things that worked in favor of CloudCompare in your comparison - some can be fixed, some can be documented and some will need future work in py4dgeo (remember we are in early dev):

CloudCompare's scale parameters are diameters, while py4dgeo uses radii. This led to py4dgeo operating with a 4x larger search cylinder. I am currently fixing this for the CloudCompareM3C2 class in #129, but you can also just divide your radii by 2.
Your M3C2 configuration contains SubsampleEnabled=true, which means that you are not using the input cloud as the set of core points, but a downsampled version of it that contains only a fraction of the points (in my testing with CC and your data, only 1%). With corepoints = epoch1.cloud, however, you are explicitly telling py4dgeo to use the entire point cloud. The run time of the M3C2 algorithm is linear in the number of corepoints, which makes this one particularly important. Can you double-check the number of corepoints from the CC logs?
CloudCompare has a (quite undocumented) option called UseSinglePass4Depth that, if set to false, enables a performance optimization that py4dgeo has not (yet) implemented (see #88). You might want to set that to true to better compare against py4dgeo's current state.
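
As a rough sketch of the first and third points, the snippet below reads CloudCompare-style `key=value` settings, halves the scale values so they can be passed on as radii, and reports the two flags discussed. Only `SubsampleEnabled` and `UseSinglePass4Depth` are taken from this thread; the key names `NormalScale` and `SearchScale`, the sample values, and the helper functions are assumptions for illustration and should be checked against the actual m3c2_params.txt.

```python
import math

# Assumed names of the CloudCompare keys that hold scale *diameters*.
DIAMETER_KEYS = ("NormalScale", "SearchScale")


def parse_cc_params(text):
    """Parse 'key=value' lines, skipping blanks and [section] headers."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def cc_params_to_radii(params):
    """Return the scale settings as radii (diameter / 2)."""
    return {k: float(params[k]) / 2.0 for k in DIAMETER_KEYS if k in params}


def is_true(params, key):
    """Interpret a 'true'/'false' setting; missing keys count as false."""
    return params.get(key, "false").lower() == "true"


# Made-up sample values, not the contents of the real config file.
sample = """
[General]
NormalScale=1.0
SearchScale=2.0
SubsampleEnabled=true
UseSinglePass4Depth=false
"""

params = parse_cc_params(sample)
radii = cc_params_to_radii(params)
print("radii:", radii)
print("subsampled core points:", is_true(params, "SubsampleEnabled"))
print("single pass for depth:", is_true(params, "UseSinglePass4Depth"))

# Passing a diameter where a radius is expected doubles the radius,
# which quadruples the cylinder's cross-sectional area.
d = float(params["SearchScale"])
area_ratio = (math.pi * d**2) / (math.pi * (d / 2.0) ** 2)
print("cross-section ratio:", area_ratio)
```

The halved values are what would be handed to py4dgeo as radii; how they are passed depends on the py4dgeo version in use.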

Here is a modified version of your notebook. It does the same thing, except that it splits py4dgeo's application of M3C2 into a few substeps: search tree construction, normal calculation, and distance calculation. All of these have always been performed, but they were lazily evaluated during run().
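
A minimal sketch of how such substeps can be timed separately, using only the standard library. The three step functions are hypothetical placeholders, not py4dgeo calls; in the notebook each would be replaced by the corresponding py4dgeo operation.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Hypothetical placeholders: swap each body for the real substep.
def build_search_trees():
    sum(range(10_000))


def compute_normals():
    sum(range(10_000))


def compute_distances():
    sum(range(10_000))


with timed("search tree construction"):
    build_search_trees()
with timed("normal calculation"):
    compute_normals()
with timed("distance calculation"):
    compute_distances()

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Timing the steps individually makes it possible to compare normals and distances against the corresponding CloudCompare numbers rather than a single total.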

Can you run again on your end and see how performance compares?

chrise96 commented 2 years ago

Thank you for the very detailed update!

I see now that the config .txt file I provided indeed contains SubsampleEnabled=true; this should be false... I updated the branch with this change. It now takes 19.5 seconds in CloudCompare.

I didn't know about this UseSinglePass4Depth option (In the advanced tab in M3C2 CloudCompare "Do not use multiple pass for depth").

Here is a complete screenshot of the modified notebook run (dividing the radii by 2 really speeds it up):

dokempf commented 2 years ago

I already feared it would not be as easy as the downsampling setting :disappointed:.

I am assuming you run this on Windows - correct? I made some tests between Linux and Windows on the same machine (dual boot, no virtualization) and found the results to be quite surprising:

Setup	py4dgeo Normals	CC Normals	py4dgeo Distances	CC Distances
Windows 6 Threads	13s	6s	127s	30s
Windows 1 Thread	28.4s	--	275s	--
Windows 6 Threads (Blocking)	10s	--	97s	--
Linux 6 Threads	2.7s	--	34s	--
Linux 1 Thread	14s	--	201s	--

I conclude that we have a toolchain issue on Windows that introduces a significant performance penalty. There is a multithreading-related aspect to it (Linux scales roughly optimally, Windows scales poorly), but sequential performance is also clearly affected. The Blocking variant in the table above lets OpenMP's dynamic scheduler work on chunks of 128 corepoints. My next experiments will be to vary the Windows toolchain to get a better understanding of where the problem might be.
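
To make the scaling comparison concrete, the sketch below computes speedup (1-thread time divided by 6-thread time) and parallel efficiency (speedup divided by thread count) from the py4dgeo timings in the table. The numbers are copied from the table; the function and variable names are only illustrative.

```python
# py4dgeo timings in seconds from the table: (1 thread, 6 threads).
measurements = {
    ("Windows", "normals"): (28.4, 13.0),
    ("Windows", "distances"): (275.0, 127.0),
    ("Linux", "normals"): (14.0, 2.7),
    ("Linux", "distances"): (201.0, 34.0),
}
THREADS = 6


def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial one."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, threads):
    """Fraction of ideal linear scaling that was achieved (1.0 = ideal)."""
    return speedup(t_serial, t_parallel) / threads


results = {
    key: (speedup(t1, t6), efficiency(t1, t6, THREADS))
    for key, (t1, t6) in measurements.items()
}

for (platform, step), (s, e) in results.items():
    print(f"{platform:8s} {step:10s} speedup {s:.2f}x  efficiency {e:.0%}")
```

On these numbers, Linux reaches a speedup between 5 and 6 on 6 threads, while Windows stays at a little over 2.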

chrise96 commented 2 years ago

OK, I run on macOS.

chrise96 commented 2 years ago

The M3C2 distance results in py4dgeo are very different compared to the CloudCompare results. Points in some static objects, for example a street sign in the provided point clouds, do not come close to the 0 value for the M3C2 distance. How do you choose the best configuration params for py4dgeo?

3dgeo-heidelberg / py4dgeo

Long running time of M3C2 on entire point cloud compared to CloudCompare #128