RozDavid opened 3 years ago
Hey @ToniRV,
Thanks a lot for the feedback and also for the ideas. I think it would be really nice to have this runtime operation, but you are also perfectly right that it can't come at the cost of performance.
I am sadly not a pro here, but I did some quick research on Eigen's dynamic memory allocation and found this. Source.
Here is this constructor:
inline DenseStorage(int size, int rows, int) : m_data(internal::aligned_new<T>(size)), m_rows(rows) {}
Here, the m_data member is the actual array of coefficients of the matrix. As you see, it is dynamically allocated. Rather than calling new[] or malloc(), as you can see, we have our own internal::aligned_new defined in src/Core/util/Memory.h. What it does is that if vectorization is enabled, then it uses a platform-specific call to allocate a 128-bit-aligned array, as that is very useful for vectorization with both SSE2 and AltiVec. If vectorization is disabled, it amounts to the standard new[].
I believe that if we don't change the EIGEN_DONT_ALIGN and EIGEN_MAX_ALIGN_BYTES directives (source) in the CMake build or with #defines, the memory alignment should be fine and the matrices will be vectorized automatically.
What do you think of this? Do you think that if I replace the initialization with internal::aligned_new<T>, that could bridge the performance gap?
@RozDavid Re-reading the Eigen docs, I also understood that the operations should already be vectorized... So it might well be that we can't improve on that. It's fine by me; I think this adds a lot of flexibility anyway.
Another thing we should be careful about is the map size: having a 128-bit-aligned array on a per-voxel basis may dramatically increase the size of the volumetric map. Maybe you could generate a map with and without this feature and see how many MB one is with respect to the other? The rosservice save_map should do just that.
Hello @ToniRV,
I ran a few quick tests comparing the static and dynamic approaches with different numbers of labels for the probability matrices.
For the static test I recompiled the code with kTotalNumberOfLabels=128, changed the hardcoded prior to the appropriate number, and tested the same 21 and 128 label settings with the dynamic sizes as well. I ran the full rosbag, saved the layer in a vxblx file, and copied the timings for the last pointcloud integration with the maximum number of initialized blocks. The stats are copied at the end of this comment.
It was a bit surprising to me that there is no difference in the layer sizes, only in the integration times. My interpretation is that Eigen allocates fixed-size memory up to a certain matrix size (it turns out this threshold might be larger than 128).
As all the vxblx file sizes are between 66.3 and 67.2 MB, the only differences here are the number of allocated blocks and about a 15% performance loss in pointcloud integration.
It surely depends on the use case whether the flexibility is worth the performance loss, but I wanted to share this with you either way, whether you choose to merge or not.
The results can be compared here:
########## Dynamic 21 ##########
Vxblx file size: 67.1 MB
[ INFO] [1612890061.997756189, 121.805000000]: Integrating a pointcloud with 345600 points.
[ INFO] [1612890062.089571026, 121.805000000]: Finished integrating in 0.091762 seconds, have 1207 blocks.
[ INFO] [1612890062.089766452, 121.805000000]: Timings:
SM Timing
-----------
inserting_missed_blocks 285 00.001599 (00.000006 +- 00.000001) [00.000001,00.000136]
integrate/fast 285 27.095124 (00.095071 +- 00.004953) [00.058792,00.243477]
mesh/publish 399 01.489049 (00.003732 +- 00.002256) [00.000006,00.013048]
mesh/update 399 03.108085 (00.007790 +- 00.002853) [00.000171,00.018809]
ptcloud_preprocess 285 02.596166 (00.009109 +- 00.000871) [00.008302,00.036305]
remove_distant_blocks 285 00.019437 (00.000068 +- 00.000045) [00.000004,00.000402]
[ INFO] [1612890062.089818329, 121.805000000]: Layer memory: 59385627
[ INFO] [1612890062.089849037, 121.805000000]: Updating mesh.
########## Dynamic 128 ##########
Vxblx file size: 66.3 MB
[ INFO] [1612889885.600096098, 121.805000000]: Integrating a pointcloud with 345600 points.
[ INFO] [1612889886.008610556, 121.805000000]: Finished integrating in 0.408455 seconds, have 1205 blocks.
[ INFO] [1612889886.008808061, 121.805000000]: Timings:
SM Timing
-----------
inserting_missed_blocks 83 00.001354 (00.000016 +- 00.000028) [00.000001,00.000178]
integrate/fast 83 36.546054 (00.440314 +- 00.047001) [00.274163,00.750914]
mesh/publish 92 00.422405 (00.004591 +- 00.001759) [00.000006,00.011521]
mesh/update 92 00.844049 (00.009174 +- 00.001211) [00.000178,00.014639]
ptcloud_preprocess 83 00.757355 (00.009125 +- 00.001122) [00.008222,00.022541]
remove_distant_blocks 83 00.006077 (00.000073 +- 00.000031) [00.000002,00.000203]
[ INFO] [1612889886.008843359, 121.805000000]: Layer memory: 59287225
########## Static 21 ##########
Vxblx file size: 67.2 MB
[ INFO] [1612890454.482768384, 121.805000000]: Integrating a pointcloud with 345600 points.
[ INFO] [1612890454.564294501, 121.805000000]: Finished integrating in 0.081479 seconds, have 1209 blocks.
[ INFO] [1612890454.564449636, 121.805000000]: Timings:
SM Timing
-----------
inserting_missed_blocks 290 00.001434 (00.000005 +- 00.000001) [00.000001,00.000117]
integrate/fast 290 24.979322 (00.086136 +- 00.004859) [00.053003,00.161756]
mesh/publish 493 01.521718 (00.003087 +- 00.003551) [00.000007,00.008997]
mesh/update 493 03.285606 (00.006665 +- 00.005909) [00.000138,00.017277]
ptcloud_preprocess 290 02.654694 (00.009154 +- 00.002964) [00.008272,00.034363]
remove_distant_blocks 290 00.018787 (00.000065 +- 00.000037) [00.000002,00.000652]
[ INFO] [1612890454.564478470, 121.805000000]: Layer memory: 59484029
########## Static 128 ##########
Vxblx file size: 66.5 MB
[ INFO] [1612891013.389336555, 121.805000000]: Integrating a pointcloud with 345600 points.
[ INFO] [1612891013.729743330, 121.805000000]: Finished integrating in 0.340356 seconds, have 1204 blocks.
[ INFO] [1612891013.729900746, 121.805000000]: Timings:
SM Timing
-----------
inserting_missed_blocks 97 00.001169 (00.000012 +- 00.000019) [00.000001,00.000123]
integrate/fast 97 35.617008 (00.367186 +- 00.036256) [00.230930,00.561777]
mesh/publish 106 00.547046 (00.005161 +- 00.001862) [00.000006,00.012017]
mesh/update 106 01.092757 (00.010309 +- 00.001616) [00.000182,00.018571]
ptcloud_preprocess 97 00.884389 (00.009117 +- 00.000556) [00.008329,00.018017]
remove_distant_blocks 97 00.005814 (00.000060 +- 00.000037) [00.000003,00.000304]
[ INFO] [1612891013.729936872, 121.805000000]: Layer memory: 59238024
[ INFO] [1612891013.729965223, 121.805000000]: Updating mesh.
Addressing my own issue #55: using dynamically sized semantic prior matrices, initialized at runtime, with the number of labels read from the launch file as a rosparam.
I believe the biggest concern could be performance, so here is a comparison of timings on the provided simulated semantic rosbag file.
With static labels
With dynamic labels