PointCloudLibrary / pcl

Point Cloud Library (PCL)
https://pointclouds.org/
Other
9.92k stars 4.61k forks source link

[custom] How can I debug crashes in which deallocation of point clouds uses the wrong point type? #5579

Open themightyoarfish opened 1 year ago

themightyoarfish commented 1 year ago

I often have this issue – not reproducible minimally – that my PCL client code crashes with some sort of

pointer being freed was not allocated

message, that usually goes away when I #define PCL_NO_PRECOMPILE at the top of my offending source file. With clang's address sanitizer, I get interesting reports that seem to indicate that a point cloud created as one point type gets deallocated as a different one. For instance check out this backtrace:

==47757==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x0001270383f8 at pc 0x000100af7ad8 bp 0x00016f4acd30 sp 0x00016f4acd28
READ of size 8 at 0x0001270383f8 thread T0
    #0 0x100af7ad4 in Eigen::internal::handmade_aligned_free(void*) Memory.h:118
    #1 0x100af7a5c in Eigen::internal::aligned_free(void*) Memory.h:206
    #2 0x100e01ce8 in Eigen::aligned_allocator<pcl::PointXYZRGB>::deallocate(pcl::PointXYZRGB*, unsigned long) Memory.h:897
    #3 0x100e01b18 in std::__1::allocator_traits<Eigen::aligned_allocator<pcl::PointXYZRGB> >::deallocate(Eigen::aligned_allocator<pcl::PointXYZRGB>&, pcl::PointXYZRGB*, unsigned long)
allocator_traits.h:282
    #4 0x100e01804 in std::__1::__vector_base<pcl::PointXYZRGB, Eigen::aligned_allocator<pcl::PointXYZRGB> >::~__vector_base() vector:488
    #5 0x100e015c4 in std::__1::vector<pcl::PointXYZRGB, Eigen::aligned_allocator<pcl::PointXYZRGB> >::~vector() vector:579
    #6 0x100e00f70 in std::__1::vector<pcl::PointXYZRGB, Eigen::aligned_allocator<pcl::PointXYZRGB> >::~vector() vector:574
    #7 0x100e03294 in pcl::PointCloud<pcl::PointXYZRGB>::~PointCloud() point_cloud.h:172
    #8 0x100e03254 in pcl::PointCloud<pcl::PointXYZRGB>::~PointCloud() point_cloud.h:172
    #9 0x103cc6510 in pcl::Filter<pcl::PointXYZRGB>::filter(pcl::PointCloud<pcl::PointXYZRGB>&) filter.h:134
    #10 0x103c6e6a8 in myfunction(mydata*) myfile.cpp:122
…
    #14 0x114051088 in start+0x204 (dyld:arm64e+0x5088)

0x0001270383f8 is located 8 bytes to the left of 52256-byte region [0x000127038400,0x000127045020)
allocated by thread T0 here:
    #0 0x120656ca8 in wrap_malloc+0x94 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x3eca8)
    #1 0x1169024d4 in std::__1::__split_buffer<pcl::PointXYZL, Eigen::aligned_allocator<pcl::PointXYZL>&>::__split_buffer(unsigned long, unsigned long, Eigen::aligned_allocator<pcl::Poi
ntXYZL>&)+0x30 (libpcl_filters.1.13.0.99.dylib:arm64+0x464d4)
    #2 0x1169037e4 in std::__1::vector<pcl::PointXYZRGB, Eigen::aligned_allocator<pcl::PointXYZRGB> >::__append(unsigned long)+0x68 (libpcl_filters.1.13.0.99.dylib:arm64+0x477e4)
    #3 0x116a89e94 in pcl::VoxelGrid<pcl::PointXYZRGB>::applyFilter(pcl::PointCloud<pcl::PointXYZRGB>&)+0xaf0 (libpcl_filters.1.13.0.99.dylib:arm64+0x1cde94)
    #4 0x103cc63a4 in pcl::Filter<pcl::PointXYZRGB>::filter(pcl::PointCloud<pcl::PointXYZRGB>&) filter.h:129
    #5 0x103c6e6a8 in myfunction(mydata*) myfile.cpp:122

SUMMARY: AddressSanitizer: heap-buffer-overflow Memory.h:118 in Eigen::internal::handmade_aligned_free(void*)
Shadow bytes around the buggy address:
  0x007024e27020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007024e27030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007024e27040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007024e27050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007024e27060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x007024e27070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
  0x007024e27080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007024e27090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007024e270a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007024e270b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007024e270c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==47757==ABORTING
Abort trap: 6

So why is the deallocator called on the correct PointXYZRGB points, but the filter class somehow creates data of type PointXYZL? The presence or absence of -DPCL_ONLY_CORE_POINT_TYPES=ON and -DNO_EXPLICIT_INSTANTIATIONS makes no difference by the way.

What can I do to debug this?

mvieth commented 1 year ago

Hm, no idea. Do you do anything with PointXYZL that could lead to the mixup? Could this be a problem with clang's ASAN or with clang itself? Or the Apple versions, respectively? I checked the relevant places in the PCL code, and I can't find anything suspicious.

themightyoarfish commented 1 year ago

Not once in my life have i used PointXYZL … hence my confusion. Maybe it's asan-related (seems most plausible explanation to me), but the crash exists also without it, it's just unclear then what exactly fails.

themightyoarfish commented 1 year ago

However is it not strange then that defining PCL_NO_PRECOMPILE makes it go away?

mvieth commented 1 year ago

Maybe you can try using valgrind instead of compiling with ASAN, to see if you still see PointXYZL in the report.

Regarding the crash itself: Which flags do you use when building PCL and which flags when building your own project? (SSE and AVX flags, and everything else that could be alignment-related). Usually these types of errors happen if these flags don't match

themightyoarfish commented 1 year ago

good idea, although so far I think I've only seen it on macos where valgrind is unavailable.

PCL is compiled from source (partly for this reason) and uses -march=native (or -mtune=native on arm) just like my client code.

themightyoarfish commented 1 year ago

It's also weird that from what I can see, the PCL_INSTANTIATE_PointCloud() macro is never used in the codebase, so it's not like there's precompiled instantiations with different point types are lying around and somehow can get mixed up.

themightyoarfish commented 1 year ago

I wonder if it is coincidental that PointXYZL and PointXYZRGB are neighbours in this list

#define PCL_XYZ_POINT_TYPES   \
  (pcl::PointXYZ)             \
  (pcl::PointXYZI)            \
  (pcl::PointXYZL)            \
  (pcl::PointXYZRGBA)         \
…
themightyoarfish commented 1 year ago

Once I have this reappear, I will try again on Linux, but I don't think I've ever seen this issue outside of macOS ARM. I suppose a compiler / linker bug is not completely out of the questions, as it is a new-ish platform.

mvieth commented 1 year ago

good idea, although so far I think I've only seen it on macos where valgrind is unavailable.

Hm, that's annoying

PCL is compiled from source (partly for this reason) and uses -march=native (or -mtune=native on arm) just like my client code.

Just to be sure: everything happens on the same computer? No cross-compilation?

It's also weird that from what I can see, the PCL_INSTANTIATE_PointCloud() macro is never used in the codebase, so it's not like there's precompiled instantiations with different point types are lying around and somehow can get mixed up.

True, although if e.g. VoxelGrid gets instantiated for some type, it will likely also instantiate PointCloud for the same type.

I wonder if it is coincidental that PointXYZL and PointXYZRGB are neighbours in this list

#define PCL_XYZ_POINT_TYPES   \
  (pcl::PointXYZ)             \
  (pcl::PointXYZI)            \
  (pcl::PointXYZL)            \
  (pcl::PointXYZRGBA)         \
…

That's PointXYZRGBA, while PointXYZRGB is one below in the list. In your logs I only see PointXYZRGB. I guess you can try to shuffle this list and see whether ASAN still reports PointXYZL or another type

themightyoarfish commented 1 year ago

Oops yeah I misread. Can confirm, everything is compiled locally on the same machine.

themightyoarfish commented 1 year ago

This same problem (at least I assume its the same, as the crash is at the exact same location) produces an lldb backtrace like this

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00000001a400ad98 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a403fee0 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001a3f7a340 libsystem_c.dylib`abort + 168
    frame #3: 0x00000001a3e5c8c0 libsystem_malloc.dylib`malloc_vreport + 552
    frame #4: 0x00000001a3e5ff34 libsystem_malloc.dylib`malloc_report + 64
    frame #5: 0x00000001a3e4ecf4 libsystem_malloc.dylib`free + 300
    frame #6: 0x00000001000dbab0 autocrane-core_main`Eigen::internal::handmade_aligned_free(ptr=0x0000600000087850) at Memory.h:118:5
    frame #7: 0x00000001000dba7c autocrane-core_main`Eigen::internal::aligned_free(ptr=0x0000600000087850) at Memory.h:206:5
    frame #8: 0x00000001009be030 autocrane-core_main`Eigen::aligned_allocator<pcl::PointXYZ>::deallocate(this=0x000060000369c7f8, p=0x0000600000087850, (null)=1) at Memory.h:897:5
    frame #9: 0x00000001009bdee0 autocrane-core_main`std::__1::allocator_traits<Eigen::aligned_allocator<pcl::PointXYZ> >::deallocate(__a=0x000060000369c7f8, __p=0x0000600000087850, __n=1) at allocator_traits.h:282:13
    frame #10: 0x00000001009bdd28 autocrane-core_main`std::__1::__vector_base<pcl::PointXYZ, Eigen::aligned_allocator<pcl::PointXYZ> >::~__vector_base(this=0x000060000369c7e8) at vector:488:9
    frame #11: 0x00000001009bdbfc autocrane-core_main`std::__1::vector<pcl::PointXYZ, Eigen::aligned_allocator<pcl::PointXYZ> >::~vector(this=0x000060000369c7e8 size=0) at vector:579:5
    frame #12: 0x00000001009bd9f0 autocrane-core_main`std::__1::vector<pcl::PointXYZ, Eigen::aligned_allocator<pcl::PointXYZ> >::~vector(this=0x000060000369c7e8 size=0) at vector:574:5
    frame #13: 0x00000001009beaa8 autocrane-core_main`pcl::PointCloud<pcl::PointXYZ>::~PointCloud(this=0x000060000369c7c0) at point_cloud.h:172:21
    frame #14: 0x00000001009bea78 autocrane-core_main`pcl::PointCloud<pcl::PointXYZ>::~PointCloud(this=0x000060000369c7c0) at point_cloud.h:172:21
    frame #15: 0x00000001013593f8 autocrane-core_main`std::__1::__shared_ptr_emplace<pcl::PointCloud<pcl::PointXYZ>, Eigen::aligned_allocator<pcl::PointCloud<pcl::PointXYZ> > >::__on_zero_shared(this=0x000060000369c7a0) at shared_ptr.h:315:24
   …
<mycode>
…
    frame #24: 0x000000010aee908c dyld`start + 520

But it does not seem to be alignment-related, as the pointer to deallocate is divisible by 16, which is the default alignment (and the one used here) on macos ARM.

themightyoarfish commented 1 year ago

With a debug build of pcl, the code fails at a different point, which is after the one reported above:

* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #0: 0x00000001a400ad98 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a403fee0 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001a3f7a340 libsystem_c.dylib`abort + 168
    frame #3: 0x00000001a3f79754 libsystem_c.dylib`__assert_rtn + 272
  * frame #4: 0x00000001013ea3e8 autocrane-core_main`void Eigen::MapBase<Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> >, 0>::checkSanity<Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> > >(this=0x000000016fdf1110, (null)=0x0000000000000000) > (0)), void*>::type) const at MapBase.h:196:7
    frame #5: 0x00000001013ea2f4 autocrane-core_main`Eigen::MapBase<Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> >, 0>::MapBase(this=0x000000016fdf1110, dataPtr=0x000000016fdf1130, rows=42, cols=1) at MapBase.h:177:7
    frame #6: 0x00000001013eb074 autocrane-core_main`Eigen::MapBase<Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> >, 0>::MapBase(this=0x000000016fdf1110, dataPtr=0x000000016fdf1130, rows=42, cols=1) at MapBase.h:173:5
    frame #7: 0x00000001013ea6a0 autocrane-core_main`void Eigen::RefBase<Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> > >::construct<Eigen::Matrix<float, -1, 1, 0, 4096, 1> >(this=0x000000016fdf1110, expr=0x000000016fdf1130) at Ref.h:109:40
    frame #8: 0x00000001013ea1f4 autocrane-core_main`void Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> >::construct<Eigen::Block<Eigen::Matrix<float, -1, 1, 0, -1, 1> const, -1, 1, false> >(this=0x000000016fdf1110, expr=0x000000016fdf1050, (null)=false_type @ 0x000000016fdf0f9f) at Ref.h:277:13
    frame #9: 0x00000001013ea110 autocrane-core_main`Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> >::Ref<Eigen::Block<Eigen::Matrix<float, -1, 1, 0, -1, 1> const, -1, 1, false> >(this=0x000000016fdf1110, expr=0x000000016fdf1050, (null)=0x0000000000000000)(Traits::match<Eigen::Block<Eigen::Matrix<float, -1, 1, 0, -1, 1> const, -1, 1, false> >::ScalarTypeMatch), Eigen::Block<Eigen::Matrix<float, -1, 1, 0, -1, 1> const, -1, 1, false> >::type*) at Ref.h:253:7
    frame #10: 0x00000001013e4b64 autocrane-core_main`Eigen::Ref<Eigen::Matrix<float, -1, 1, 0, 4096, 1> const, 32, Eigen::InnerStride<1> >::Ref<Eigen::Block<Eigen::Matrix<float, -1, 1, 0, -1, 1> const, -1, 1, false> >(this=0x000000016fdf1110, expr=0x000000016fdf1050, (null)=0x0000000000000000)(Traits::match<Eigen::Block<Eigen::Matrix<float, -1, 1, 0, -1, 1> const, -1, 1, false> >::ScalarTypeMatch), Eigen::Block<Eigen::Matrix<float, -1, 1, 0, -1, 1> const, -1, 1, false> >::type*) at Ref.h:249:5
    frame #11: 0x00000001013e4674 autocrane-core_main`void Eigen::internal::stable_norm_impl_inner_step<Eigen::Matrix<float, -1, 1, 0, -1, 1>, float>(vec=0x000000016fdf52d8, ssq=0x000000016fdf5174, scale=0x000000016fdf517c, invScale=0x000000016fdf5178) at StableNorm.h:81:34
    frame #12: 0x00000001013e44e4 autocrane-core_main`Eigen::Matrix<float, -1, 1, 0, -1, 1>::RealScalar Eigen::internal::stable_norm_impl<Eigen::Matrix<float, -1, 1, 0, -1, 1> >(vec=0x000000016fdf52d8, (null)=0x0000000000000000) at StableNorm.h:101:3
    frame #13: 0x00000001013e41d8 autocrane-core_main`Eigen::MatrixBase<Eigen::Matrix<float, -1, 1, 0, -1, 1> >::stableNorm(this=0x000000016fdf52d8) const at StableNorm.h:219:10
    frame #14: 0x000000010145c6b0 autocrane-core_main`Eigen::LevenbergMarquardt<Eigen::NumericalDiff<pcl::registration::TransformationEstimationLM<pcl::PointXYZ, pcl::PointXYZ, float>::OptimizationFunctorWithIndices, (Eigen::NumericalDiffMode)0>, float>::minimizeInit(this=0x000000016fdf52b8, x=0x000000016fdf5438) at LevenbergMarquardt.h:203:18
    frame #15: 0x000000010145bebc autocrane-core_main`Eigen::LevenbergMarquardt<Eigen::NumericalDiff<pcl::registration::TransformationEstimationLM<pcl::PointXYZ, pcl::PointXYZ, float>::OptimizationFunctorWithIndices, (Eigen::NumericalDiffMode)0>, float>::minimize(this=0x000000016fdf52b8, x=0x000000016fdf5438) at LevenbergMarquardt.h:160:46
    frame #16: 0x00000001013e0570 autocrane-core_main`pcl::registration::TransformationEstimationLM<pcl::PointXYZ, pcl::PointXYZ, float>::estimateRigidTransformation(this=0x0000600002924118, cloud_src=0x000060000302cc80, indices_src=size=42, cloud_tgt=0x0000600003017980, indices_tgt=size=42, transformation_matrix=0x000000011d204cf0) const at transformation_estimation_lm.hpp:196:17
    frame #17: 0x00000001013e088c autocrane-core_main`pcl::registration::TransformationEstimationLM<pcl::PointXYZ, pcl::PointXYZ, float>::estimateRigidTransformation(this=0x0000600002924118, cloud_src=0x000060000302cc80, cloud_tgt=0x0000600003017980, correspondences=size=42, transformation_matrix=0x000000011d204cf0) const at transformation_estimation_lm.hpp:235:3
    frame #18: 0x000000010136b40c autocrane-core_main`pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ, float>::computeTransformation(this=0x000000011d204c20, output=0x000060000302cda0, guess=0x000000016fdf5e70) at icp.hpp:218:33
    frame #19: 0x0000000101466110 autocrane-core_main`pcl::Registration<pcl::PointXYZ, pcl::PointXYZ, float>::align(this=0x000000011d204c20, output=0x000060000302cda0, guess=0x000000016fdf5e70) at registration.hpp:212:3

Here it complains about actual alignment issues. However this may be unrelated. The code is executed inside an OpenMP parallel for, and removing that makes this error go away.