ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

Segmentation fault in multi-threaded code #304

Closed sfantao closed 1 year ago

sfantao commented 1 year ago

I have an application that uses up to 7 threads, and I randomly get segmentation faults from omnitrace version 1.10.2, coming from:

(gdb) #0  0x000015553b20bcae in std::_Hashtable<unsigned long, std::pair<unsigned long const, long>, std::allocator<std::pair<unsigned long const, long> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_before_node () at /usr/include/c++/7/bits/hashtable.h:1551
#1  std::_Hashtable<unsigned long, std::pair<unsigned long const, long>, std::allocator<std::pair<unsigned long const, long> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_node ()
    at /usr/include/c++/7/bits/hashtable.h:642
#2  0x000015553b9765dd in std::_Hashtable<unsigned long, std::pair<unsigned long const, long>, std::allocator<std::pair<unsigned long const, long> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::find () at /usr/include/c++/7/bits/hashtable.h:1425
#3  std::unordered_map<unsigned long, long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, long> > >::find () at /usr/include/c++/7/bits/unordered_map.h:920
#4  omnitrace::hip_activity_callback ()
    at /home/omnitrace/source/lib/omnitrace/library/roctracer.cpp:927
#5  0x000015553aff8236 in ?? ()
   from /pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.4.3/lib/libroctracer64.so.4
#6  0x000015553aff944c in ?? ()
   from /pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.4.3/lib/libroctracer64.so.4
#7  0x0000155554fe2a33 in ?? ()
   from /pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.4.3/deps/libstdc++.so.6
#8  0x000015553bfd7d54 in omnitrace::component::pthread_create_gotcha::wrapper::operator() ()
    at /home/omnitrace/source/lib/omnitrace/library/components/pthread_create_gotcha.cpp:276
#9  0x000015553bfd9402 in omnitrace::component::pthread_create_gotcha::wrapper::wrap ()
    at /home/omnitrace/source/lib/omnitrace/library/components/pthread_create_gotcha.cpp:305
#10 0x000015554bed06ea in start_thread () from /lib64/libpthread.so.0
#11 0x000015554bbe8a6f in clone () from /lib64/libc.so.6

This app was executed as:

omnitrace-sample --trace -- neko.exe hemi.case

I believe there might be a race on accesses to this unordered map. This comes from an app that is not trivial to build. Let me know if you'd like me to provide more information about the segfault or the app itself.

jrmadsen commented 1 year ago

Ah, yeah, I see why/how the data race is happening... the accesses are locked, but on different mutexes, so they don't actually exclude each other. I can get it patched easily and I’ll generate a new release.
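
For readers unfamiliar with this failure mode, here is a minimal sketch of the fix pattern: every reader and writer of a shared `std::unordered_map` takes the same mutex. The `CorrelationTable` class and the `run_demo` helper are hypothetical names made up for illustration; this is not omnitrace's actual code or its actual patch. If `find` locked a different mutex than `insert`, a lookup could walk a bucket chain while another thread rehashes the table, which is consistent with a crash inside `_M_find_before_node`.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a table shared by tracing callbacks.
// Names are illustrative only and do not come from omnitrace.
class CorrelationTable
{
public:
    void insert(std::uint64_t key, std::int64_t value)
    {
        std::lock_guard<std::mutex> lk{ m_mutex };
        m_data[key] = value;
    }

    // Returns true and writes to `out` if the key is present.
    bool find(std::uint64_t key, std::int64_t& out) const
    {
        // Must be the SAME mutex used in insert(). Locking a second,
        // unrelated mutex here would not keep writers out.
        std::lock_guard<std::mutex> lk{ m_mutex };
        auto itr = m_data.find(key);
        if(itr == m_data.end()) return false;
        out = itr->second;
        return true;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lk{ m_mutex };
        return m_data.size();
    }

private:
    mutable std::mutex                              m_mutex;
    std::unordered_map<std::uint64_t, std::int64_t> m_data;
};

// Hammers the table from several threads and returns the final entry count.
std::size_t run_demo(unsigned nthreads, std::uint64_t per_thread)
{
    CorrelationTable         table;
    std::vector<std::thread> workers;
    for(unsigned t = 0; t < nthreads; ++t)
    {
        workers.emplace_back([&table, t, per_thread] {
            for(std::uint64_t i = 0; i < per_thread; ++i)
            {
                std::uint64_t key   = t * per_thread + i;  // unique per thread
                std::int64_t  value = 0;
                table.insert(key, static_cast<std::int64_t>(i));
                (void) table.find(key, value);
            }
        });
    }
    for(auto& w : workers)
        w.join();
    return table.size();
}
```

Calling `run_demo(7, 1000)` mirrors the 7-thread case from the report. Building with `-fsanitize=thread` is one way to check that a change like this actually removes the race rather than just making it rarer.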

sfantao commented 1 year ago

Good stuff! Let me know how to get the patched version.

gmarkomanolis commented 1 year ago

Jonathan, this is the Neko code; I was discussing the same issue with Niclas today. Please let us know when there is a change.

jrmadsen commented 1 year ago

It is being held up by whatever happened to the build system on RedHat in HIP 5.5 and 5.6, seen in #300. Looks like some amdgpu libraries got moved and aren’t being found. I’m hoping to get to it soon.

jrmadsen commented 1 year ago

Ok, I finally found the time to sort out #300. I’ll get that merged shortly, and then addressing this will be quick and easy. There should be a release available tomorrow.

jrmadsen commented 1 year ago

It’s going to be a little while longer, until I figure out how to stop the packaging and code coverage jobs from routinely running out of disk space.

jinhongyii commented 1 year ago

Hi @jrmadsen, I also encountered almost the same bug when profiling my multi-threaded program (this one is in an unordered map; mine is in an ordered map's RBTree). Is there any estimate of when this bug fix will make it into a release? Thanks!

jrmadsen commented 1 year ago

I’m tied up with the rocprofiler v2 rewrite right now. @benrichard-amd is looking into fixing #300 so that the testing can pass. Right now it’s just the code coverage job that is failing. If he cannot find a fix soon, I’ll just disable that job so that I can merge it, fix the bug, and generate a release.

jinhongyii commented 1 year ago

Thanks @jrmadsen! Looking forward to good news on the fix.

jrmadsen commented 1 year ago

Just generated the new release. Installers should be available shortly; however, I haven't updated the installer generation to provide installers for ROCm 5.7 yet, just FYI.