ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
MIT License

Infinite recursion in librocprofiler.so #66

Open mxz297 opened 2 years ago

mxz297 commented 2 years ago

Hi,

I am a developer from the HPCToolkit project at Rice University. I am developing AMD GPU counter support in HPCToolkit, using the rocprofiler API directly. I am currently running into an infinite recursion in librocprofiler.so, with the following stack trace:

#0  0x00007f38ba752e76 in d_print_comp_inner () from /lib64/libstdc++.so.6
#1  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#2  0x00007f38ba753c57 in d_print_comp_inner () from /lib64/libstdc++.so.6
#3  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#4  0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#5  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#6  0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#7  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#8  0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#9  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#10 0x00007f38ba757d02 in d_print_function_type.isra () from /lib64/libstdc++.so.6
#11 0x00007f38ba75474a in d_print_comp_inner () from /lib64/libstdc++.so.6
#12 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#13 0x00007f38ba754a42 in d_print_comp_inner () from /lib64/libstdc++.so.6
#14 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#15 0x00007f38ba75387d in d_print_comp_inner () from /lib64/libstdc++.so.6
#16 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#17 0x00007f38ba75907f in d_demangle_callback.constprop () from /lib64/libstdc++.so.6
#18 0x00007f38ba759361 in __cxa_demangle () from /lib64/libstdc++.so.6
#19 0x00007f38b0b51dd5 in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#20 0x00007f38b8cbfc6d in rocr::amd::hsa::loader::ExecutableImpl::IterateSymbols(hsa_status_t (*)(hsa_executable_s, hsa_executable_symbol_s, void*), void*) ()
   from /opt/rocm-4.3.1/lib/libhsa-runtime64.so.1
#21 0x00007f38b8c9b853 in rocr::HSA::hsa_executable_iterate_symbols(hsa_executable_s, hsa_status_t (*)(hsa_executable_s, hsa_executable_symbol_s, void*), void*) ()
   from /opt/rocm-4.3.1/lib/libhsa-runtime64.so.1
#22 0x00007f38b0b4fa4a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#23 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#24 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#25 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#26 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#27 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#28 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#29 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#30 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#31 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#32 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#33 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#34 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#35 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#36 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#37 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#38 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#39 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so

As you can see, this is with rocm-4.3.1. With that version, I was able to work around the issue by setting settings->code_obj_tracking = 0; inside OnLoadToolProp:

https://github.com/HPCToolkit/hpctoolkit/blob/rocprofiler_support/src/tool/hpcrun/gpu/amd/rocprofiler-api.c#L436

This workaround does not seem to work with rocm-4.5.0, where I see a similar infinite recursion in librocprofiler.

Are there any recommendations or insights on resolving this problem?

mxz297 commented 2 years ago

I recompiled rocprofiler from source for both rocm-4.3.1 and rocm-4.5.2 to get a better understanding of the problem. On the surface, the infinite recursion happens because the interceptor function for the code object freeze operation is calling itself (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L779).

Previously, with rocm-4.3.1, I was able to work around this problem by disabling code object tracking. This workaround no longer works in rocm-4.5.2 because code object tracking is always enabled (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.5.2/src/core/rocprofiler.cpp#L429). A user can still set the field that disables code object tracking, but the setting is ignored. It would be really helpful to document important internal changes like this.

Now back to the real problem, which is why the infinite recursion happens:

hsa_status_t HsaRsrcFactory::hsa_executable_freeze_interceptor(hsa_executable_t executable, const char *options) {
  std::lock_guard<mutex_t> lck(mutex_);
  if (symbols_map_ == NULL) symbols_map_ = new symbols_map_t;
  hsa_status_t status = hsa_api_.hsa_executable_iterate_symbols(executable, executable_symbols_cb, NULL);
  CHECK_STATUS("Error in iterating executable symbols", status);
  return hsa_api_.hsa_executable_freeze(executable, options);
}

The function pointer in the return statement ends up being HsaRsrcFactory::hsa_executable_freeze_interceptor itself, which causes the recursion. hsa_api_ is set in the function HsaRsrcFactory::InitHsaApiTable (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L184). Based on my understanding, hsa_api_ records the set of actual HSA calls, so that the interceptors can be interposed on HSA calls and still forward to the real implementations.

HsaRsrcFactory::InitHsaApiTable is called in two places:

  1. https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L127
  2. https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/core/rocprofiler.cpp#L114

When I traced this in gdb, the following events happened:

  1. InitHsaApiTable is called at the first call site with the input parameter NULL. This causes hsa_api_ to be initialized with function pointers to the public HSA API entries.
  2. InitHsaApiTable is called at the second call site with an input parameter that represents the actual implementation of the HSA APIs. This call does not update hsa_api_, because of the first if statement at the beginning of InitHsaApiTable.

The fundamental issue here is that the HSA API entries are just wrapper functions around the actual implementation functions. For example, the API entry for hsa_executable_freeze is shown below (I could not find the HSA runtime source code, so I disassembled the shared library):

0000000000067940 <hsa_executable_freeze>:
   67940:       48 8b 05 89 2a 42 00    mov    0x422a89(%rip),%rax        # 48a3d0 <_ZL12coreApiTable>
   67947:       ff a0 c0 02 00 00       jmpq   *0x2c0(%rax)

It is clear to me that this function just reads a function table and then does a tail call to the actual implementation.

The problem is that if hsa_api_ points to this public entry function, which is just a wrapper, and rocprofiler later replaces the entry in the actual coreApiTable with its interceptor, we end up with the interceptor calling itself.
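The mechanism described above can be modeled without the HSA runtime. The following is only a minimal sketch: every name in it (CoreTable, public_freeze, saved_freeze, and so on) is a hypothetical stand-in, not a rocprofiler or HSA identifier, and a depth limit replaces the real stack overflow so the demonstration terminates.

```cpp
#include <cassert>

// Hypothetical stand-ins; none of these names come from rocprofiler or HSA.
using freeze_fn = int (*)(int depth);

struct CoreTable { freeze_fn freeze; };

// "Actual implementation": reports the depth at which it was reached.
int real_freeze(int depth) { return depth; }

// Plays the role of the runtime's function table (coreApiTable in the disassembly).
CoreTable core_table{real_freeze};

// Public entry point: a thin wrapper that dispatches through the table.
int public_freeze(int depth) { return core_table.freeze(depth); }

// Plays the role of the pointer the profiler saved for forwarding calls.
freeze_fn saved_freeze = nullptr;

// Bound on re-entry so the demo returns instead of overflowing the stack.
constexpr int kMaxDepth = 8;

int freeze_interceptor(int depth) {
  if (depth >= kMaxDepth) return -1;  // the interceptor kept re-entering itself
  return saved_freeze(depth + 1);     // forward to whatever was captured
}

// Captures `captured` as the forwarding target, installs the interceptor,
// then calls the public entry. Returns the depth at which the real
// implementation ran, or -1 if it was never reached.
int run(freeze_fn captured) {
  assert(captured != nullptr);
  core_table.freeze = real_freeze;         // reset the table
  saved_freeze = captured;                 // what the profiler remembers
  core_table.freeze = freeze_interceptor;  // interceptor installed in the table
  return public_freeze(0);
}
```

In this model, run(public_freeze) never reaches real_freeze because the saved pointer routes back through the table to the interceptor, while run(real_freeze) reaches the implementation after a single interceptor hop.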

To resolve this problem, it looks to me like we should just remove the first if statement in InitHsaApiTable (https://github.com/ROCm-Developer-Tools/rocprofiler/blob/rocm-4.3.x/src/util/hsa_rsrc_factory.cpp#L187). This ensures that rocprofiler picks up the actual HSA implementation functions once the HSA runtime provides them, while still being able to call HSA functions before the implementation table is ready. With this change, I can resolve the infinite recursion problem, at least locally.
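To illustrate why the guard matters, here is a small model of the two-call initialization sequence. It is a sketch under assumptions: the shape of the guard ("skip if already initialized") is my reading of the description above rather than a copy of the rocprofiler source, and Factory, ApiTable, and saved_is_real are hypothetical names.

```cpp
#include <cassert>

using freeze_fn = int (*)();

struct ApiTable { freeze_fn freeze; };

int real_freeze() { return 42; }                         // actual implementation
ApiTable runtime_table{real_freeze};                     // table owned by the runtime
int public_freeze() { return runtime_table.freeze(); }   // public wrapper entry

struct Factory {
  ApiTable saved{nullptr};  // plays the role of hsa_api_
  bool guard_enabled;

  explicit Factory(bool guard) : guard_enabled(guard) {}

  // table == nullptr models the early call made before the runtime has
  // handed over its implementation table.
  void InitApiTable(const ApiTable* table) {
    // Assumed shape of the guard: do nothing once the table has been filled.
    if (guard_enabled && saved.freeze != nullptr) return;
    saved.freeze = (table != nullptr) ? table->freeze : public_freeze;
  }
};

// Replays the observed order (first NULL, then the real table) and reports
// whether the saved pointer ends up at the actual implementation.
bool saved_is_real(bool guard_enabled) {
  Factory f(guard_enabled);
  f.InitApiTable(nullptr);         // first call site
  f.InitApiTable(&runtime_table);  // second call site
  assert(f.saved.freeze != nullptr);
  return f.saved.freeze == real_freeze;
}
```

With the guard enabled, the second call is ignored and the saved pointer stays on the public wrapper; with the guard removed, the second call overwrites it with the implementation pointer.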

While I would like to make a PR for this, I find that the rocprofiler GitHub repo is in a rather strange state: the rocm-4.5.2 tag is shown as containing commits that are not in the repo, and both the amd-master branch and the rocm-4.5.2 branch are behind the rocm-4.5.2 tag. Against which branch should I make the PR?

mxz297 commented 2 years ago

A similar infinite recursion showed up when using the code object URI callback in roctracer (https://github.com/ROCm-Developer-Tools/roctracer/blob/amd-master/test/app/codeobj_test.cpp#L60). After some investigation, I posted a PR against the rocm-4.5.x branch that addresses both instances of infinite recursion (#70).

kikimych commented 2 years ago

Could you please create a small reproducer and share the command line for running the test? By default, table->core->hsa_executable_freeze_fn is equal to rocr::HSA::hsa_executable_freeze. This means that checking the table for NULL is obsolete in that case.

mxz297 commented 2 years ago

@kikimych I have uploaded a reproducer: rocprofiler-test.tar.gz

marklawsAMD commented 1 year ago

Hi @mxz297,

I can reproduce this on current rocprof with your example code; thanks. I'll see if your patch in #70 fixes it (thank you for the PR too!), otherwise I'll have to keep looking into it.

ppanchad-amd commented 1 month ago

@mxz297 Can you please check whether your issue still exists in the latest ROCm 6.2? If it is resolved, please close the ticket. Thanks!

sohaibnd commented 1 week ago

Hi @mxz297, thanks for your patience.

I was able to reproduce your issue on the latest ROCm using the reproducer code provided. You are correct that the hsa_api_ table of functions that rocprofiler uses is not being initialized correctly. The reason is that before the rocprofiler tool can be used, the HSA runtime has to be initialized, and it loads rocprofiler during that initialization. However, HSA calls pthread_create during initialization, and you have modified pthread_create to call rocprofiler_iterate_info, so we end up calling rocprofiler_iterate_info before HSA has completed initialization. This leads to hsa_api_ being incorrectly initialized and causes the infinite recursion problem.

The solution here is to make sure HSA has been completely initialized before any calls to the rocprofiler API (you can do this explicitly by calling hsa_init()). This is tricky because, as mentioned above, HSA uses pthread_create during initialization. If you wish to keep the rocprofiler API calls inside your pthread_create, you would need to guarantee two things: that you call hsa_init before any calls to pthread_create happen (perhaps by using an __attribute__((constructor)) function to call hsa_init when libtest.so loads), and that these rocprofiler API calls are disabled until HSA has been initialized (by using a flag). I have attached the modified libtest.c code as an example fix.
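The constructor-plus-flag pattern described above might look roughly like the following. This is not the attached libtest.c; it is a self-contained sketch in which fake_hsa_init stands in for hsa_init(), fake_profiler_query stands in for a rocprofiler API call such as rocprofiler_iterate_info, and on_thread_create stands in for the body of the interposed pthread_create. All of those names are hypothetical. __attribute__((constructor)) is a GCC/Clang extension.

```cpp
#include <atomic>
#include <cassert>

// Set only after the (stand-in) runtime has finished initializing.
std::atomic<bool> hsa_ready{false};

int profiler_calls = 0;  // profiler queries that were actually issued
int skipped_calls = 0;   // hook invocations that arrived too early

// Stand-in for a rocprofiler API call; it must only run after init.
void fake_profiler_query() {
  assert(hsa_ready.load(std::memory_order_acquire));
  ++profiler_calls;
}

// Stand-in for the extra work done inside an interposed pthread_create.
bool on_thread_create() {
  if (!hsa_ready.load(std::memory_order_acquire)) {
    ++skipped_calls;  // runtime still initializing: do not touch the profiler
    return false;
  }
  fake_profiler_query();
  return true;
}

// Stand-in for hsa_init(): the runtime creates a thread while initializing,
// which re-enters the hook before the flag is set.
int fake_hsa_init() {
  on_thread_create();
  return 0;
}

// Runs when the library (or executable) is loaded, before main().
__attribute__((constructor)) void init_hsa_early() {
  if (fake_hsa_init() == 0) {
    hsa_ready.store(true, std::memory_order_release);
  }
}
```

In this model, the hook invocation that happens during initialization is counted as skipped, and only invocations after the constructor has finished reach the profiler query.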

Also, note that OnLoadToolProp is called when HSA loads rocprofiler, so you can use that function for any rocprofiler-related initialization that you need to do in your own code.

Please let me know if that fixes your issue!

libtest_modified.zip Link to HSA runtime source code