ROCm / roctracer

ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs
https://rocm.docs.amd.com/projects/roctracer/en/latest/

Feature Request: Allow Modification of HIP Arguments in the profiling callback in HIP Internal and/or a HIP API table Implementation #86

Closed matinraayai closed 7 months ago

matinraayai commented 1 year ago

Background

The Northeastern University Computer Architecture Lab (NUCAR) is currently working on a tool that would provide the same functionality as NVBit for instrumenting AMD GPU applications. Parallel to our group's work, AMD is actively supporting other instrumentation research efforts, most notably hiptracer worked on by @crozhon and his colleagues at UC Davis.

NVBit supports instrumentation on the device side (as well as of the CUDA-related host code) through a CUDA Driver API interception layer built on the LD_PRELOAD trick. We require a similar layer to intercept HIP/HSA/OpenCL library function calls to achieve the same goal.

Our initial approach (similar to the one taken in the hiptracer project at UC Davis) was to write this layer from scratch using dlsym. However, upon further inspection, this does not seem to be the way to go, for the following reasons:

  1. The HIP/HSA stack has a very wide range of functions to intercept. Writing an interception layer using dlsym from scratch is a large undertaking (as of right now about 400 enumerated APIs just for HIP), not to mention its maintenance efforts. Naive implementations (e.g. overriding each and every library call manually) are not the best path forward.
  2. NVBit does not seem to use the dlsym library directly. It uses the function cuGetExportTable (discussions on how it presumably works) to get the internal tables of functions in the CUDA driver layer and modifies them for interception. CUDA calls are performed normally, and then jump into nvbit once the cuGetExportTable function is called inside the CUDA library. NVBit provides its own callback API and doesn't rely on CUPTI API as far as we know.
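
The table-patching idea described in point 2 can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the actual layout of the CUDA export tables or of the HSA API table: ApiTable, tool_install, and every other name here are hypothetical. The tool saves a copy of the original table, then overwrites an entry with its own wrapper, so calls made through the table reach the tool first.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical two-entry API table; real runtimes export far larger ones.
struct ApiTable {
    int (*mem_alloc)(void** ptr, std::size_t size);
    int (*mem_free)(void* ptr);
};

// Stand-ins for the runtime's real implementations.
static int real_mem_alloc(void** ptr, std::size_t size) {
    *ptr = std::malloc(size);
    return *ptr ? 0 : 1;
}
static int real_mem_free(void* ptr) {
    std::free(ptr);
    return 0;
}

// The table the (mock) runtime dispatches through.
static ApiTable g_runtime_table = {real_mem_alloc, real_mem_free};

// Public entry points: every call goes through the table.
int api_mem_alloc(void** ptr, std::size_t size) {
    return g_runtime_table.mem_alloc(ptr, size);
}
int api_mem_free(void* ptr) { return g_runtime_table.mem_free(ptr); }

// ---- Tool side (hypothetical) ----
static ApiTable g_saved_table;    // copy of the original entries
static int g_alloc_calls = 0;     // simple counter kept by the tool

static int tool_mem_alloc(void** ptr, std::size_t size) {
    ++g_alloc_calls;                            // observe the call
    return g_saved_table.mem_alloc(ptr, size);  // forward to the original
}

// Called once when the tool is handed the table: save it, then patch it.
void tool_install(ApiTable* table) {
    g_saved_table = *table;
    table->mem_alloc = tool_mem_alloc;
}
```

Because the wrapper is installed by editing one table entry, no per-function symbol override is needed; entries the tool does not patch keep pointing at the original implementations.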

This led us to take a close look at roctracer/rocprofiler to see how they manage to capture HIP/HSA API calls.

Relation to roctracer

Roctracer's callback API captures both HSA and HIP APIs, and allows preloading to work with any application. Roctracer seems to work similarly to cuGetExportTable, as it interfaces with the HIP runtime directly through the profiling interface and uses an exported table from HSA. We were able to use the roctracer demo code to capture API function call arguments in our project, in a similar fashion to NVBit.

To instrument code, however, we require modification of the arguments to the API function calls. This is not allowed in roctracer, since tracing/profiling doesn't require modifying how the core application behaves.

Requested Feature

We request that support for modification of HIP/HSA API arguments in the callback API be added to the ROCm stack. This can be done by first modifying the activity_rtapi_callback_t type, which encapsulates user-specified callbacks, so that its data argument loses its const qualifier, as follows:

typedef void (*activity_rtapi_callback_t)(uint32_t domain, uint32_t cid, void* data, void* arg);

The modified void* data argument would then be carried over to the real API call. This feature doesn't necessarily need to be exposed through roctracer. We're more than happy to contribute it ourselves.
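
A minimal sketch of the requested behavior, assuming a mock runtime rather than the real HIP internals: the API entry point packs its arguments into a record, hands a mutable pointer to the registered callback, and then calls the real implementation with whatever the callback left in the record. MallocArgs, register_callback, api_malloc, the domain and call IDs, and the padding callback are all hypothetical names chosen for illustration; only the typedef comes from the request above.

```cpp
#include <stddef.h>
#include <stdint.h>

// Callback type from the request: `data` is no longer const.
typedef void (*activity_rtapi_callback_t)(uint32_t domain, uint32_t cid, void* data, void* arg);

// Hypothetical argument record and identifiers for one API call.
struct MallocArgs {
    void** ptr;
    size_t size;
};
static const uint32_t kDomainHip = 1;  // hypothetical domain ID
static const uint32_t kCidMalloc = 7;  // hypothetical call ID

// Mock "real" implementation: only records the size it was asked for.
static size_t g_last_real_size = 0;
static int real_malloc_impl(void** ptr, size_t size) {
    g_last_real_size = size;
    *ptr = nullptr;
    return 0;
}

// Callback registration (hypothetical).
static activity_rtapi_callback_t g_callback = nullptr;
static void* g_callback_arg = nullptr;
void register_callback(activity_rtapi_callback_t cb, void* arg) {
    g_callback = cb;
    g_callback_arg = arg;
}

// API entry point: the callback sees the arguments first and may edit them.
int api_malloc(void** ptr, size_t size) {
    MallocArgs args = {ptr, size};
    if (g_callback) g_callback(kDomainHip, kCidMalloc, &args, g_callback_arg);
    // The (possibly modified) arguments are carried over to the real call.
    return real_malloc_impl(args.ptr, args.size);
}

// Example tool callback: pads every allocation by the byte count in `arg`.
void pad_allocation(uint32_t /*domain*/, uint32_t cid, void* data, void* arg) {
    if (cid != kCidMalloc) return;
    MallocArgs* args = static_cast<MallocArgs*>(data);
    args->size += *static_cast<size_t*>(arg);
}
```

With a const data pointer the same callback could only read args->size; dropping the qualifier is what lets an instrumentation tool change the call before it reaches the runtime.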

Reasons to Support this feature

  1. Easier tool development for instrumentation.
  2. A single interception layer that works for profiling, tracing and instrumentation.
  3. Better maintenance of this piece of code, as any new API added to HIP gets automatically added to the interception layer.

matinraayai commented 1 year ago

An update to this feature request

I was able to write my own tracing layer using HSA's API table, which the ROCm runtime provides at startup, and adding the AMD tool priority global parameter. This did not require the use of dlsym. However, due to the "callback function registration" mechanism in HIP, the only option for an instrumentation tool that wants to intercept HIP functions and modify their arguments is to override the definitions of the HIP library calls and access the originals via dlsym. Although this is functional for my purpose, it still requires any other tool that uses my tool library to include it for LD_PRELOAD.

I would like to update my feature request to ask for one of two things: either implement and expose HIP API tables to tool writers, in the same manner as HSA does, or allow modification of arguments in the HIP callback function, which a tool writer can then register by dlsym-ing hipRegisterTracerCallback from the HIP library. Tagging @ammarwa for an update on this issue.
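
For readers unfamiliar with the "override and access the originals via dlsym" pattern mentioned above, here is a minimal sketch on Linux. It uses getpid from libc purely as a stand-in for a HIP entry point, and wrapped_getpid and next_symbol are hypothetical names. In an actual preloaded library the wrapper would be declared extern "C" under the same name as the function it replaces; that part is an assumption about how such a tool would be built and is not shown here.

```cpp
#include <dlfcn.h>   // dlsym, RTLD_NEXT (GNU extension; g++ on Linux enables it by default)
#include <unistd.h>  // pid_t, getpid

static int g_intercepted_calls = 0;  // bookkeeping kept by the wrapper

// Look up the next definition of `name` after the current object in the
// dynamic linker's search order, i.e. the "original" implementation.
static void* next_symbol(const char* name) {
    return dlsym(RTLD_NEXT, name);
}

// Stand-in wrapper (hypothetical name). A preload library would export this
// under the original function's name so the application calls it instead.
pid_t wrapped_getpid() {
    using getpid_fn = pid_t (*)();
    // Resolve the original once and cache the pointer.
    static getpid_fn real = reinterpret_cast<getpid_fn>(next_symbol("getpid"));
    ++g_intercepted_calls;        // inspect or modify arguments here
    return real ? real() : -1;    // forward to the original implementation
}
```

Every intercepted function needs its own wrapper of this shape, which is the maintenance cost that an exported HIP API table would avoid. On older glibc versions the program may need to be linked with -ldl.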

ammarwa commented 1 year ago

Hello,

Thank you for your detailed description!

We are considering changing the HIP Runtime API to be done in the same or similar way to the HSA API table methodology.

Let me know if you have more questions or if I missed anything here.

matinraayai commented 1 year ago

@ammarwa thank you for the update. Do you have any ETA on the feature implementation? I might be able to contribute this myself.

jrmadsen commented 1 year ago

> @ammarwa thank you for the update. Do you have any ETA on the feature implementation? I might be able to contribute this myself.

I suspect the HIP API tables won't be available until ROCm 6.0 and this is something we want to handle ourselves.

> Although this is functional for my purpose, it still requires any other tool that uses my tool library to include it for LD_PRELOAD.

You might want to look into using LLNL/GOTCHA. It effectively provides a programmatic LD_PRELOAD. I use it extensively in AMDResearch/omnitrace to wrap tons of functions from MPI, RCCL, numa, pthreads, and others without depending on LD_PRELOAD.

jrmadsen commented 1 year ago

Side note: I've been rewriting our intercept code for the HSA API tables, and there is a clear path for us to support modification of the arguments in the callbacks before we pass those args to the underlying function. So, by the time the HIP API tables are available, this may be possible through rocprofiler itself (unless we deem making this capability available in this context too risky).

matinraayai commented 1 year ago

@jrmadsen thank you for the explanation. I think making the HIP API table available through the tool_load function like the HSA API tables will be enough for a custom tool that requires modifications. That way rocprofiler can enforce its const requirement. Either way, as long as it is in the works and becomes available to the tool writers we're satisfied.

Also I will look into GOTCHA, seems like what we need until the HIP API table support is added. Thanks for pointing me to it.

Some lingering questions: is it possible for the HIP registration functions (__hipRegisterFatBinary and co.) to also be included in the API table? What are the criteria for a function to be included in the HIP API table? And what happens when an external API function is called by another external API function (or will that never happen, because the "implementation" version of said API is used internally)?

matinraayai commented 7 months ago

We have solved this issue on our end by implementing callback functions ourselves with HSA, and using LD_PRELOAD for capturing HIP calls (waiting on HIP API tables to be added in the future).

We no longer need this feature to be implemented in roctracer, so I'm closing this issue. Thank you for the help and for pointing us in the right direction.