ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

Omnitrace hangs with omnitrace-instrument #329

Closed anupambhatnagar closed 8 months ago

anupambhatnagar commented 9 months ago

Hi, I'm trying to instrument a binary application on MI300X with Omnitrace. To verify my installation, I ran the example script here and confirmed that the omnitrace-instrument and omnitrace-run commands work as expected: I'm able to generate the Perfetto trace and view it.

On my executable, Omnitrace launches and seems to hang. Here's the backtrace. Any suggestions to debug this would be highly appreciated. Thank you!

https://gist.github.com/anupambhatnagar/ad76524da1ca783f18ec08ad5805ac06

jrmadsen commented 9 months ago

Are you certain the application is hanging? Can you check CPU activity in another console while the application is running? I ask because runtime instrumentation unfortunately tends to take a very long time: it parses not only your executable but also every library linked to it. That is why I generally recommend binary rewrites if you don’t want to instrument the shared libraries linked to the executable. If you are unsure, it might help to use omnitrace-run with sampling enabled on an uninstrumented executable to see whether the backtraces show a lot of time being spent in the linked libraries.
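One way to tell a slow run from a true hang is to compare the CPU time a process has accumulated at two points in time. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function names cpu_ticks and cpu_state are hypothetical helpers and are not part of Omnitrace.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Omnitrace); assumes Linux with /proc.

# Print the user + system CPU ticks consumed so far by process $1.
cpu_ticks() {
    [ -r "/proc/$1/stat" ] || return 1
    # Drop "pid (comm) " first so spaces in the command name cannot shift fields.
    # After that, utime and stime are fields 12 and 13.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# Print "busy" if process $1 used CPU time during $2 seconds (default 2), else "idle".
cpu_state() {
    pid=$1
    interval=${2:-2}
    before=$(cpu_ticks "$pid") || return 1
    sleep "$interval"
    after=$(cpu_ticks "$pid") || return 1
    if [ "$after" -gt "$before" ]; then
        echo busy
    else
        echo idle
    fi
}
```

Calling cpu_state with the PID of the instrumented process and an interval of a few seconds prints busy while the binary is still being parsed. A repeated idle result is more consistent with a deadlock, although a process blocked on I/O would look the same.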

anupambhatnagar commented 9 months ago

Thanks @jrmadsen for the prompt reply. I'll monitor the CPU activity to verify if it is running or hanging and also use omnitrace-run.

anupambhatnagar commented 8 months ago

I tried omnitrace-run on my binary and it kept running for over an hour, at which point I exited with Ctrl-C. The binary is a basic Triton kernel that executes in less than a couple of seconds with Triton and PyTorch. The build system I use (Buck) packages everything together and generates a 700 MB executable. Unfortunately, running ldd on the file reports that it is not a dynamic executable, so I can't see the linked libraries.
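ldd prints "not a dynamic executable" both for statically linked ELF binaries and for files that are not ELF binaries at all, such as scripts or archive wrappers. A quick look at the first four bytes can separate the two cases. This is a small sketch; file_kind is a hypothetical helper, and the thread does not say which case applies to rms_norm.par.

```shell
#!/bin/sh
# Hypothetical helper: classify a file by its first four bytes.
# ELF binaries start with 0x7f 'E' 'L' 'F'; scripts start with "#!".
file_kind() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        7f454c46) echo elf ;;
        2321*)    echo script ;;
        *)        echo other ;;
    esac
}
```

If the result is elf, readelf -d on the file lists any dynamic section entries. If it is script or other, the real executable is likely launched by the wrapper, and its libraries only become visible once the process is running.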

I also tried omnitrace-run --enable-categories rocprofiler -- ./rms_norm.par, but it didn't help. top shows CPU utilization at 0.0%.

❯ omnitrace-run --enable-categories rocprofiler -- ./rms_norm.par

OMNITRACE: HSA_TOOLS_LIB=/home/anupamb/omnitrace/lib/libomnitrace-dl.so.1.11.0
OMNITRACE: HSA_TOOLS_REPORT_LOAD_FAILURE=1
OMNITRACE: LD_PRELOAD=/home/anupamb/omnitrace/lib/libomnitrace-dl.so.1.11.0
OMNITRACE: OMNITRACE_ENABLE_CATEGORIES=rocprofiler
OMNITRACE: OMP_TOOL_LIBRARIES=/home/anupamb/omnitrace/lib/libomnitrace-dl.so.1.11.0
OMNITRACE: ROCP_HSA_INTERCEPT=1
OMNITRACE: ROCP_TOOL_LIB=/home/anupamb/omnitrace/lib/libomnitrace.so.1.11.0
[omnitrace][dl][1292192] omnitrace_main
[omnitrace][1292192][omnitrace_init_tooling] Instrumentation mode: Sampling

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    omnitrace v1.11.0 (rev: 77d52814e9050004cfb11d7917e155b00ab861b1, tag: v1.11.0, compiler: GNU v11.4.1, rocm: v6.0.x)

jrmadsen commented 8 months ago

I was not aware this was a PyTorch app. If your executable is 700 MB, I’m not surprised Dyninst takes forever to parse the binary. You’ve clearly got a deadlock: sampling doesn’t slow down an app that runs in a couple of seconds to more than a minute or two. Are you executing on multiple GPUs? PyTorch RPATHs its own ROCm libraries (or in your case, it might statically link or dlopen them), and this is not going to play nice with Omnitrace loading a different ROCm runtime.

jrmadsen commented 8 months ago

Honestly, I’d probably install the Omnitrace build that doesn’t have ROCm support. Until we complete our work on a new roctracer/rocprofiler implementation that doesn’t link to the HIP/HSA runtimes, there’s very little that tools like Omnitrace can do for apps like PyTorch, which use their own “hidden” ROCm distributions, because this results in multiple ROCm runtimes being loaded.
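To check whether a running process really has more than one copy of a ROCm runtime loaded, one can look for distinct library paths in its memory map. The sketch below assumes the /proc/<pid>/maps format on Linux; mapped_libs and has_duplicate_runtime are hypothetical helpers, and a runtime that was statically linked into the executable would not show up this way.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Omnitrace or ROCm).

# Print the distinct on-disk paths of mapped files whose path matches a pattern.
# $1: a file in /proc/<pid>/maps format; $2: regex for the library name.
# Note: paths containing spaces are truncated at the first space.
mapped_libs() {
    awk -v pat="$2" '$6 ~ pat { print $6 }' "$1" | sort -u
}

# Exit 0 when more than one distinct copy of the library is mapped.
has_duplicate_runtime() {
    [ "$(mapped_libs "$1" "$2" | wc -l)" -gt 1 ]
}
```

For example, running mapped_libs /proc/<pid>/maps libamdhip64 against the hung process would list every HIP runtime path it has mapped, and the same can be done for libhsa-runtime64. Two different directories in the output would match the conflict described above.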

anupambhatnagar commented 8 months ago

I got Omnitrace working with my Triton kernel on MI300. To get it working, I built PyTorch from source on MI300, installed triton-rocm, and then ran Omnitrace on my kernel. It worked flawlessly. Kudos to you for building this high-quality software.

I will be diving deeper into it next week and will reach out if I have more questions, which I most likely will 😄. I love the fact that you dump Perfetto-compatible output.