ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

omnitrace-python errors with OMNITRACE_USE_ROCM_SMI = true #330

Open anupambhatnagar opened 8 months ago

anupambhatnagar commented 8 months ago

Hi, I'm profiling a triton kernel on MI300 with rocm 6.0.0.

  1. When I set OMNITRACE_USE_ROCM_SMI to true, the trace fails to collect events from ROCm SMI. The backtrace is available here.
  2. There is no HIP Activity Device track in the generated trace.

The omnitrace config I use is here.

How can I enable the collection of events from rocm-smi and view the HIP Activity Device?

Thanks!

P.S. I installed omnitrace using omnitrace-1.11.0-rhel-9.3-ROCm-60000-PAPI-OMPT-Python3.sh from the releases page.

jrmadsen commented 8 months ago

Based on the error message here, it looks like rocm-smi doesn't support getting the temperature on MI300, so omnitrace disables rocm-smi sampling, which is why you don't see any activity.

How can I enable the collection of events from rocm-smi and view the HIP Activity Device?

It looks like you won't be able to collect ROCm SMI data until either an omnitrace patch selectively collects only the supported queries or rocm-smi adds full support for MI300.
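
If you want to double-check which queries fail on your card, a minimal diagnostic sketch along these lines should show it. It calls librocm_smi64.so directly via ctypes; the enum values are copied from rocm_smi.h and the sensor list is an assumption, so adjust them for your ROCm version.

# Sketch: probe which rocm-smi temperature queries succeed on device 0.
# Assumes librocm_smi64.so is on the loader path and that the enum values
# below (taken from rocm_smi.h) match this ROCm version.
import ctypes

smi = ctypes.CDLL("librocm_smi64.so")
smi.rsmi_init.argtypes = [ctypes.c_uint64]
smi.rsmi_dev_temp_metric_get.argtypes = [
    ctypes.c_uint32, ctypes.c_uint32, ctypes.c_uint32,
    ctypes.POINTER(ctypes.c_int64),
]

RSMI_STATUS_SUCCESS = 0
RSMI_TEMP_CURRENT = 0                              # rsmi_temperature_metric_t
SENSORS = {"edge": 0, "junction": 1, "memory": 2}  # rsmi_temperature_type_t

assert smi.rsmi_init(0) == RSMI_STATUS_SUCCESS

for name, sensor in SENSORS.items():
    temp = ctypes.c_int64(0)
    status = smi.rsmi_dev_temp_metric_get(0, sensor, RSMI_TEMP_CURRENT, ctypes.byref(temp))
    if status == RSMI_STATUS_SUCCESS:
        print(f"{name}: {temp.value / 1000.0} C")   # value is reported in millidegrees
    else:
        print(f"{name}: unsupported (status {status})")

smi.rsmi_shut_down()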

anupambhatnagar commented 8 months ago

I saw the error logs that you highlighted. I just wanted to confirm if it is expected behavior or not.

  1. Do you plan to add support (via an omnitrace patch) for MI300 in the near future?
  2. Why is the HIP Activity Device Track missing? I'm using rocm-6.0.0-91.

anupambhatnagar commented 8 months ago

For my current use case, the HIP Activity Device track is much more important than the metrics provided by rocm-smi. I collected a trace of the following toy program and it shows the device activity track.

Toy hip kernel

#include <cstdio>
#include <hip/hip_runtime.h>

#define CHECK(status) do { check((status), __FILE__, __LINE__); } while(false)
inline static void check(hipError_t error_code, const char *file, int line)
{
    if (error_code != hipSuccess)
    {
        fprintf(stderr, "HIP Error %d %s: %s. In file '%s' on line %d\n", error_code, hipGetErrorName(error_code), hipGetErrorString(error_code), file, line);
        fflush(stderr);
        exit(error_code);
    }
}

__global__ void dummy_kernel(int a)
{
    printf("I am dummy kernel %d\n", a);
}

int main()
{
    printf("AAA\n");
    CHECK(hipDeviceSynchronize());
    printf("BBB\n");
    dummy_kernel<<< 1,1 >>>(1);
    printf("CCC\n");
    CHECK(hipDeviceSynchronize());
    printf("DDD\n");
    CHECK(hipStreamAddCallback(0, [](hipStream_t stream_, hipError_t status_, void * arg){
        printf("I am host function\n");
    }, nullptr, 0));
    printf("EEE\n");
    CHECK(hipDeviceSynchronize());
    printf("FFF\n");
    dummy_kernel<<< 1,1 >>>(2);
    printf("GGG\n");
    CHECK(hipDeviceSynchronize());
    printf("HHH\n");

    return 0;
}

Command used to profile the above example: omnitrace-run -c config-file.cfg -- ./binary_name.

omnitrace config

# auto-generated by omnitrace-avail (version 1.11.0) on 2024-01-24 @ 18:12

OMNITRACE_CONFIG_FILE               =
OMNITRACE_TRACE                     =
OMNITRACE_TRACE_DELAY               = 0
OMNITRACE_TRACE_DURATION            = 0
OMNITRACE_TRACE_PERIOD_CLOCK_ID     = CLOCK_REALTIME
OMNITRACE_TRACE_PERIODS             =
OMNITRACE_PROFILE                   = false
OMNITRACE_USE_SAMPLING              = true
OMNITRACE_USE_PROCESS_SAMPLING      = true
OMNITRACE_USE_ROCTRACER             = true
OMNITRACE_USE_ROCM_SMI              = false 
OMNITRACE_USE_KOKKOSP               = false
OMNITRACE_USE_CAUSAL                = false
OMNITRACE_USE_MPIP                  = true
OMNITRACE_USE_PID                   = true
OMNITRACE_USE_RCCLP                 = false
OMNITRACE_USE_ROCPROFILER           = true 
OMNITRACE_USE_ROCTX                 = false
OMNITRACE_OUTPUT_PATH               = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX             =
OMNITRACE_CAUSAL_BACKEND            = auto
OMNITRACE_CAUSAL_BINARY_EXCLUDE     =
OMNITRACE_CAUSAL_BINARY_SCOPE       = %MAIN%
OMNITRACE_CAUSAL_DELAY              = 0
OMNITRACE_CAUSAL_DURATION           = 0
OMNITRACE_CAUSAL_FUNCTION_EXCLUDE   = 
OMNITRACE_CAUSAL_FUNCTION_SCOPE     = 
OMNITRACE_CAUSAL_MODE               = function
OMNITRACE_CAUSAL_RANDOM_SEED        = 0
OMNITRACE_CAUSAL_SOURCE_EXCLUDE     = 
OMNITRACE_CAUSAL_SOURCE_SCOPE       = 
OMNITRACE_CRITICAL_TRACE            = false
OMNITRACE_PAPI_EVENTS               = 
OMNITRACE_PERFETTO_BACKEND          = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB   = 1024000
OMNITRACE_PERFETTO_FILL_POLICY      = discard
OMNITRACE_PROCESS_SAMPLING_DURATION = -1
OMNITRACE_PROCESS_SAMPLING_FREQ     = 0
OMNITRACE_ROCM_EVENTS               = 
OMNITRACE_SAMPLING_CPUS             = none
OMNITRACE_SAMPLING_DELAY            = 0.5
OMNITRACE_SAMPLING_DURATION         = 0
OMNITRACE_SAMPLING_FREQ             = 300
OMNITRACE_SAMPLING_GPUS             = all
OMNITRACE_SAMPLING_OVERFLOW_EVENT   = perf::PERF_COUNT_HW_CACHE_REFERENCES
OMNITRACE_TIME_OUTPUT               = true
OMNITRACE_TIMEMORY_COMPONENTS       = wall_clock
OMNITRACE_VERBOSE                   = 1
OMNITRACE_ENABLED                   = true
OMNITRACE_SUPPRESS_CONFIG           = false 
OMNITRACE_SUPPRESS_PARSING          = false 

Triton code

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr,  y_ptr, output_ptr,  n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

def main():
    torch.manual_seed(0)
    size = 98432
    x = torch.rand(size, device='cuda')
    y = torch.rand(size, device='cuda')

    output_triton = add(x, y)

if __name__ == "__main__":
    main()

My setup: MI300X, omnitrace 1.11, rocm-6.0.0-91

Any idea why the HIP activity track doesn't show up with the same config when I profile the triton kernel?

jrmadsen commented 8 months ago

Try:

omnitrace-run -c config-file.cfg -- python -m omnitrace <triton-python-script>

anupambhatnagar commented 8 months ago

omnitrace-run -c config-file.cfg -- python -m omnitrace <triton-python-script>

That didn't help; I get the same result as before.

jrmadsen commented 8 months ago

There may be some issues regardless that require a more detailed explanation. I've got a full docket today, so I'll try to provide that once I have some time.

jrmadsen commented 8 months ago

But in the meantime, I'll just let you know that you'll probably want to play with LD_LIBRARY_PATH to get Omnitrace to use the same ROCm libraries as PyTorch. That may not be possible if PyTorch doesn't have/use ROCm libraries with SOVERSIONs (e.g. only libroctracer.so instead of libroctracer.so.4). It's something we have a solution for in the new rocprofiler, but until it's released there's very little Omnitrace can do.
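
To see what you would be pointing LD_LIBRARY_PATH at, something like this lists the ROCm runtime libraries PyTorch ships. It's just a sketch; it assumes a ROCm wheel that bundles them under torch/lib, so treat the path and the name filter as assumptions.

# Sketch: list the ROCm runtime libraries a ROCm build of PyTorch bundles
# (assumed to live under torch/lib), to compare against the system install
# before deciding what to expose via LD_LIBRARY_PATH.
import os
import torch

torch_lib = os.path.join(os.path.dirname(torch.__file__), "lib")
for name in sorted(os.listdir(torch_lib)):
    if any(tag in name for tag in ("amdhip", "hsa", "roctracer", "roctx")):
        print(os.path.join(torch_lib, name))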

jrmadsen commented 8 months ago

Well actually, I've probably got enough time now. The fundamental problem I've seen with some PyTorch apps in the past is that PyTorch has an RPATH to the ROCm libraries it installs, and those libs do not have SOVERSIONs. Omnitrace sets an env variable, HSA_TOOLS_LIB, which causes the HSA runtime to call an OnLoad function when it initializes (which is triggered on the first HIP call). When that happens, Omnitrace makes the appropriate calls to roctracer to set up tracing. But roctracer is linked to the HSA and HIP runtimes with SOVERSIONs. My theory (which I haven't fully confirmed, but which empirical evidence from experimenting with LD_PRELOAD and making soft links in PyTorch installs to emulate SOVERSIONs does suggest) is that roctracer ends up communicating with different runtime libraries and effectively enables instrumentation of a different HIP/HSA runtime than the one PyTorch uses. Thus, from Omnitrace's perspective it enables tracing HIP, but the application simply never called the HIP API or launched any kernels.
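
The soft-link experiment I mentioned looks roughly like this. It's only a sketch: the /opt/rocm/lib path and the library basenames are assumptions about your install, and you should back up torch/lib before trying it.

# Sketch of the soft-link experiment: for every versioned ROCm runtime library
# in the system install, create a matching versioned symlink in torch/lib that
# points at PyTorch's bundled (unversioned) copy. The /opt/rocm/lib path and
# the basenames below are assumptions; back up torch/lib first.
import glob
import os
import torch

rocm_lib = "/opt/rocm/lib"
torch_lib = os.path.join(os.path.dirname(torch.__file__), "lib")

for base in ("libamdhip64", "libhsa-runtime64", "libroctracer64"):
    bundled = os.path.join(torch_lib, base + ".so")
    if not os.path.exists(bundled):
        continue
    for versioned in glob.glob(os.path.join(rocm_lib, base + ".so.*")):
        link = os.path.join(torch_lib, os.path.basename(versioned))
        if not os.path.exists(link):
            # e.g. torch/lib/libroctracer64.so.4 -> libroctracer64.so
            os.symlink(base + ".so", link)
            print("created", link)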

jrmadsen commented 8 months ago

Could you do me a favor and run your app normally (without Omnitrace) and, before it exits, print out /proc/self/maps? If there aren't any ROCm libs loaded, the last comment is probably true and the ROCm libs I see in the maps printout of the backtrace are only there because of omnitrace.
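
For the Triton script, dropping a small helper like this into it and calling it as the last line of main() should be enough (just a sketch; the substring filter is only there to keep the output short):

# Sketch: print the ROCm-related lines of /proc/self/maps right before the
# process exits, to see which HIP/HSA/roctracer libraries actually got loaded.
def dump_rocm_maps():
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            path = fields[-1] if len(fields) >= 6 else ""
            if any(tag in path for tag in ("rocm", "amdhip", "hsa", "roctracer")):
                paths.add(path)
    for path in sorted(paths):
        print(path)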

jrmadsen commented 8 months ago

And for the record, the way we are addressing this issue in the new rocprofiler (which combines the capabilities of roctracer and rocprofiler) is that rocprofiler doesn't link to the runtimes; instead, each runtime effectively passes a table of function pointers into rocprofiler when it initializes, guaranteeing that the calls rocprofiler makes (via the function pointers in the table) to enable profiling capabilities are applied to that specific runtime instance. Once this is released and Omnitrace uses the new rocprofiler API, you could have 20 different HIP runtimes and Omnitrace would be able to trace any/all of them.
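
As a toy illustration of that registration pattern (this is not the real rocprofiler API, just the shape of the design):

# Toy illustration of the registration pattern described above; NOT the real
# rocprofiler API. Each runtime instance hands the tool a table of its own
# entry points at init, so any call the tool makes through that table lands
# on that specific instance, no matter how many copies of the runtime are
# loaded in the process.
class ToyRuntime:
    def __init__(self, name):
        self.name = name

    def enable_tracing(self):
        print(f"tracing enabled on {self.name}")

    def initialize(self, tool):
        # Pass this instance's dispatch table to the tool.
        tool.on_runtime_init({"enable_tracing": self.enable_tracing})

class ToyTool:
    def on_runtime_init(self, table):
        # The tool never resolves the runtime through the dynamic linker;
        # it only calls through the table it was handed.
        table["enable_tracing"]()

tool = ToyTool()
for runtime in (ToyRuntime("HIP runtime bundled with PyTorch"),
                ToyRuntime("HIP runtime in /opt/rocm")):
    runtime.initialize(tool)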

anupambhatnagar commented 8 months ago

Thanks for the detailed answer. I'll try to get the maps and share them with you.

Could you please join the ext-amd-meta slack channel? I sent a request to Weijun Jiang yesterday to add you to it. It would make collaborating on this easier. Thanks!

ppanchad-amd commented 1 week ago

Hi @anupambhatnagar. Has your issue been resolved? If so, please close the ticket. Thanks!