Open anupambhatnagar opened 8 months ago
Based on the error message here, it looks like rocm-smi doesn’t support getting the temperature on MI300, so omnitrace disables rocm-smi sampling; that is why you don’t see any activity.
How can I enable the collection of events from rocm-smi and view the HIP Activity Device?
It looks like you won’t be able to collect ROCm-SMI data until there is either an omnitrace patch to selectively collect only the queries that are supported, or rocm-smi adds full support for MI300.
I saw the error logs that you highlighted. I just wanted to confirm if it is expected behavior or not.
For my current use case, HIP Activity Device is much more important than the metrics provided by rocm-smi. I collected a trace of the following toy program, and it shows the device activity track.
#include <cstdio>
#include <hip/hip_runtime.h>

#define CHECK(status) do { check((status), __FILE__, __LINE__); } while(false)

inline static void check(hipError_t error_code, const char *file, int line)
{
    if (error_code != hipSuccess)
    {
        fprintf(stderr, "HIP Error %d %s: %s. In file '%s' on line %d\n",
                error_code, hipGetErrorName(error_code),
                hipGetErrorString(error_code), file, line);
        fflush(stderr);
        exit(error_code);
    }
}

__global__ void dummy_kernel(int a)
{
    printf("I am dummy kernel %d\n", a);
}

int main()
{
    printf("AAA\n");
    CHECK(hipDeviceSynchronize());
    printf("BBB\n");
    dummy_kernel<<<1, 1>>>(1);
    printf("CCC\n");
    CHECK(hipDeviceSynchronize());
    printf("DDD\n");
    CHECK(hipStreamAddCallback(0, [](hipStream_t stream_, hipError_t status_, void *arg) {
        printf("I am host function\n");
    }, nullptr, 0));
    printf("EEE\n");
    CHECK(hipDeviceSynchronize());
    printf("FFF\n");
    dummy_kernel<<<1, 1>>>(2);
    printf("GGG\n");
    CHECK(hipDeviceSynchronize());
    printf("HHH\n");
    return 0;
}
Command used to profile the above example: omnitrace-run -c config-file.cfg -- ./binary_name
# auto-generated by omnitrace-avail (version 1.11.0) on 2024-01-24 @ 18:12
OMNITRACE_CONFIG_FILE =
OMNITRACE_TRACE =
OMNITRACE_TRACE_DELAY = 0
OMNITRACE_TRACE_DURATION = 0
OMNITRACE_TRACE_PERIOD_CLOCK_ID = CLOCK_REALTIME
OMNITRACE_TRACE_PERIODS =
OMNITRACE_PROFILE = false
OMNITRACE_USE_SAMPLING = true
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCTRACER = true
OMNITRACE_USE_ROCM_SMI = false
OMNITRACE_USE_KOKKOSP = false
OMNITRACE_USE_CAUSAL = false
OMNITRACE_USE_MPIP = true
OMNITRACE_USE_PID = true
OMNITRACE_USE_RCCLP = false
OMNITRACE_USE_ROCPROFILER = true
OMNITRACE_USE_ROCTX = false
OMNITRACE_OUTPUT_PATH = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX =
OMNITRACE_CAUSAL_BACKEND = auto
OMNITRACE_CAUSAL_BINARY_EXCLUDE =
OMNITRACE_CAUSAL_BINARY_SCOPE = %MAIN%
OMNITRACE_CAUSAL_DELAY = 0
OMNITRACE_CAUSAL_DURATION = 0
OMNITRACE_CAUSAL_FUNCTION_EXCLUDE =
OMNITRACE_CAUSAL_FUNCTION_SCOPE =
OMNITRACE_CAUSAL_MODE = function
OMNITRACE_CAUSAL_RANDOM_SEED = 0
OMNITRACE_CAUSAL_SOURCE_EXCLUDE =
OMNITRACE_CAUSAL_SOURCE_SCOPE =
OMNITRACE_CRITICAL_TRACE = false
OMNITRACE_PAPI_EVENTS =
OMNITRACE_PERFETTO_BACKEND = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB = 1024000
OMNITRACE_PERFETTO_FILL_POLICY = discard
OMNITRACE_PROCESS_SAMPLING_DURATION = -1
OMNITRACE_PROCESS_SAMPLING_FREQ = 0
OMNITRACE_ROCM_EVENTS =
OMNITRACE_SAMPLING_CPUS = none
OMNITRACE_SAMPLING_DELAY = 0.5
OMNITRACE_SAMPLING_DURATION = 0
OMNITRACE_SAMPLING_FREQ = 300
OMNITRACE_SAMPLING_GPUS = all
OMNITRACE_SAMPLING_OVERFLOW_EVENT = perf::PERF_COUNT_HW_CACHE_REFERENCES
OMNITRACE_TIME_OUTPUT = true
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock
OMNITRACE_VERBOSE = 1
OMNITRACE_ENABLED = true
OMNITRACE_SUPPRESS_CONFIG = false
OMNITRACE_SUPPRESS_PARSING = false
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

def main():
    torch.manual_seed(0)
    size = 98432
    x = torch.rand(size, device='cuda')
    y = torch.rand(size, device='cuda')
    output_triton = add(x, y)

if __name__ == "__main__":
    main()
My setup: MI300X, omnitrace 1.11, rocm-6.0.0-91
Any idea why the HIP activity trace doesn't render with the same config while I profile the triton kernel?
Try:
omnitrace-run -c config-file.cfg -- python -m omnitrace <triton-python-script>
That didn't help; same result as before.
There may be some issues regardless, which require a detailed explanation. I’ve got a full docket today, so I’ll try to provide that once I’ve got some time.
But in the meantime, I’ll just let you know that you’ll probably want to experiment with LD_LIBRARY_PATH to get Omnitrace to use the same ROCm libraries as PyTorch. That may not be possible if PyTorch doesn’t ship or use ROCm libraries with SOVERSIONs (e.g. only libroctracer.so instead of libroctracer.so.4). It’s something we have a solution for in the new rocprofiler, but until it’s released there’s very little Omnitrace can do.
Well actually, I’ve probably got enough time now. The fundamental problem I’ve seen with some PyTorch apps in the past is that PyTorch has an RPATH to the ROCm libraries it installs, and those libs do not have SOVERSIONs. Omnitrace sets an env variable, HSA_TOOLS_LIB, which causes the HSA runtime to call an OnLoad function when it initializes (triggered on the first HIP call). When that happens, Omnitrace makes the appropriate calls to roctracer to set up tracing. But roctracer is linked to the HSA and HIP runtimes with SOVERSIONs. My theory, which I haven’t fully confirmed but which empirical evidence from experimenting with LD_PRELOAD and making soft links in PyTorch installs to emulate SOVERSIONs does suggest, is that roctracer ends up communicating with different runtime libraries and effectively enables instrumenting a different HIP/HSA runtime than the one PyTorch uses. Thus, from Omnitrace’s perspective it enables tracing HIP, but the application simply never called the HIP API or launched any kernels.
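For reference, the soft-link experiment mentioned above can be sketched roughly like this. This is a hypothetical helper, not an Omnitrace feature: the directory, library names, and version suffixes are all assumptions, so check which SONAMEs your roctracer actually links against (e.g. with readelf -d) before trying it.

```python
import os

def add_soversion_links(libdir, links):
    """Create versioned symlinks (e.g. libamdhip64.so.6 -> libamdhip64.so)
    inside a PyTorch-bundled ROCm lib directory, so that a SOVERSIONed
    consumer such as roctracer resolves to the same runtime instance that
    PyTorch uses.

    links maps an unversioned library filename to its expected versioned
    name. Returns the list of links created; existing files are left alone.
    """
    created = []
    for plain, versioned in links.items():
        src = os.path.join(libdir, plain)
        dst = os.path.join(libdir, versioned)
        if os.path.exists(src) and not os.path.exists(dst):
            os.symlink(plain, dst)  # relative link, resolved within libdir
            created.append(dst)
    return created

# Example call (paths and version numbers are hypothetical):
# add_soversion_links("/path/to/site-packages/torch/lib",
#                     {"libamdhip64.so": "libamdhip64.so.6",
#                      "libroctracer64.so": "libroctracer64.so.4"})
```

After creating the links, you would also prepend that directory to LD_LIBRARY_PATH so the dynamic loader finds the versioned names there first.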
Could you do me a favor and run your app normally (without Omnitrace) and, before it exits, print out /proc/self/maps? If no ROCm libs show up there, the last comment is probably true, and the ROCm libs I see in the maps printout of the backtrace are only present because of omnitrace.
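If it helps, here is a rough Python sketch of that check, which could be appended to the Triton script (run without omnitrace) to list any ROCm runtime libraries mapped into the process. The library names are assumptions based on common ROCm SONAMEs (libamdhip64, libhsa-runtime64, libroctracer64); adjust them for your install.

```python
def loaded_rocm_libs(maps_path="/proc/self/maps"):
    """Scan a /proc/<pid>/maps-format file and return the mapped
    library paths whose names look like ROCm runtime libraries."""
    rocm_names = ("libamdhip64", "libhsa-runtime64", "libroctracer64")
    found = set()
    with open(maps_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 6:  # pathname is the 6th field when present
                path = parts[5]
                if any(name in path for name in rocm_names):
                    found.add(path)
    return sorted(found)

if __name__ == "__main__":
    libs = loaded_rocm_libs()
    if libs:
        print("ROCm runtime libraries mapped into this process:")
        for lib in libs:
            print(" ", lib)
    else:
        print("No ROCm runtime libraries found in /proc/self/maps")
```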
And for the record, the way we are addressing this issue in the new rocprofiler (which combines the capabilities of roctracer and rocprofiler) is that rocprofiler doesn’t link to the runtimes. Instead, each runtime passes a table of function pointers into rocprofiler when it initializes, guaranteeing that the calls rocprofiler makes through that table to enable profiling capabilities are applied to that specific runtime instance. Once this is released and Omnitrace uses the new rocprofiler API, you could have 20 different HIP runtimes and Omnitrace would be able to trace any or all of them.
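To illustrate that design with a toy model (this is not the actual rocprofiler API, just a sketch of the registration pattern): each runtime instance hands the profiler its own function table at init time, so enabling tracing operates on every registered instance directly rather than on whichever library the dynamic linker happened to bind.

```python
class Profiler:
    """Toy stand-in for a profiler that receives dispatch tables."""
    def __init__(self):
        self.tables = []  # one function table per registered runtime instance

    def register_runtime(self, table):
        self.tables.append(table)

    def enable_tracing(self):
        # Calls go through each instance's own table, never a global lookup.
        for table in self.tables:
            table["enable_callbacks"]()

class HipRuntime:
    """Toy stand-in for a runtime instance that registers itself."""
    def __init__(self, name, profiler):
        self.name = name
        self.tracing = False
        # On init, the runtime pushes its function table into the profiler.
        profiler.register_runtime({"enable_callbacks": self._enable})

    def _enable(self):
        self.tracing = True

profiler = Profiler()
runtimes = [HipRuntime(f"hip-{i}", profiler) for i in range(3)]
profiler.enable_tracing()
print(all(rt.tracing for rt in runtimes))  # prints True: every instance is traced
```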
Thanks for the detailed answer. I'll try to get the maps and share them with you.
Could you please join the ext-amd-meta slack channel? I sent a request to Weijun Jiang yesterday to add you to it. It would make collaborating on this easier. Thanks!
Hi @anupambhatnagar. Has your issue been resolved? If so, please close the ticket. Thanks!
Hi, I'm profiling a triton kernel on MI300 with rocm 6.0.0.
The collected trace fails to collect events from ROCm-SMI. The backtrace is available here. The omnitrace config I use is here.
How can I enable the collection of events from rocm-smi and view the HIP Activity Device?
Thanks!
P.S. I installed omnitrace using omnitrace-1.11.0-rhel-9.3-ROCm-60000-PAPI-OMPT-Python3.sh from the releases page.