ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

Enabling Detailed Profiling of Graph Nodes in OmniTrace #335

Open OmarSayedMostafa opened 5 months ago

OmarSayedMostafa commented 5 months ago

Hi, I am currently profiling vLLM, and I observed that the tool captures the execution of graph kernels at a high level but does not provide detailed insight into the execution of individual graph nodes. (Screenshots attached.)

My goal is to obtain detailed profiling information on the execution of individual graph nodes, similar to what NVIDIA Nsight offers: it tracks individual nodes rather than only graph-level execution. (Screenshot attached.)

I am seeking guidance or a workaround to enable detailed profiling of graph nodes within OmniTrace. Are there any configuration options that would help?

Here is the command I use:
omnitrace-run -c ~/.omnitrace.cfg --enable-categories device-critical-trace device_busy device_hip device_hsa device_memory_usage python rocm_hip rocm_hsa rocm_smi rocprofiler roctracer --roctracer-hip-activity --roctracer-hip-api --roctracer-hsa-activity --roctracer-hsa-api -- python -m omnitrace -- vllm_benchmark.py

Thanks in advance.

jrmadsen commented 5 months ago

Given that the arrows flow from the API functions to multiple kernels, it appears that you are indeed getting the individual graph node execution. The --roctracer-hsa-activity option that you have enables that. You might want to remove the --hip-device-activity option because that is the “high-level” kernel tracing option, and doing both simultaneously might be doing funny things with the connection of the flow events. It could also contribute to why none of the kernel function names are getting resolved beyond “Kernel Execution”.
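
For illustration only, here is a sketch of what the adjusted invocation might look like. The command in the original post contains no flag literally named --hip-device-activity, so this sketch assumes the suggestion refers to --roctracer-hip-activity and drops that one flag; that mapping is an assumption, not something confirmed in this thread. Every other flag is copied unchanged from the original command. The snippet only builds and prints the command so it can be reviewed before running.

```shell
# Categories copied verbatim from the command in the original post.
CATEGORIES="device-critical-trace device_busy device_hip device_hsa device_memory_usage python rocm_hip rocm_hsa rocm_smi rocprofiler roctracer"

# ASSUMPTION: the "high-level" option mentioned above corresponds to
# --roctracer-hip-activity, so it is the only flag omitted here.
# --roctracer-hsa-activity is kept, since it is the option credited
# with enabling the individual graph node activity.
CMD="omnitrace-run -c $HOME/.omnitrace.cfg --enable-categories $CATEGORIES --roctracer-hip-api --roctracer-hsa-activity --roctracer-hsa-api -- python -m omnitrace -- vllm_benchmark.py"

# Print the command for review instead of executing it.
echo "$CMD"
```

After checking the printed command, it can be pasted into the terminal as-is. If the kernel names still show up only as “Kernel Execution”, the assumption about which flag to drop is the first thing to revisit.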