The inconsistency between filtering by kernel and dispatch

ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware

https://rocm.docs.amd.com/projects/omniperf/en/latest/

MIT License


The inconsistency between filtering by kernel and dispatch #382

Closed bangtianliu closed 1 week ago

bangtianliu commented 3 months ago

Describe the bug

I used Omniperf to profile Stable Diffusion XL (SDXL) on an MI300X, where a single matmul_transpose_b kernel is executed 180 times. My focus is on the performance behavior of this matmul_transpose_b kernel. However, when I filtered by kernel and by dispatch, I noticed some inconsistencies. Please see the screenshots below for details; they show a difference in the reported L2 cache hit rate.

Development Environment:

Linux Distribution: Ubuntu 22.04.2 LTS
Omniperf Version: 2.0.1 (release)
GPU: MI300X
Cluster (if applicable): [e.g. Crusher, ]

To Reproduce

Steps to reproduce the behavior: find an application that runs the same kernel many times on the GPU, then compare the results of filtering by dispatch with the results of filtering by kernel.


Screenshots


coleramos425 commented 3 months ago

Thanks @bangtianliu. For the record, I tried to reproduce this issue on an MI250 with the latest version of Omniperf (dev) and could not reproduce it. The next step in this ticket would be to try reproducing on an MI300X.

Assigning the issue to project PM for triage.

ppanchad-amd commented 3 weeks ago

Hi @bangtianliu. An internal ticket has been created to investigate this issue on MI300X. Thanks!

jamesxu2 commented 1 week ago

Hi @bangtianliu,

I tried this test using the convolution application example from ROCm Examples. This example dispatches the same kernel for a user-configurable number of iterations.

"However, when I tried to filter by kernel and dispatch, I noticed some inconsistencies."

A specific kernel may be dispatched multiple times, so I'm not sure I understand what the issue is. I think you should expect some inconsistency between the per-kernel statistics and the per-dispatch statistics, since the reported kernel statistic would be an average of the dispatch statistics. Each individual dispatch of the same kernel might run slightly differently, due to other competing workloads on the GPU or other environment variations.

I might be misunderstanding the issue though, so please let me know if this isn't what you're asking about.

Also, in the two screenshots you've provided, only one of them shows the speed of light metric.

bangtianliu commented 1 week ago

Yes, I was talking about the difference between per-kernel and per-dispatch statistics.

jamesxu2 commented 1 week ago

Have I answered your question then, @bangtianliu?

The per-kernel metric is an aggregate of the per-dispatch metrics for that specific kernel. You should expect individual dispatches to vary slightly in their metrics due to environmental variations, and to have metrics that differ from the average.
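As a rough illustration of the point above, here is a minimal Python sketch that treats the per-kernel value as a plain mean of per-dispatch values. The hit rates are made-up numbers, not measurements, and the use of a simple arithmetic mean is an assumption for illustration; it is not taken from Omniperf's implementation.

```python
from statistics import mean

# Hypothetical per-dispatch L2 cache hit rates (%) for a single kernel.
# These are illustrative values only, not data from the issue.
dispatch_l2_hit_rate = [71.8, 72.4, 70.9, 73.1, 71.3]


def per_kernel_average(values):
    """Aggregate per-dispatch values into one per-kernel value (plain mean)."""
    return mean(values)


kernel_avg = per_kernel_average(dispatch_l2_hit_rate)

# How far each individual dispatch sits from the per-kernel average.
deviations = [round(v - kernel_avg, 2) for v in dispatch_l2_hit_rate]

for index, (value, delta) in enumerate(zip(dispatch_l2_hit_rate, deviations)):
    print(f"dispatch {index}: {value:.1f}% ({delta:+.2f} vs. kernel average)")
print(f"per-kernel average: {kernel_avg:.2f}%")
```

No single dispatch has to match the aggregate, so filtering by one dispatch and filtering by the kernel can legitimately show different numbers.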

bangtianliu commented 1 week ago

Yes, but I noticed a significant difference in my case before. Currently, I’m working on an unrelated task, but I may reach out to you later once I can replicate the results.

jamesxu2 commented 1 week ago

Please feel free to reopen this ticket if you have something specific that you want to report.

Also, if you're noticing significant deviation in per-dispatch performance, that might not be related to Omniperf at all, and might be an artifact of your workload. You may want to refile such tickets in the appropriate repository unless there's a reason to believe that Omniperf is misreporting those metrics.
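One way to judge whether a deviation is "significant" before filing a report is to compare each dispatch with the median across all dispatches of the same kernel. The sketch below is only an illustration: the function name `flag_deviating_dispatches`, the 10% tolerance, and the sample values are all hypothetical, and this is not an Omniperf feature.

```python
from statistics import median


def flag_deviating_dispatches(values, rel_tol=0.10):
    """Return indices of dispatches whose metric differs from the median
    by more than rel_tol (a fraction; 0.10 means 10%)."""
    mid = median(values)
    if mid == 0:
        # Relative deviation is undefined around zero; flag any non-zero value.
        return [i for i, v in enumerate(values) if v != 0]
    return [i for i, v in enumerate(values) if abs(v - mid) / abs(mid) > rel_tol]


# Hypothetical per-dispatch L2 hit rates (%); the fourth value is a deliberate outlier.
hit_rates = [72.0, 71.5, 72.3, 45.0, 71.9]
flagged = flag_deviating_dispatches(hit_rates)
print("dispatches worth a closer look:", flagged)
```

The median is used rather than the mean so that a single outlying dispatch does not shift the baseline it is compared against.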

bangtianliu commented 1 week ago

Sure, thanks!