ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware
https://rocm.docs.amd.com/projects/omniperf/en/latest/
MIT License

Add L1<->L2 bandwidth calculation #37

Closed skyreflectedinmirrors closed 1 year ago

skyreflectedinmirrors commented 1 year ago

Omniperf currently does not report the achieved L2 bandwidth from the L1s, despite collecting the counters required to do so. Following the convention for L1 bandwidth calculations, this is essentially the total amount of data moved from L1<->L2, which can be calculated from the L1<->L2 requests, e.g.:

https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_cli/configs/gfx90a/1600_L1_cache.yaml#L173

The L2 bandwidth calculation would be:

L2 BW = 64B * (TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) / $denom

skyreflectedinmirrors commented 1 year ago

It was pointed out that L1<->L2 bandwidth is a better naming scheme for this, as this excludes instruction cache and scalar cache traffic.
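For readers who want to check the arithmetic, here is a minimal Python sketch of the formula above. It is not Omniperf's implementation: the function name, the dictionary input, and the sample counter values are assumptions made for illustration, and `denom` stands in for whatever `$denom` resolves to in a given configuration. Only the counter names and the 64B request size come from this issue.

```python
# Illustrative sketch only; not taken from the Omniperf source.
# Counter names and the 64-byte request size come from the issue text.
REQUEST_BYTES = 64

L1_L2_COUNTERS = (
    "TCP_TCC_READ_REQ_sum",
    "TCP_TCC_WRITE_REQ_sum",
    "TCP_TCC_ATOMIC_WITH_RET_REQ_sum",
    "TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum",
)


def l1_l2_bandwidth(counters, denom):
    """Return bytes moved between L1 and L2 per unit of `denom`.

    `counters` is a hypothetical mapping of counter name to summed value;
    `denom` is a stand-in for the $denom normalization value.
    """
    if denom <= 0:
        raise ValueError("denom must be positive")
    total_requests = sum(counters[name] for name in L1_L2_COUNTERS)
    return REQUEST_BYTES * total_requests / denom


# Made-up sample values, chosen only to exercise the formula.
sample = {
    "TCP_TCC_READ_REQ_sum": 1000,
    "TCP_TCC_WRITE_REQ_sum": 500,
    "TCP_TCC_ATOMIC_WITH_RET_REQ_sum": 20,
    "TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum": 30,
}
bandwidth = l1_l2_bandwidth(sample, denom=10)
```

Because the sum covers only the TCP to TCC request counters, instruction cache and scalar cache traffic are excluded, which is why the L1<->L2 name fits better than "L2 bandwidth".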

coleramos425 commented 1 year ago

Is the L1D Cache Accesses panel the only place we want to add this, @arghdos? To summarize, I:

  1. Changed name of L1-TCR -> L1-TCC
  2. Added L1-L2 BW using:
    L1_L2_BW = 64B * (TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) / $denom

Before: image

After: image

skyreflectedinmirrors commented 1 year ago

> Changed name of L1-TCR -> L1-TCC

Change to match the others (L1-L2)

> L1-L2 BW

I would prefer that we separate the request rows from the bandwidth value (right now the order is L1-L2 Read Requests, L1-L2 Bandwidth, L1-L2 Write Requests, ...)

Additionally, make sure the units of the L1-L2 BW are Bytes per $denom, not Requests per $denom, as they are now.
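The two requests in this comment (request rows grouped together, and the bandwidth row reported in bytes) can be sketched as follows. This is a hypothetical illustration in Python, not the panel's actual YAML definition: the function name, the tuple layout, the "L1-L2 Atomic Requests" row, and the `denom_label` parameter are all assumptions. Only the counter names, the 64B factor, and the Read/Write/Bandwidth row names come from this thread.

```python
# Hypothetical sketch of the requested row ordering and units.
REQUEST_BYTES = 64  # bytes per request, per the formula in this issue


def build_l1_l2_rows(counters, denom, denom_label):
    """Return (metric, value, unit) rows: request rows first, bandwidth last."""
    atomics = (
        counters["TCP_TCC_ATOMIC_WITH_RET_REQ_sum"]
        + counters["TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum"]
    )
    # Request rows stay adjacent; the atomic row name is an assumption.
    requests = [
        ("L1-L2 Read Requests", counters["TCP_TCC_READ_REQ_sum"]),
        ("L1-L2 Write Requests", counters["TCP_TCC_WRITE_REQ_sum"]),
        ("L1-L2 Atomic Requests", atomics),
    ]
    rows = [
        (name, value / denom, f"Requests per {denom_label}")
        for name, value in requests
    ]
    # Bandwidth is a byte count, so it carries a bytes unit, not a request unit.
    total_requests = sum(value for _, value in requests)
    rows.append(
        (
            "L1-L2 Bandwidth",
            REQUEST_BYTES * total_requests / denom,
            f"Bytes per {denom_label}",
        )
    )
    return rows


# Made-up sample values for illustration.
sample = {
    "TCP_TCC_READ_REQ_sum": 1000,
    "TCP_TCC_WRITE_REQ_sum": 500,
    "TCP_TCC_ATOMIC_WITH_RET_REQ_sum": 20,
    "TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum": 30,
}
rows = build_l1_l2_rows(sample, denom=10, denom_label="wave")
```

Keeping the unit string next to each value makes a mismatch such as a bandwidth row labeled "Requests per $denom" easy to spot in review.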

coleramos425 commented 1 year ago

Merged fixes. Closing ticket.