ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/

L2 Cache read/write metrics #85

Closed lingjiew93 closed 1 year ago

lingjiew93 commented 2 years ago

Hi,

I know there are metrics for HBM (video memory) reads and writes. Are there any metrics for L2 cache reads and writes? My card is an MI100.

kikimych commented 2 years ago

https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/test/tool/metrics.xml#L199

kikimych commented 2 years ago

The L2 miss rate is sometimes not very meaningful. It is possible to underutilize memory bandwidth even with a good L2 hit rate.

lingjiew93 commented 2 years ago

Thanks for your reply. In some cases it's useful to know the memory traffic of L2, and NVIDIA has metrics that report the read/write traffic. BTW, gfx908 has 32 TCC_HIT and TCC_MISS instances, but the equation for L2CacheHit seems to consider only half of them.
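To illustrate why the number of instances matters, here is a minimal Python sketch. It assumes the metric has the usual hits / (hits + misses) form; the function name and the per-instance counts are made up for illustration and are not taken from metrics.xml.

```python
from typing import Sequence


def l2_cache_hit_percent(hits: Sequence[int], misses: Sequence[int]) -> float:
    """Percentage of L2 (TCC) requests that hit, summed over the instances passed in."""
    if len(hits) != len(misses):
        raise ValueError("need one hit count and one miss count per TCC instance")
    total_hits = sum(hits)
    total_requests = total_hits + sum(misses)
    return 100.0 * total_hits / total_requests if total_requests else 0.0


# Hypothetical counts for 32 TCC instances; the second half behaves differently.
hits = [900] * 16 + [100] * 16
misses = [100] * 16 + [900] * 16

all_instances = l2_cache_hit_percent(hits, misses)
first_half_only = l2_cache_hit_percent(hits[:16], misses[:16])
print(f"all 32 instances: {all_instances:.1f}%")
print(f"first 16 only:    {first_half_only:.1f}%")
```

If traffic is spread evenly across instances the two results would be close, so the discrepancy only shows up when the instances are unevenly loaded.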

kikimych commented 2 years ago

https://github.com/ROCm-Developer-Tools/rocprofiler/pull/87/files

lingjiew93 commented 2 years ago

Hi, is there any documentation for the abbreviations used in counter and metric names? I know some of them, but the rest are really confusing to me. For example: SQ, TA, TA_FLAT, TCC, TCC_EA, TCP. I would really appreciate it if you could explain these.

kikimych commented 2 years ago

SQ is an abbreviation of sequencer, the hardware dispatcher. It issues vector ALU, scalar ALU, branch, memory, local data store, and matrix ALU instructions. TA, TA_FLAT - texture array. I suppose they are not too helpful for compute workloads. TCC/TCC_EA - L2 cache events. TCP - L1 cache events.

https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf - precise instruction set description.

https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah - brief explanation of the GCN architecture. The CDNA/RDNA architectures are mostly the same. I recommend starting with this if you don't have a basic understanding of how the hardware works.

lingjiew93 commented 2 years ago

Thanks! Do you plan to add a metric for L2 read and write traffic?

kikimych commented 2 years ago

TA is the texture address block. It calculates the effective address of load/store instructions, then coalesces memory requests to adjacent addresses into one request.

kikimych commented 2 years ago

> Thanks! Do you have the plan to add the metric of L2 read and write traffic?

Same as in gfx906. But it is possible to write a program in which every load is 1 byte in size, with strides crafted so that each load falls in a different cache block. In that case this metric will report 32/64 times more bytes than were actually transferred.
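The overcount can be worked through with a small Python sketch. It assumes a simplified model in which the metric counts one full cache block per distinct block touched, with a block size of 32 or 64 bytes; the function and its parameters are hypothetical and only illustrate the arithmetic.

```python
def reported_vs_actual(num_loads: int, load_bytes: int,
                       stride_bytes: int, block_bytes: int) -> tuple[int, int]:
    """Return (bytes a block-granular metric would report, bytes actually requested).

    Assumes load i starts at address i * stride_bytes and that no load
    straddles a block boundary.
    """
    # Each distinct block touched is counted as one full block of traffic.
    blocks_touched = {(i * stride_bytes) // block_bytes for i in range(num_loads)}
    reported = len(blocks_touched) * block_bytes
    actual = num_loads * load_bytes
    return reported, actual


for block in (32, 64):
    # Worst case: 1-byte loads, each landing in a different block.
    reported, actual = reported_vs_actual(1000, 1, stride_bytes=block, block_bytes=block)
    print(f"block={block}B strided:    reported={reported} actual={actual} "
          f"ratio={reported // actual}x")

# Contiguous 1-byte loads share blocks, so the overcount nearly disappears.
reported, actual = reported_vs_actual(1000, 1, stride_bytes=1, block_bytes=64)
print(f"block=64B contiguous: reported={reported} actual={actual}")
```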

lingjiew93 commented 2 years ago

Yes, the cache line size may have some influence on it. But that still seems to be the memory traffic between L2 and HBM. What I'm asking about is the data read from L2 into L1/LDS and written from L1/LDS to L2. One possible way I'm considering is to multiply the TCC_HIT count by the cache line size. But this needs to be verified.
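A rough sketch of that estimate in Python, under the unverified assumption that every TCC_HIT corresponds to one full cache line moved. The function name, the sample counts, and the line size passed in are placeholders, not values confirmed for MI100.

```python
from typing import Sequence


def estimate_l2_hit_bytes(tcc_hit_per_instance: Sequence[int], line_bytes: int) -> int:
    """Estimate bytes served by L2 hits as (total TCC_HIT) * (cache line size).

    Unverified assumption: each hit transfers a whole cache line. Requests
    smaller than a line are therefore overcounted.
    """
    if line_bytes <= 0:
        raise ValueError("line_bytes must be positive")
    return sum(tcc_hit_per_instance) * line_bytes


# Hypothetical TCC_HIT counts for 32 instances and an assumed 64-byte line.
tcc_hit = [1_000] * 32
print(estimate_l2_hit_bytes(tcc_hit, line_bytes=64), "bytes (upper-bound estimate)")
```

Summing over all instances, rather than a subset, matters here for the same reason it does for the hit-rate equation.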

lingjiew93 commented 1 year ago

Closing this, as there has been no update.