lingjiew93 closed this 1 year ago
The L2 miss rate is not always meaningful. It is possible to underutilize memory bandwidth even with a good L2 hit rate.
Thanks for your reply. In some cases it's useful to know the memory traffic of L2, and NVIDIA has some metrics for the read/write traffic. BTW, gfx908 has 32 TCC_HIT and TCC_MISS instances, but it seems the L2CacheHit equation only considers half of them.
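To make the concern about counter instances concrete, here is a minimal sketch of how a hit rate changes depending on how many TCC instances are summed. The function name, the list-based input, and the sample values are all hypothetical; this is not the profiler's actual L2CacheHit definition.

```python
def l2_hit_rate(hits, misses, instances=None):
    """Percent hit rate over the first `instances` TCC instances (all if None)."""
    n = len(hits) if instances is None else instances
    total_hits = sum(hits[:n])
    total_misses = sum(misses[:n])
    accesses = total_hits + total_misses
    return 100.0 * total_hits / accesses if accesses else 0.0

# Hypothetical per-instance counter values for 32 TCC instances.
hits = [900] * 16 + [100] * 16
misses = [100] * 16 + [900] * 16

half = l2_hit_rate(hits, misses, instances=16)  # only the first 16 instances
full = l2_hit_rate(hits, misses)                # all 32 instances
print(half, full)
```

With made-up values like these, summing only half of the instances can give a very different result from summing all of them, which is why the number of instances in the equation matters.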
Hi, is there any documentation for the abbreviations used in counter and metric names? I know some of them, but the rest are really confusing to me. For example: SQ, TA, TA_FLAT, TCC, TCC_EA, TCP. I would really appreciate it if you could explain these.
SQ is an abbreviation of sequencer, the hardware dispatcher. It issues vector ALU, scalar ALU, branch, memory, local data store, and matrix ALU instructions. TA, TA_FLAT - texture array. I suppose they are not very helpful for compute workloads. TCC/TCC_EA - L2 cache events. TCP - L1 cache events.
https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf - precise instruction set description.
https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah - brief explanation of the GCN architecture. The CDNA/RDNA architectures are mostly the same. I recommend starting with this if you don't have a basic understanding of how the hardware works.
Thanks! Do you plan to add a metric for L2 read and write traffic?
TA is the texture address block. It calculates the effective addresses of load/store instructions, then coalesces memory requests to adjacent addresses into one request.
"Thanks! Do you plan to add a metric for L2 read and write traffic?" Same as in gfx906. But it is possible to write a program in which all loads are 1 byte in size, with strides crafted so that each load falls in a different cache block. In that case this metric will report 32/64 times more bytes than were actually transferred.
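The over-reporting described here can be written out as simple arithmetic. This is a sketch only: it assumes each 1-byte load lands in its own block and is counted as one full request of 32 or 64 bytes (my reading of "32/64 times"), and the function name is hypothetical.

```python
def overcount_factor(load_bytes, request_bytes):
    """Ratio of bytes reported to bytes the program actually asked for,
    assuming every load is counted as one full request of `request_bytes`."""
    return request_bytes / load_bytes

# Assumed request sizes; the real granularity depends on the hardware.
for request_bytes in (32, 64):
    print(request_bytes, overcount_factor(1, request_bytes))
```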
Yes, the cache line size may have some influence on it. But that still seems to be the memory read/write traffic between L2 and HBM. What I'm asking about is the reads from L2 to L1/LDS and the writes from L1/LDS to L2. One possible approach I'm considering is to multiply the TCC_HIT count by the cache line size. But this needs to be verified.
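A minimal sketch of that unverified idea: sum TCC_HIT over all instances and multiply by the cache line size. The function name is hypothetical, the sample counts are made up, and the line size of 64 bytes is only a placeholder, not a confirmed value for gfx908. The estimate also ignores requests that miss in L2.

```python
def estimate_l2_hit_bytes(tcc_hit_per_instance, cache_line_bytes):
    """Rough byte estimate: total L2 hits times an assumed cache line size."""
    return sum(tcc_hit_per_instance) * cache_line_bytes

# Hypothetical TCC_HIT counts from three instances; 64 is a placeholder size.
estimate = estimate_l2_hit_bytes([10, 20, 30], cache_line_bytes=64)
print(estimate)
```

Like the metric discussed earlier, this would over-report whenever a request uses only part of a cache line.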
Closing this as there has been no update.
Hi,
I know there are metrics for HBM (video memory) reads and writes. Are there any metrics for L2 cache reads/writes? My card is an MI100.