RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

Add Performance Groups for Caches to Intel SapphireRapids #594

Closed chriswasser closed 7 months ago

chriswasser commented 7 months ago

Hi @TomTheBear,

This PR adds the two performance groups L2CACHE and L3CACHE to the Intel SapphireRapids architecture. These groups also exist for previous architectures and were adapted from the existing group files. The following adaptations were made:

The new groups were tested with the stream kernel from the LIKWID benchmark suite. As a sanity check, the counter values were compared with prefetchers enabled and disabled, as well as between SkylakeX and SapphireRapids. The reported values seemed sensible, and both architectures showed qualitatively the same behavior.
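For reference, the measurements were done roughly along these lines; the core pinning and working-set size here are placeholders, not the exact invocation used:

```
# run the likwid-bench stream kernel under the new groups on one core of socket 0
likwid-perfctr -C S0:0 -g L2CACHE likwid-bench -t stream -w S0:1GB:1
likwid-perfctr -C S0:0 -g L3CACHE likwid-bench -t stream -w S0:1GB:1
```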

Hope this is helpful and can be integrated into LIKWID. Thanks for the consideration 🤓

Greetings

Christian

TomTheBear commented 7 months ago

Why didn't you take the events used by Intel in its metrics? Example (l2_mpi == L2 miss rate): https://github.com/intel/perfmon/blob/main/SPR/metrics/sapphirerapids_metrics.json#L222-L244
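From memory (not quoting the JSON verbatim), that metric is essentially defined as:

```
l2_mpi = L2_LINES_IN.ALL / INST_RETIRED.ANY    # L2 misses per retired instruction
```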

Please add a comment to the LONG section that you use different events. In the L3CACHE group, with INSTR_RETIRED_ANY but uop events for the hits/misses, you can get values > 100%.
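To make that concrete, the kind of definitions I am referring to look roughly like this in likwid group-file syntax (the event-to-counter mapping here is only assumed for illustration, not copied from the PR):

```
EVENTSET
FIXC0  INSTR_RETIRED_ANY
PMC0   MEM_LOAD_RETIRED_L3_HIT
PMC1   MEM_LOAD_RETIRED_L3_MISS

METRICS
L3 request rate  (PMC0+PMC1)/FIXC0
L3 miss rate     PMC1/FIXC0
L3 miss ratio    PMC1/(PMC0+PMC1)
```

If the MEM_LOAD_RETIRED events counted uops rather than retired instructions, the per-instruction rates in the first two metrics could exceed 1, i.e. report more than 100%.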

chriswasser commented 7 months ago

Ad 1

L2_LINES_IN.ALL seems to measure cache lines, not the number of requests. In the L3 performance groups it is used to measure the data volume rather than the number of L2 misses. Since this has remained the same for a dozen or so Intel architectures, I wanted to stay consistent with the previous groups. A quick test shows a higher value for L2_LINES_IN_ALL than for L2_REQUEST_ALL, resulting in a miss ratio of more than 100%, which does not seem sensible to me:

|     L2_REQUEST_ALL    |   PMC0  |   312527000 |
|    L2_REQUEST_MISS    |   PMC1  |   312525300 |
|    L2_LINES_IN_ALL    |   PMC2  |   312529800 |

Therefore, I would advocate keeping the succinctly named L2_REQUEST_MISS event.
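Plugging the measured counts into the miss ratio makes the difference visible:

```
L2_REQUEST_MISS / L2_REQUEST_ALL = 312525300 / 312527000 ≈ 0.99999   (<= 100%)
L2_LINES_IN_ALL / L2_REQUEST_ALL = 312529800 / 312527000 ≈ 1.00001   (> 100%)
```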

Ad 2

I read "Counts retired load instructions with at least one uop that hit in the L3 cache." for the MEM_LOAD_RETIRED.L3_HIT event (see: here similarly for the MEM_LOAD_RETIRED.L3_MISS event) and thought it would count only once for each instruction and not multiple times if the instructions decodes to multiple micro-operations. Or is this wrongly documented / easily misunderstood?

Of course, I could still add a comment about the changed basis for the rates, as done in the Icelake groups, if you want this change documented.
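A possible wording for that LONG-section note (just a draft):

```
Note: The rates in this group are given per retired instruction (INSTR_RETIRED_ANY)
using the MEM_LOAD_RETIRED events, which differ from the events Intel uses in its
own cache metrics, so the values may deviate from those metrics.
```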

TomTheBear commented 7 months ago

I would say it is not well documented. The same event code refers to uops up to Broadwell; since Skylake it refers to the instruction count according to the docs. Published metrics like this one combine the MEM_LOAD_* events with the instruction count, so it is likely that the events really count instructions now.