[ ] Need to update the L1 cache-line size to 128B for MI300+: here
UTCL1
[ ] MI300 fixes the bug where hit-on-miss isn't counted: update here
TA instruction counts
[ ] On MI300, we now (in theory) use the scratch* instructions for stack/spill access, which invalidates a lot of this section. We need to figure out how to rework it
Scalar / Instruction cache
[ ] Need to update size and how many CUs it's shared between here
- 64KB / shared between CUs on MI300
L2
L2 is no longer the coherence point on MI300+
[ ] L2<->EA request flow diagram needs to be updated for MI300
- Essentially, we need to add a 128B read request line and figure out how to represent this on the diagram
[ ] Update channel count in text for MI300 here
- 16 channels per XCC, still 256B interleaved
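The interleaving noted above can be sketched as a simple address-to-channel mapping. A minimal illustration, assuming plain round-robin interleaving at 256B granularity across 16 channels (the real hardware's address hashing may differ):

```python
# Hedged sketch (not Omniperf code): how 256B interleaving across
# 16 L2 channels per XCC could map an address to a channel.
# Assumes simple round-robin; actual hardware hashing may differ.
INTERLEAVE_BYTES = 256
CHANNELS_PER_XCC = 16

def l2_channel(address: int) -> int:
    """Return the channel index a 256B-aligned block lands on."""
    return (address // INTERLEAVE_BYTES) % CHANNELS_PER_XCC

# Consecutive 256B blocks walk the channels round-robin, wrapping
# every 16 * 256B = 4KiB.
```

Under this model, a fully strided access pattern touching one 256B block per 4KiB page would hammer a single channel, which is the kind of imbalance the channel-count text should help readers reason about.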
[ ] Update Streaming requests text to also include MI300
[ ] Update probe requests text for MI300
- Likely more involved; need to write some tests to see what triggers these: here
[ ] Update note at bottom of section to include MI300 here
- [ ] 128B cache-line there as well
[ ] L2-Fabric Write and Atomic Bandwidth
- All atomics are now counted as such on MI300, because they are not cached in L2 and must go to MALL
- Same with:
- HBM Write and Atomic Traffic
- Remote Write and Atomic Traffic
- Atomic Traffic
- Uncached Write and Atomic Traffic
Detailed transaction metrics: here
- Need to add 128B read request metric to table
Memory type
[ ] Need to update table for MI300; may need a better way to represent this, as fine-grained/coarse-grained isn't very relevant there anymore.
New concepts
[ ] Need to discuss XCC / NPS / partitioning modes somewhere. There's no super logical place to do so, but we might do this in the definitions or as a separate part of the performance model.
[ ] The key points for Omniperf are that:
- [ ] Number of CUs depends on # of XCCs active in the current partitioning mode
- [ ] Number of HBM channels per partition (and thus: the achievable L2<->EA bandwidth) depends on the NPS mode
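The two dependencies above can be sketched numerically. The totals below (XCCs, CUs per XCC, HBM channel count) are illustrative placeholders, not authoritative MI300 values:

```python
# Hedged sketch of the two Omniperf-relevant partitioning effects:
# CU count scales with active XCCs, and HBM channels (hence peak
# L2<->EA bandwidth) divide by the NPS mode.
# All constants are assumptions for illustration only.
TOTAL_HBM_CHANNELS = 128  # assumed total, not an MI300 spec value
CUS_PER_XCC = 38          # assumed per-XCC CU count

def cus_in_partition(active_xccs: int) -> int:
    """CUs visible to a partition scale with its active XCCs."""
    return active_xccs * CUS_PER_XCC

def channels_per_partition(nps_mode: int) -> int:
    """NPS-N divides HBM channels (and bandwidth) evenly N ways."""
    return TOTAL_HBM_CHANNELS // nps_mode
```

The point for the docs is just that both quantities are partition-dependent, so "peak" figures in the text can't be stated as single constants.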
[ ] Need to discuss MALL as coherence point somewhere
[ ] Neither of the above needs to be covered in significant detail, IMO
[ ] Neither of these has specific metrics tied to it, but both are important for understanding how we're presenting data
demo build: https://advanced-micro-devices-demo--446.com.readthedocs.build/projects/omniperf/en/446/
Performance model
Pipeline descriptions
VALU
AGPRs
Pipeline metrics
L1
UTCL1
TA instruction counts
Scalar / Instruction cache
L2
Memory type
New concepts
References