[ ] Need to update the L1 cache-line size to 128B for MI300+: here
UTCL1
[ ] MI300 fixes the bug where hit-on-miss isn't counted: update here
TA instruction counts
[ ] On MI300, we now (in theory) use the scratch* instructions for stack/spill access, which invalidates a lot of this section. We need to figure out how to rework it
Scalar / Instruction cache
[ ] Need to update size and how many CUs it's shared between here
- 64KB / shared between CUs on MI300
L2
L2 is no longer the coherence point on MI300+
[ ] L2<->EA request flow diagram needs to be updated for MI300
- Essentially, we need to add a 128B read request line and figure out how to represent this on the diagram
[ ] Update channel count in text for MI300 here
- 16 channels per XCC, still 256B interleaved
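The interleaving noted above can be sketched as a simple address-to-channel mapping. A minimal illustration, assuming plain round-robin interleaving at 256B granularity across 16 channels (the real hardware's address hashing may differ):

```python
# Hedged sketch (not Omniperf code): how 256B interleaving across
# 16 L2 channels per XCC could map an address to a channel.
# Assumes simple round-robin; actual hardware hashing may differ.
INTERLEAVE_BYTES = 256
CHANNELS_PER_XCC = 16

def l2_channel(address: int) -> int:
    """Return the channel index a 256B-aligned block lands on."""
    return (address // INTERLEAVE_BYTES) % CHANNELS_PER_XCC

# Consecutive 256B blocks walk the channels round-robin, wrapping
# every 16 * 256B = 4KiB.
```

Under this model, a fully strided access pattern touching one 256B block per 4KiB page would hammer a single channel, which is the kind of imbalance the channel-count text should help readers reason about.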
[ ] Update Streaming requests text to also include MI300
[ ] Update probe requests text for MI300
- Likely more involved; need to write some tests to see what triggers these: here
[ ] Update note at bottom of section to include MI300 here
- [ ] 128B cache-line there as well
[ ] L2-Fabric Write and Atomic Bandwidth
- All atomics are now counted as such on MI300, because they are not cached in L2 and must go to MALL
- Same with:
- HBM Write and Atomic Traffic
- Remote Write and Atomic Traffic
- Atomic Traffic
- Uncached Write and Atomic Traffic
Detailed transaction metrics: here
- Need to add 128B read request metric to table
Memory type
[ ] Need to update table for MI300; may need a better way to represent this, as fine-grained/coarse-grained isn't very relevant there anymore.
New concepts
[ ] Need to discuss XCC / NPS / partitioning modes somewhere. There's no super logical place to do so, but we might do this in the definitions or as a separate part of the performance model.
[ ] The key points for Omniperf are that:
- [ ] Number of CUs depends on # of XCCs active in the current partitioning mode
- [ ] Number of HBM channels per partition (and thus: the achievable L2<->EA bandwidth) depends on the NPS mode
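The two dependencies above can be sketched numerically. The totals below (XCCs, CUs per XCC, HBM channel count) are illustrative placeholders, not authoritative MI300 values:

```python
# Hedged sketch of the two Omniperf-relevant partitioning effects:
# CU count scales with active XCCs, and HBM channels (hence peak
# L2<->EA bandwidth) divide by the NPS mode.
# All constants are assumptions for illustration only.
TOTAL_HBM_CHANNELS = 128  # assumed total, not an MI300 spec value
CUS_PER_XCC = 38          # assumed per-XCC CU count

def cus_in_partition(active_xccs: int) -> int:
    """CUs visible to a partition scale with its active XCCs."""
    return active_xccs * CUS_PER_XCC

def channels_per_partition(nps_mode: int) -> int:
    """NPS-N divides HBM channels (and bandwidth) evenly N ways."""
    return TOTAL_HBM_CHANNELS // nps_mode
```

The point for the docs is just that both quantities are partition-dependent, so "peak" figures in the text can't be stated as single constants.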
[ ] Need to discuss MALL as coherence point somewhere
[ ] Neither of the above needs to be covered in significant detail, IMO
[ ] Neither of these has specific metrics tied to it, but both are important for understanding how we're presenting data
demo build: https://advanced-micro-devices-demo--446.com.readthedocs.build/projects/omniperf/en/446/
Performance model
Pipeline descriptions
VALU
AGPRs
Pipeline metrics
L1
UTCL1
TA instruction counts
Scalar / Instruction cache
L2
Memory type
New concepts
References