philipturner closed this issue 1 year ago.
I'm afraid I don't follow exactly what your hypothesis is. I'm also not convinced by a number of things in your README.
I'll try to describe where I'm at. First, some terminology:
There's a 1:1 correspondence between CPU clusters and AMX clusters, and on die shots you'll see them colocated, along with a bunch of L2. Note that the clock speed of the AMX cluster needn't equal the clock speed of the associated CPU cluster.
To satisfy the needs of the ISA, each AMX cluster needs to contain X, Y, and Z register state for every CPU core in the associated CPU cluster, along with a grid of compute cells.
The X and Y register files (or a combined X and Y register file) will be separate from the AMX cells, but the Z register file is likely split up and colocated in the AMX cells. The more AMX cells there are in a cluster, the smaller the amount of Z register file in each cell. If you had an 8x8 grid of cells, then you'd only need 64 bytes of Z (per CPU core) per cell. If you had an 8x2 grid of cells, then you'd need 256 bytes of Z (per CPU core) per cell. In particular, this could manifest itself as E AMX clusters having fewer cells than P AMX clusters, but each E cell being slightly larger than a P cell.
If you had an 8x8 grid of cells, then an entire AMX matrix instruction could be dispatched at once, whereas an 8x2 or 4x4 grid would require that each matrix instruction be split up into four pieces. If considering vector instructions, a 4x4 grid would require that vector instructions be split up into two pieces, whereas no splitting would be required for an 8x2 grid.
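The splitting arithmetic above can be sketched as a toy model (grids written as width x height; the assumption that a vector instruction maps onto one 8-wide row of cells is mine, inferred from the piece counts above):

```python
# Toy model of AMX instruction splitting as a function of cell-grid layout.
# A full matrix instruction covers an 8x8 grid of results; a vector
# instruction is assumed to cover one 8-wide row.
def matrix_pieces(width, height):
    return (8 * 8) // (width * height)

def vector_pieces(width, height):
    return 8 // width

print(matrix_pieces(8, 8), vector_pieces(8, 8))  # 8x8 grid: no splitting
print(matrix_pieces(4, 4), vector_pieces(4, 4))  # 4x4: matrix in 4, vector in 2
print(matrix_pieces(8, 2), vector_pieces(8, 2))  # 8x2: matrix in 4, vector whole
```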
Each cell contains an F64 FMA circuit. Said circuit can be split up and used as four separate F32 FMA circuits, or split up in other ways for F16/BF16/integer-multiply-accumulate. That circuit might be split into four pipeline stages (e.g. 4 cycle compute latency), or at least the path from addend input to FMA output is four cycles. To fully saturate this circuit, each cell needs to be tasked with an F64 FMA per cycle (or four F32 FMAs, or ...). Given that the latency is 4 cycles, four distinct ranges of Z are required. If a matrix instruction refers to all 4096 bytes of Z, then multiple CPU cores need to be in play. This is why my performance tables have threads on one axis, and Z Accumulators per thread on the other axis, as both are routes to getting more Z in play, and sometimes you're constrained by Z before you're constrained by FMA circuits. Of course, adding threads can also put more AMX clusters in play (though you're at the whims of the scheduler as to whether your threads end up on different clusters or not).
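That saturation argument can be sketched as a toy throughput model; the 64-cell count and 3.2 GHz clock below are illustrative placeholders, not measured values:

```python
# Each cell can retire one F64 FMA per cycle, but the accumulate path is
# ~4 cycles, so fewer than 4 independent Z accumulators in flight leaves
# pipeline bubbles.
def utilization(z_accumulators, latency=4):
    return min(1.0, z_accumulators / latency)

def peak_gflops(cells, clock_ghz, z_accumulators, latency=4):
    # 2 FLOPs per FMA (multiply + add)
    return cells * 2 * clock_ghz * utilization(z_accumulators, latency)

# Illustrative: a hypothetical 64-cell cluster at a hypothetical 3.2 GHz
# scales linearly up to 4 accumulators, then flattens.
for z in (1, 2, 4, 8):
    print(z, peak_gflops(64, 3.2, z))
```

This reproduces the qualitative shape of the tables below: linear gains up to 4 Z accumulators per thread, then a plateau.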
The CPU cluster only sends instructions to the AMX cluster, not data. The ALUs and register files on the CPU cores are basically irrelevant to the AMX cluster. Data has to move via memory, and in particular via L2. The interesting question is how much bandwidth there is between the AMX cluster's register files and L2. A secondary question is how quickly CPU cores can enqueue AMX instructions - for M1 that means store ports, of which there are 2 per P core, and 1 per E core. Instruction fusion might let you enqueue two AMX instructions per port per cycle (again note that the dequeue needn't happen in the same clock domain).
If there's a major change in AMX between M1 and M2, I'd expect it to be in the layout of the E cells in the E AMX cluster, potentially switching from a 4x4 layout to an 8x2 layout (note that this is the logical layout, which needn't directly or exactly correspond to the physical layout). This wouldn't change the number of E cells in the cluster, but would improve performance for vector operations. P AMX may well be 8x8 cells in both M1 and M2, with no major change there (except that said 8x8 might now be 4 copies of 8x2 rather than 4 copies of 4x4). Performance in most other ways would be mostly unchanged.
Thanks for the explanation! As stated at the top of my README, I think my current explanation is misleading - I just haven't had the time to fix it. I still have some questions.
Have you seen the M2 Pro die shot? The P-AMX looks physically very different from M1's, at almost double the area. That made me suspect Apple doubled performance with that generation (3.3 TFLOPS FP32 -> 7.3 TFLOPS FP32), and that my M1 Max was seriously behind the performance of M2 Max. Hopefully this is not true.
M1 P-AMX: 162 x 75 pixels = 12150 pixels^2, 4 rectangles
M2 Pro P-AMX: 120 x 185 pixels x 8/9 = 19733 pixels^2, 8 rectangles
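A quick sanity check of those pixel measurements (assuming the first line is M1 and the second M2 Pro, as the surrounding text suggests):

```python
# Die-shot area arithmetic from the measurements above.
m1_area = 162 * 75                # 12150 px^2, 4 rectangles
m2_pro_area = 120 * 185 * 8 / 9   # ~19733 px^2, 8 rectangles

print(m2_pro_area / m1_area)         # ~1.62x total area, not quite double
print(m1_area / 4, m2_pro_area / 8)  # per-rectangle: ~3038 vs ~2467 px^2
```

So the total area grows ~1.62x, while each individual rectangle shrinks - consistent with twice as many slightly smaller cells rather than a doubling of everything.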
Second, the vector throughput. M2 supports performing a vector instruction on 4 registers at once, while M1 only supports 1 register. Does the M2 have quadruple the vector throughput for non-GEMM-like operations? Or is it just an ISA optimization with no physical performance implications?
Third, clock speed. I know that if you activate more CPU cores, the entire block's max clock speed throttles. Does this throttling affect the AMX too, as if the AMX coprocessor were a fifth CPU core?
Finally, BF16 performance. Is there any inherent reason why FP16 FLOPS cannot exceed 2x FP32 FLOPS? Perhaps the Z registers exceed capacity because too many products are formed. If FP32 data consumes 2x the space of FP16 data, that would explain the following paradigm:
Real-life code usually cannot reach 100% ALU utilization; 50-80% is common. Designing the AMX2 to hard-code 50% FP16 utilization would make sense when the metric is effective TFLOPS rather than peak TFLOPS.
And as a final touch, simply remove half of the FP16xFP16=FP16 multipliers (M1 design). Redirect FP16xFP16=FP32 through the FP32xFP32=FP32 path and you use fewer transistors (M1 design). I hypothesize that Apple redesigned the A15/M2/M2 Pro AMX (again, please look at the die shot) to (a) fix some Z-register bottleneck and (b) improve effective TFLOPS through ISA improvements. Then it would make practical sense to architect such that BF16 = 4x FP32, not 2x.
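A toy model of that trade-off (the 80% and 50% utilization figures below are illustrative picks from the 50-80% range above, not measurements):

```python
# Effective throughput = peak * achievable utilization. A design with
# FP16 peak = 4x FP32 peak, but only ~50% achievable FP16 utilization,
# can still beat 2x FP32's effective throughput.
def effective_tflops(peak_tflops, achievable_utilization):
    return peak_tflops * achievable_utilization

fp32_eff = effective_tflops(1.0, 0.8)  # hypothetical FP32: 1.0 TFLOPS peak, 80% util
fp16_eff = effective_tflops(4.0, 0.5)  # hypothetical FP16: 4x peak, 50% util
print(fp16_eff / fp32_eff)             # ~2.5x effective speedup over FP32
```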
For matfp FP64xFP64=FP64, I'm seeing the following on M1 Max:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (512 bytes) per thread | 92.8 GFLOPS | 185.5 GFLOPS | 214.0 GFLOPS | 285.0 GFLOPS | 381.3 GFLOPS | 394.0 GFLOPS | |
2 (1024 bytes) per thread | 185.4 GFLOPS | 370.7 GFLOPS | 337.4 GFLOPS | 474.0 GFLOPS | 658.9 GFLOPS | 610.7 GFLOPS | |
3 (1536 bytes) per thread | 278.0 GFLOPS | 556.5 GFLOPS | 472.8 GFLOPS | 572.9 GFLOPS | 646.6 GFLOPS | 718.9 GFLOPS | |
4 (2048 bytes) per thread | 370.8 GFLOPS | 742.0 GFLOPS | 610.4 GFLOPS | 730.0 GFLOPS | 747.2 GFLOPS | 772.0 GFLOPS | |
5 (2560 bytes) per thread | 370.9 GFLOPS | 742.1 GFLOPS | 610.6 GFLOPS | 745.9 GFLOPS | 700.9 GFLOPS | 731.3 GFLOPS | |
6 (3072 bytes) per thread | 371.0 GFLOPS | 741.3 GFLOPS | 608.8 GFLOPS | 730.9 GFLOPS | 727.4 GFLOPS | 735.6 GFLOPS | |
7 (3584 bytes) per thread | 370.7 GFLOPS | 740.9 GFLOPS | 610.4 GFLOPS | 752.7 GFLOPS | 700.4 GFLOPS | 769.8 GFLOPS | |
8 (4096 bytes) per thread | 370.9 GFLOPS | 741.2 GFLOPS | 651.1 GFLOPS | 745.1 GFLOPS | 796.2 GFLOPS | 780.4 GFLOPS |
On M2, the same thing is:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (512 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 246.9 GFLOPS | 231.2 GFLOPS | 288.9 GFLOPS | 318.3 GFLOPS | |
2 (1024 bytes) per thread | 204.6 GFLOPS | 252.5 GFLOPS | 378.7 GFLOPS | 354.2 GFLOPS | 384.7 GFLOPS | 436.2 GFLOPS | |
3 (1536 bytes) per thread | 306.9 GFLOPS | 351.4 GFLOPS | 434.3 GFLOPS | 421.4 GFLOPS | 438.6 GFLOPS | 464.0 GFLOPS | |
4 (2048 bytes) per thread | 409.2 GFLOPS | 452.2 GFLOPS | 468.8 GFLOPS | 472.5 GFLOPS | 476.5 GFLOPS | 479.4 GFLOPS | |
5 (2560 bytes) per thread | 409.2 GFLOPS | 452.2 GFLOPS | 468.8 GFLOPS | 472.6 GFLOPS | 476.6 GFLOPS | 479.4 GFLOPS | |
6 (3072 bytes) per thread | 409.2 GFLOPS | 452.2 GFLOPS | 468.8 GFLOPS | 472.6 GFLOPS | 476.5 GFLOPS | 479.4 GFLOPS | |
7 (3584 bytes) per thread | 409.2 GFLOPS | 452.1 GFLOPS | 468.8 GFLOPS | 472.6 GFLOPS | 476.6 GFLOPS | 479.4 GFLOPS | |
8 (4096 bytes) per thread | 409.2 GFLOPS | 452.2 GFLOPS | 468.8 GFLOPS | 472.5 GFLOPS | 476.4 GFLOPS | 479.3 GFLOPS |
The "1 Thread" column sees a ~10% uplift in performance, consistent with M2 clocks being 10% higher than M1. M1 Max gets ~100% improvement going from 1 thread to 2, which is consistent with M1 Max having two P clusters. The dip when going to 3 threads is likely a consequence of the same thing, with there being no good scheduling of 3 threads onto two AMX clusters. M2 sees only small gains from increasing thread count, as a single thread is able to almost saturate the entire AMX cluster.
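These ratios can be checked directly against the "4 accumulators per thread" row of the tables above:

```python
# Measured matfp FP64 GFLOPS at 4 Z accumulators per thread, taken from
# the tables above.
m1_max = {1: 370.8, 2: 742.0}  # threads -> GFLOPS
m2 = {1: 409.2, 2: 452.2}

print(m2[1] / m1_max[1])      # ~1.10: single-thread uplift, about the clock bump
print(m1_max[2] / m1_max[1])  # ~2.00: two threads, two P-AMX clusters
print(m2[2] / m2[1])          # ~1.11: one cluster already nearly saturated
```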
Going down to matfp FP32xFP32=FP32, I'm seeing this on M1 Max:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (1024 bytes) per thread | 370.9 GFLOPS | 741.8 GFLOPS | 857.0 GFLOPS | 1264.1 GFLOPS | 1425.0 GFLOPS | 1354.2 GFLOPS | |
2 (2048 bytes) per thread | 742.5 GFLOPS | 1484.2 GFLOPS | 1349.1 GFLOPS | 1800.1 GFLOPS | 2249.9 GFLOPS | 2389.8 GFLOPS | |
3 (3072 bytes) per thread | 1112.9 GFLOPS | 2224.5 GFLOPS | 1891.6 GFLOPS | 2521.1 GFLOPS | 2591.9 GFLOPS | 2879.2 GFLOPS | |
4 (4096 bytes) per thread | 1482.9 GFLOPS | 2967.4 GFLOPS | 2442.5 GFLOPS | 3102.8 GFLOPS | 2806.2 GFLOPS | 3007.9 GFLOPS |
And on M2:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (1024 bytes) per thread | 409.2 GFLOPS | 658.5 GFLOPS | 987.5 GFLOPS | 925.6 GFLOPS | 1155.2 GFLOPS | 1272.9 GFLOPS | |
2 (2048 bytes) per thread | 818.4 GFLOPS | 1009.8 GFLOPS | 1514.6 GFLOPS | 1419.8 GFLOPS | 1542.8 GFLOPS | 1743.8 GFLOPS | |
3 (3072 bytes) per thread | 1227.5 GFLOPS | 1405.5 GFLOPS | 1737.0 GFLOPS | 1687.2 GFLOPS | 1755.1 GFLOPS | 1856.0 GFLOPS | |
4 (4096 bytes) per thread | 1636.9 GFLOPS | 1808.5 GFLOPS | 1874.9 GFLOPS | 1889.7 GFLOPS | 1907.3 GFLOPS | 1916.7 GFLOPS |
FP32 is getting 4x the performance of FP64. Other than that, all the previous remarks apply here basically verbatim.
Going down to FP16xFP16=FP32, M1 Max:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (4096 bytes) per thread | 1483.7 GFLOPS | 2967.8 GFLOPS | 2272.4 GFLOPS | 2457.5 GFLOPS | 2808.7 GFLOPS | 2586.7 GFLOPS |
M2:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (4096 bytes) per thread | 1636.8 GFLOPS | 1639.4 GFLOPS | 1638.6 GFLOPS | 1397.1 GFLOPS | 1703.9 GFLOPS | 1704.0 GFLOPS |
Similar performance to FP32xFP32=FP32.
On M2, we can also do BF16xBF16=FP32:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (4096 bytes) per thread | 1636.7 GFLOPS | 1639.5 GFLOPS | 1638.5 GFLOPS | 1397.2 GFLOPS | 1703.8 GFLOPS | 1704.0 GFLOPS |
Same performance as FP16.
Going down to FP16xFP16=FP16, M1 Max:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (2048 bytes) per thread | 1484.0 GFLOPS | 2965.0 GFLOPS | 2693.9 GFLOPS | 3599.1 GFLOPS | 4543.6 GFLOPS | 5124.9 GFLOPS |
2 (4096 bytes) per thread | 2967.1 GFLOPS | 5926.1 GFLOPS | 4881.3 GFLOPS | 6189.6 GFLOPS | 6072.7 GFLOPS | 5337.5 GFLOPS |
M2:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (2048 bytes) per thread | 1637.0 GFLOPS | 2026.2 GFLOPS | 3029.0 GFLOPS | 2867.8 GFLOPS | 3384.8 GFLOPS | 3408.3 GFLOPS | |
2 (4096 bytes) per thread | 3272.8 GFLOPS | 3614.3 GFLOPS | 3751.6 GFLOPS | 2905.1 GFLOPS | 3394.9 GFLOPS | 3407.8 GFLOPS |
Twice the performance of FP32.
On M2, we can also do BF16xBF16=BF16:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (2048 bytes) per thread | 1636.8 GFLOPS | 2026.1 GFLOPS | 3028.9 GFLOPS | 2865.6 GFLOPS | 3385.4 GFLOPS | 3407.3 GFLOPS | |
2 (4096 bytes) per thread | 3273.6 GFLOPS | 3613.7 GFLOPS | 3750.4 GFLOPS | 2915.7 GFLOPS | 3395.3 GFLOPS | 3406.4 GFLOPS |
Same performance as FP16.
For vecfp FP64xFP64=FP64, I'm seeing the following on M1 Max:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (64 bytes) per thread | 11.6 GFLOPS | 23.2 GFLOPS | 26.7 GFLOPS | 39.4 GFLOPS | 44.3 GFLOPS | 52.0 GFLOPS | |
2 (128 bytes) per thread | 23.2 GFLOPS | 46.4 GFLOPS | 53.5 GFLOPS | 71.2 GFLOPS | 88.9 GFLOPS | 102.9 GFLOPS | |
3 (192 bytes) per thread | 34.7 GFLOPS | 69.5 GFLOPS | 80.1 GFLOPS | 107.0 GFLOPS | 125.4 GFLOPS | 122.5 GFLOPS | |
4 (256 bytes) per thread | 46.3 GFLOPS | 92.7 GFLOPS | 106.6 GFLOPS | 138.8 GFLOPS | 165.0 GFLOPS | 147.8 GFLOPS | |
5 (320 bytes) per thread | 58.0 GFLOPS | 115.9 GFLOPS | 120.3 GFLOPS | 147.7 GFLOPS | 180.3 GFLOPS | 175.4 GFLOPS | |
6 (384 bytes) per thread | 69.5 GFLOPS | 138.8 GFLOPS | 135.9 GFLOPS | 182.6 GFLOPS | 192.8 GFLOPS | 193.8 GFLOPS | |
7 (448 bytes) per thread | 81.1 GFLOPS | 162.2 GFLOPS | 152.1 GFLOPS | 192.3 GFLOPS | 198.8 GFLOPS | 202.2 GFLOPS | |
8 (512 bytes) per thread | 92.8 GFLOPS | 185.1 GFLOPS | 168.4 GFLOPS | 201.3 GFLOPS | 201.1 GFLOPS | 211.2 GFLOPS | |
9 (576 bytes) per thread | 89.0 GFLOPS | 177.3 GFLOPS | 162.3 GFLOPS | 198.6 GFLOPS | 199.9 GFLOPS | 206.6 GFLOPS | |
10 (640 bytes) per thread | 91.9 GFLOPS | 181.3 GFLOPS | 165.6 GFLOPS | 199.3 GFLOPS | 200.7 GFLOPS | 205.9 GFLOPS | |
11 (704 bytes) per thread | 91.4 GFLOPS | 181.6 GFLOPS | 166.4 GFLOPS | 202.1 GFLOPS | 199.2 GFLOPS | 198.5 GFLOPS | |
12 (768 bytes) per thread | 92.7 GFLOPS | 185.0 GFLOPS | 167.8 GFLOPS | 203.2 GFLOPS | 201.7 GFLOPS | 208.8 GFLOPS | |
13 (832 bytes) per thread | 92.8 GFLOPS | 185.3 GFLOPS | 168.3 GFLOPS | 187.4 GFLOPS | 200.6 GFLOPS | 208.3 GFLOPS | |
14 (896 bytes) per thread | 92.8 GFLOPS | 184.1 GFLOPS | 168.6 GFLOPS | 203.1 GFLOPS | 201.4 GFLOPS | 209.6 GFLOPS | |
15 (960 bytes) per thread | 92.7 GFLOPS | 185.4 GFLOPS | 167.1 GFLOPS | 202.7 GFLOPS | 200.6 GFLOPS | 208.1 GFLOPS | |
16 (1024 bytes) per thread | 92.7 GFLOPS | 185.0 GFLOPS | 168.6 GFLOPS | 202.4 GFLOPS | 204.0 GFLOPS | 209.0 GFLOPS |
And on M2:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (64 bytes) per thread | 12.8 GFLOPS | 20.6 GFLOPS | 30.9 GFLOPS | 39.3 GFLOPS | 49.1 GFLOPS | 58.9 GFLOPS | |
2 (128 bytes) per thread | 25.6 GFLOPS | 41.1 GFLOPS | 61.7 GFLOPS | 78.6 GFLOPS | 87.1 GFLOPS | 85.2 GFLOPS | |
3 (192 bytes) per thread | 38.4 GFLOPS | 61.7 GFLOPS | 89.6 GFLOPS | 95.2 GFLOPS | 108.1 GFLOPS | 107.5 GFLOPS | |
4 (256 bytes) per thread | 51.2 GFLOPS | 82.3 GFLOPS | 118.1 GFLOPS | 115.8 GFLOPS | 129.3 GFLOPS | 132.0 GFLOPS | |
5 (320 bytes) per thread | 63.9 GFLOPS | 102.9 GFLOPS | 139.6 GFLOPS | 132.4 GFLOPS | 143.0 GFLOPS | 145.3 GFLOPS | |
6 (384 bytes) per thread | 76.7 GFLOPS | 113.1 GFLOPS | 149.8 GFLOPS | 143.2 GFLOPS | 151.6 GFLOPS | 154.9 GFLOPS | |
7 (448 bytes) per thread | 89.5 GFLOPS | 123.5 GFLOPS | 150.3 GFLOPS | 146.0 GFLOPS | 152.0 GFLOPS | 154.2 GFLOPS | |
8 (512 bytes) per thread | 102.3 GFLOPS | 134.8 GFLOPS | 150.8 GFLOPS | 150.6 GFLOPS | 154.0 GFLOPS | 155.1 GFLOPS | |
9 (576 bytes) per thread | 102.3 GFLOPS | 135.2 GFLOPS | 151.5 GFLOPS | 149.6 GFLOPS | 153.0 GFLOPS | 153.9 GFLOPS | |
10 (640 bytes) per thread | 102.3 GFLOPS | 135.3 GFLOPS | 151.6 GFLOPS | 150.6 GFLOPS | 154.0 GFLOPS | 154.5 GFLOPS | |
11 (704 bytes) per thread | 102.2 GFLOPS | 138.3 GFLOPS | 154.9 GFLOPS | 150.6 GFLOPS | 153.6 GFLOPS | 154.8 GFLOPS | |
12 (768 bytes) per thread | 102.3 GFLOPS | 135.2 GFLOPS | 151.5 GFLOPS | 150.7 GFLOPS | 154.4 GFLOPS | 154.9 GFLOPS | |
13 (832 bytes) per thread | 102.3 GFLOPS | 137.8 GFLOPS | 154.3 GFLOPS | 150.5 GFLOPS | 153.6 GFLOPS | 155.4 GFLOPS | |
14 (896 bytes) per thread | 102.3 GFLOPS | 135.1 GFLOPS | 151.5 GFLOPS | 150.5 GFLOPS | 154.5 GFLOPS | 155.4 GFLOPS | |
15 (960 bytes) per thread | 102.3 GFLOPS | 137.8 GFLOPS | 154.2 GFLOPS | 150.6 GFLOPS | 153.7 GFLOPS | 154.8 GFLOPS | |
16 (1024 bytes) per thread | 102.3 GFLOPS | 135.2 GFLOPS | 151.5 GFLOPS | 150.8 GFLOPS | 154.1 GFLOPS | 155.4 GFLOPS |
No surprises hiding here.
M2 can dispatch 2 iterations at once:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (128 bytes) per thread | 25.6 GFLOPS | 41.1 GFLOPS | 61.7 GFLOPS | 78.8 GFLOPS | 98.3 GFLOPS | 117.6 GFLOPS | |
2 (256 bytes) per thread | 51.1 GFLOPS | 82.3 GFLOPS | 123.4 GFLOPS | 157.5 GFLOPS | 174.2 GFLOPS | 169.9 GFLOPS | |
3 (384 bytes) per thread | 76.7 GFLOPS | 123.4 GFLOPS | 156.6 GFLOPS | 169.2 GFLOPS | 177.8 GFLOPS | 176.7 GFLOPS | |
4 (512 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.6 GFLOPS | 176.7 GFLOPS | 178.0 GFLOPS | 177.6 GFLOPS | |
5 (640 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 176.7 GFLOPS | 179.4 GFLOPS | 177.5 GFLOPS | |
6 (768 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 176.7 GFLOPS | 179.3 GFLOPS | 177.6 GFLOPS | |
7 (896 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 176.7 GFLOPS | 179.4 GFLOPS | 178.1 GFLOPS | |
8 (1024 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 178.0 GFLOPS | 178.6 GFLOPS | 177.7 GFLOPS | |
9 (1152 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.6 GFLOPS | 177.9 GFLOPS | 179.4 GFLOPS | 178.4 GFLOPS | |
10 (1280 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.8 GFLOPS | 177.8 GFLOPS | 179.4 GFLOPS | 179.2 GFLOPS | |
11 (1408 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.6 GFLOPS | 176.7 GFLOPS | 179.3 GFLOPS | 178.3 GFLOPS | |
12 (1536 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.7 GFLOPS | 176.7 GFLOPS | 179.3 GFLOPS | 177.9 GFLOPS | |
13 (1664 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 176.7 GFLOPS | 179.3 GFLOPS | 177.6 GFLOPS | |
14 (1792 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.6 GFLOPS | 176.8 GFLOPS | 179.3 GFLOPS | 177.6 GFLOPS | |
15 (1920 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 176.6 GFLOPS | 179.3 GFLOPS | 178.9 GFLOPS | |
16 (2048 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 176.7 GFLOPS | 179.3 GFLOPS | 178.7 GFLOPS |
No gains to peak FLOPS to be found here, nor for 4 iterations at once:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (256 bytes) per thread | 51.1 GFLOPS | 82.5 GFLOPS | 123.5 GFLOPS | 93.2 GFLOPS | 126.2 GFLOPS | 152.9 GFLOPS | |
2 (512 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 102.7 GFLOPS | 163.2 GFLOPS | 159.9 GFLOPS | |
3 (768 bytes) per thread | 102.3 GFLOPS | 163.5 GFLOPS | 175.8 GFLOPS | 102.9 GFLOPS | 163.2 GFLOPS | 162.3 GFLOPS | |
4 (1024 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.7 GFLOPS | 104.9 GFLOPS | 136.8 GFLOPS | 163.1 GFLOPS | |
5 (1280 bytes) per thread | 102.3 GFLOPS | 163.5 GFLOPS | 175.7 GFLOPS | 103.9 GFLOPS | 161.8 GFLOPS | 163.9 GFLOPS | |
6 (1536 bytes) per thread | 102.3 GFLOPS | 163.4 GFLOPS | 175.7 GFLOPS | 102.9 GFLOPS | 163.3 GFLOPS | 163.4 GFLOPS | |
7 (1792 bytes) per thread | 102.3 GFLOPS | 163.5 GFLOPS | 175.7 GFLOPS | 104.9 GFLOPS | 137.1 GFLOPS | 158.5 GFLOPS | |
8 (2048 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.8 GFLOPS | 102.8 GFLOPS | 163.3 GFLOPS | 162.1 GFLOPS | |
9 (2304 bytes) per thread | 102.3 GFLOPS | 164.4 GFLOPS | 175.7 GFLOPS | 104.0 GFLOPS | 163.3 GFLOPS | 162.9 GFLOPS | |
10 (2560 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.7 GFLOPS | 102.5 GFLOPS | 163.3 GFLOPS | 163.1 GFLOPS | |
11 (2816 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.6 GFLOPS | 104.0 GFLOPS | 163.3 GFLOPS | 162.3 GFLOPS | |
12 (3072 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.7 GFLOPS | 103.7 GFLOPS | 163.3 GFLOPS | 160.8 GFLOPS | |
13 (3328 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 175.6 GFLOPS | 102.8 GFLOPS | 163.2 GFLOPS | 162.7 GFLOPS | |
14 (3584 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 176.1 GFLOPS | 103.9 GFLOPS | 161.6 GFLOPS | 162.9 GFLOPS | |
15 (3840 bytes) per thread | 102.3 GFLOPS | 164.4 GFLOPS | 175.7 GFLOPS | 103.3 GFLOPS | 162.8 GFLOPS | 162.7 GFLOPS | |
16 (4096 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 175.5 GFLOPS | 102.6 GFLOPS | 163.1 GFLOPS | 162.5 GFLOPS |
Going down to FP32xFP32=FP32, M1 Max:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (64 bytes) per thread | 23.2 GFLOPS | 46.4 GFLOPS | 53.3 GFLOPS | 81.1 GFLOPS | 89.0 GFLOPS | 104.1 GFLOPS | |
2 (128 bytes) per thread | 46.4 GFLOPS | 92.7 GFLOPS | 106.5 GFLOPS | 141.3 GFLOPS | 176.8 GFLOPS | 206.5 GFLOPS | |
3 (192 bytes) per thread | 69.6 GFLOPS | 139.1 GFLOPS | 160.1 GFLOPS | 213.3 GFLOPS | 250.6 GFLOPS | 244.9 GFLOPS | |
4 (256 bytes) per thread | 92.7 GFLOPS | 185.4 GFLOPS | 214.0 GFLOPS | 277.6 GFLOPS | 325.5 GFLOPS | 298.0 GFLOPS | |
5 (320 bytes) per thread | 115.8 GFLOPS | 231.7 GFLOPS | 241.0 GFLOPS | 321.3 GFLOPS | 355.1 GFLOPS | 347.7 GFLOPS | |
6 (384 bytes) per thread | 139.0 GFLOPS | 277.7 GFLOPS | 271.2 GFLOPS | 361.7 GFLOPS | 387.1 GFLOPS | 386.2 GFLOPS | |
7 (448 bytes) per thread | 162.2 GFLOPS | 324.2 GFLOPS | 299.9 GFLOPS | 383.4 GFLOPS | 394.0 GFLOPS | 400.9 GFLOPS | |
8 (512 bytes) per thread | 185.5 GFLOPS | 369.9 GFLOPS | 335.8 GFLOPS | 392.9 GFLOPS | 405.8 GFLOPS | 416.0 GFLOPS | |
9 (576 bytes) per thread | 178.0 GFLOPS | 353.4 GFLOPS | 325.5 GFLOPS | 396.9 GFLOPS | 398.0 GFLOPS | 409.2 GFLOPS | |
10 (640 bytes) per thread | 183.1 GFLOPS | 360.6 GFLOPS | 335.3 GFLOPS | 402.4 GFLOPS | 401.2 GFLOPS | 417.2 GFLOPS | |
11 (704 bytes) per thread | 183.1 GFLOPS | 363.0 GFLOPS | 334.2 GFLOPS | 403.2 GFLOPS | 400.6 GFLOPS | 415.8 GFLOPS | |
12 (768 bytes) per thread | 185.2 GFLOPS | 370.6 GFLOPS | 335.5 GFLOPS | 378.5 GFLOPS | 397.7 GFLOPS | 419.0 GFLOPS | |
13 (832 bytes) per thread | 185.2 GFLOPS | 369.4 GFLOPS | 336.0 GFLOPS | 404.2 GFLOPS | 400.9 GFLOPS | 414.1 GFLOPS | |
14 (896 bytes) per thread | 185.5 GFLOPS | 370.5 GFLOPS | 336.4 GFLOPS | 406.0 GFLOPS | 402.9 GFLOPS | 416.4 GFLOPS | |
15 (960 bytes) per thread | 185.5 GFLOPS | 370.0 GFLOPS | 336.8 GFLOPS | 405.7 GFLOPS | 402.6 GFLOPS | 409.6 GFLOPS | |
16 (1024 bytes) per thread | 185.4 GFLOPS | 370.4 GFLOPS | 336.3 GFLOPS | 406.0 GFLOPS | 399.7 GFLOPS | 405.3 GFLOPS |
M2:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (64 bytes) per thread | 25.6 GFLOPS | 41.2 GFLOPS | 61.7 GFLOPS | 78.7 GFLOPS | 98.4 GFLOPS | 117.7 GFLOPS | |
2 (128 bytes) per thread | 51.2 GFLOPS | 82.3 GFLOPS | 123.5 GFLOPS | 157.7 GFLOPS | 174.1 GFLOPS | 170.4 GFLOPS | |
3 (192 bytes) per thread | 76.7 GFLOPS | 123.4 GFLOPS | 179.5 GFLOPS | 191.0 GFLOPS | 216.9 GFLOPS | 215.1 GFLOPS | |
4 (256 bytes) per thread | 102.2 GFLOPS | 164.6 GFLOPS | 237.1 GFLOPS | 231.8 GFLOPS | 258.3 GFLOPS | 263.3 GFLOPS | |
5 (320 bytes) per thread | 127.8 GFLOPS | 205.7 GFLOPS | 279.1 GFLOPS | 264.8 GFLOPS | 285.7 GFLOPS | 289.5 GFLOPS | |
6 (384 bytes) per thread | 153.5 GFLOPS | 226.0 GFLOPS | 299.5 GFLOPS | 286.6 GFLOPS | 300.5 GFLOPS | 308.3 GFLOPS | |
7 (448 bytes) per thread | 179.0 GFLOPS | 246.6 GFLOPS | 300.6 GFLOPS | 291.4 GFLOPS | 302.4 GFLOPS | 306.2 GFLOPS | |
8 (512 bytes) per thread | 204.4 GFLOPS | 269.7 GFLOPS | 301.6 GFLOPS | 299.4 GFLOPS | 309.2 GFLOPS | 310.4 GFLOPS | |
9 (576 bytes) per thread | 204.6 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 297.9 GFLOPS | 304.7 GFLOPS | 307.3 GFLOPS | |
10 (640 bytes) per thread | 204.7 GFLOPS | 270.3 GFLOPS | 303.0 GFLOPS | 300.2 GFLOPS | 306.9 GFLOPS | 308.9 GFLOPS | |
11 (704 bytes) per thread | 204.6 GFLOPS | 276.5 GFLOPS | 308.4 GFLOPS | 302.1 GFLOPS | 305.8 GFLOPS | 307.5 GFLOPS | |
12 (768 bytes) per thread | 204.5 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 299.9 GFLOPS | 304.2 GFLOPS | 307.5 GFLOPS | |
13 (832 bytes) per thread | 204.6 GFLOPS | 275.3 GFLOPS | 307.9 GFLOPS | 299.8 GFLOPS | 306.4 GFLOPS | 307.4 GFLOPS | |
14 (896 bytes) per thread | 204.2 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 299.6 GFLOPS | 306.9 GFLOPS | 310.6 GFLOPS | |
15 (960 bytes) per thread | 204.5 GFLOPS | 275.7 GFLOPS | 308.5 GFLOPS | 299.5 GFLOPS | 305.5 GFLOPS | 307.4 GFLOPS | |
16 (1024 bytes) per thread | 204.6 GFLOPS | 270.5 GFLOPS | 302.8 GFLOPS | 299.8 GFLOPS | 306.9 GFLOPS | 307.4 GFLOPS |
M2, two at a time:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (128 bytes) per thread | 51.2 GFLOPS | 82.3 GFLOPS | 123.5 GFLOPS | 157.7 GFLOPS | 196.4 GFLOPS | 235.4 GFLOPS | |
2 (256 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 246.7 GFLOPS | 316.2 GFLOPS | 346.2 GFLOPS | 339.7 GFLOPS | |
3 (384 bytes) per thread | 153.4 GFLOPS | 246.9 GFLOPS | 313.0 GFLOPS | 338.8 GFLOPS | 355.6 GFLOPS | 348.5 GFLOPS | |
4 (512 bytes) per thread | 204.6 GFLOPS | 328.8 GFLOPS | 351.4 GFLOPS | 355.6 GFLOPS | 356.1 GFLOPS | 349.9 GFLOPS | |
5 (640 bytes) per thread | 204.5 GFLOPS | 329.0 GFLOPS | 351.2 GFLOPS | 355.9 GFLOPS | 356.1 GFLOPS | 349.0 GFLOPS | |
6 (768 bytes) per thread | 204.6 GFLOPS | 329.3 GFLOPS | 351.2 GFLOPS | 350.8 GFLOPS | 353.9 GFLOPS | 356.3 GFLOPS | |
7 (896 bytes) per thread | 204.5 GFLOPS | 329.0 GFLOPS | 346.6 GFLOPS | 350.7 GFLOPS | 356.1 GFLOPS | 357.7 GFLOPS | |
8 (1024 bytes) per thread | 204.7 GFLOPS | 329.3 GFLOPS | 351.4 GFLOPS | 353.6 GFLOPS | 358.2 GFLOPS | 354.9 GFLOPS | |
9 (1152 bytes) per thread | 204.6 GFLOPS | 328.8 GFLOPS | 351.2 GFLOPS | 346.2 GFLOPS | 358.4 GFLOPS | 349.1 GFLOPS | |
10 (1280 bytes) per thread | 204.5 GFLOPS | 329.3 GFLOPS | 351.6 GFLOPS | 351.0 GFLOPS | 354.9 GFLOPS | 355.4 GFLOPS | |
11 (1408 bytes) per thread | 204.4 GFLOPS | 328.9 GFLOPS | 351.3 GFLOPS | 350.8 GFLOPS | 358.3 GFLOPS | 348.7 GFLOPS | |
12 (1536 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 350.9 GFLOPS | 356.0 GFLOPS | 355.6 GFLOPS | |
13 (1664 bytes) per thread | 204.4 GFLOPS | 329.1 GFLOPS | 351.1 GFLOPS | 350.8 GFLOPS | 356.1 GFLOPS | 356.4 GFLOPS | |
14 (1792 bytes) per thread | 204.7 GFLOPS | 329.0 GFLOPS | 351.2 GFLOPS | 350.9 GFLOPS | 356.2 GFLOPS | 349.7 GFLOPS | |
15 (1920 bytes) per thread | 204.5 GFLOPS | 329.3 GFLOPS | 351.2 GFLOPS | 355.1 GFLOPS | 356.0 GFLOPS | 356.7 GFLOPS | |
16 (2048 bytes) per thread | 204.6 GFLOPS | 328.7 GFLOPS | 351.0 GFLOPS | 350.9 GFLOPS | 356.1 GFLOPS | 348.8 GFLOPS |
And four at a time:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (256 bytes) per thread | 102.3 GFLOPS | 164.9 GFLOPS | 247.0 GFLOPS | 187.5 GFLOPS | 250.9 GFLOPS | 305.7 GFLOPS | |
2 (512 bytes) per thread | 204.5 GFLOPS | 326.9 GFLOPS | 351.4 GFLOPS | 208.2 GFLOPS | 326.6 GFLOPS | 323.5 GFLOPS | |
3 (768 bytes) per thread | 204.6 GFLOPS | 326.8 GFLOPS | 351.4 GFLOPS | 211.6 GFLOPS | 320.5 GFLOPS | 324.9 GFLOPS | |
4 (1024 bytes) per thread | 204.6 GFLOPS | 329.3 GFLOPS | 351.3 GFLOPS | 205.7 GFLOPS | 326.6 GFLOPS | 325.6 GFLOPS | |
5 (1280 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 351.2 GFLOPS | 205.4 GFLOPS | 326.5 GFLOPS | 322.4 GFLOPS | |
6 (1536 bytes) per thread | 204.6 GFLOPS | 328.9 GFLOPS | 351.4 GFLOPS | 208.7 GFLOPS | 318.2 GFLOPS | 322.7 GFLOPS | |
7 (1792 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 205.9 GFLOPS | 326.4 GFLOPS | 324.0 GFLOPS | |
8 (2048 bytes) per thread | 204.5 GFLOPS | 329.1 GFLOPS | 351.4 GFLOPS | 208.1 GFLOPS | 326.5 GFLOPS | 321.3 GFLOPS | |
9 (2304 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 207.3 GFLOPS | 323.6 GFLOPS | 326.9 GFLOPS | |
10 (2560 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 206.2 GFLOPS | 320.8 GFLOPS | 326.7 GFLOPS | |
11 (2816 bytes) per thread | 204.5 GFLOPS | 326.9 GFLOPS | 346.5 GFLOPS | 208.2 GFLOPS | 326.3 GFLOPS | 321.5 GFLOPS | |
12 (3072 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 351.1 GFLOPS | 205.9 GFLOPS | 326.5 GFLOPS | 326.4 GFLOPS | |
13 (3328 bytes) per thread | 204.5 GFLOPS | 329.3 GFLOPS | 351.4 GFLOPS | 206.6 GFLOPS | 323.5 GFLOPS | 323.4 GFLOPS | |
14 (3584 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 205.5 GFLOPS | 326.4 GFLOPS | 323.2 GFLOPS | |
15 (3840 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.0 GFLOPS | 205.8 GFLOPS | 326.5 GFLOPS | 322.5 GFLOPS | |
16 (4096 bytes) per thread | 204.6 GFLOPS | 327.1 GFLOPS | 351.2 GFLOPS | 208.1 GFLOPS | 324.0 GFLOPS | 321.7 GFLOPS |
FP16xFP16=FP16, M1 Max:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (64 bytes) per thread | 46.3 GFLOPS | 92.7 GFLOPS | 106.7 GFLOPS | 142.6 GFLOPS | 189.5 GFLOPS | 207.8 GFLOPS | |
2 (128 bytes) per thread | 92.8 GFLOPS | 185.2 GFLOPS | 209.9 GFLOPS | 312.7 GFLOPS | 346.7 GFLOPS | 408.2 GFLOPS | |
3 (192 bytes) per thread | 139.2 GFLOPS | 277.7 GFLOPS | 317.5 GFLOPS | 424.2 GFLOPS | 507.9 GFLOPS | 482.5 GFLOPS | |
4 (256 bytes) per thread | 185.5 GFLOPS | 370.4 GFLOPS | 408.3 GFLOPS | 552.6 GFLOPS | 654.9 GFLOPS | 589.2 GFLOPS | |
5 (320 bytes) per thread | 231.5 GFLOPS | 463.4 GFLOPS | 479.3 GFLOPS | 630.2 GFLOPS | 711.5 GFLOPS | 710.2 GFLOPS | |
6 (384 bytes) per thread | 277.8 GFLOPS | 556.3 GFLOPS | 540.5 GFLOPS | 721.1 GFLOPS | 813.0 GFLOPS | 734.5 GFLOPS | |
7 (448 bytes) per thread | 324.8 GFLOPS | 647.3 GFLOPS | 607.7 GFLOPS | 769.6 GFLOPS | 789.4 GFLOPS | 802.9 GFLOPS | |
8 (512 bytes) per thread | 371.0 GFLOPS | 739.4 GFLOPS | 672.2 GFLOPS | 810.8 GFLOPS | 824.8 GFLOPS | 813.7 GFLOPS | |
9 (576 bytes) per thread | 354.5 GFLOPS | 712.8 GFLOPS | 652.8 GFLOPS | 792.6 GFLOPS | 796.6 GFLOPS | 787.1 GFLOPS | |
10 (640 bytes) per thread | 365.5 GFLOPS | 717.8 GFLOPS | 662.3 GFLOPS | 798.5 GFLOPS | 788.0 GFLOPS | 828.2 GFLOPS | |
11 (704 bytes) per thread | 365.2 GFLOPS | 731.5 GFLOPS | 670.2 GFLOPS | 799.1 GFLOPS | 806.8 GFLOPS | 825.8 GFLOPS | |
12 (768 bytes) per thread | 371.3 GFLOPS | 740.5 GFLOPS | 670.5 GFLOPS | 797.7 GFLOPS | 804.7 GFLOPS | 830.4 GFLOPS | |
13 (832 bytes) per thread | 370.7 GFLOPS | 741.2 GFLOPS | 671.3 GFLOPS | 812.3 GFLOPS | 801.4 GFLOPS | 826.5 GFLOPS | |
14 (896 bytes) per thread | 371.1 GFLOPS | 740.0 GFLOPS | 671.7 GFLOPS | 806.5 GFLOPS | 804.5 GFLOPS | 835.0 GFLOPS | |
15 (960 bytes) per thread | 370.6 GFLOPS | 740.6 GFLOPS | 671.1 GFLOPS | 805.0 GFLOPS | 804.6 GFLOPS | 831.8 GFLOPS | |
16 (1024 bytes) per thread | 369.1 GFLOPS | 737.8 GFLOPS | 816.6 GFLOPS | 808.5 GFLOPS | 807.5 GFLOPS | 818.8 GFLOPS |
M2:

Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads
---|---|---|---|---|---|---
1 (64 bytes) per thread | 51.2 GFLOPS | 82.3 GFLOPS | 123.4 GFLOPS | 157.6 GFLOPS | 197.4 GFLOPS | 234.4 GFLOPS | |
2 (128 bytes) per thread | 102.3 GFLOPS | 164.5 GFLOPS | 246.9 GFLOPS | 314.6 GFLOPS | 350.2 GFLOPS | 339.0 GFLOPS | |
3 (192 bytes) per thread | 153.5 GFLOPS | 246.8 GFLOPS | 360.2 GFLOPS | 380.8 GFLOPS | 434.3 GFLOPS | 424.7 GFLOPS | |
4 (256 bytes) per thread | 204.6 GFLOPS | 329.3 GFLOPS | 471.7 GFLOPS | 463.3 GFLOPS | 519.7 GFLOPS | 522.5 GFLOPS | |
5 (320 bytes) per thread | 255.8 GFLOPS | 411.0 GFLOPS | 557.3 GFLOPS | 524.1 GFLOPS | 568.5 GFLOPS | 572.6 GFLOPS | |
6 (384 bytes) per thread | 306.8 GFLOPS | 451.8 GFLOPS | 599.1 GFLOPS | 571.5 GFLOPS | 607.2 GFLOPS | 607.2 GFLOPS | |
7 (448 bytes) per thread | 358.2 GFLOPS | 493.7 GFLOPS | 601.1 GFLOPS | 580.0 GFLOPS | 591.6 GFLOPS | 610.4 GFLOPS | |
8 (512 bytes) per thread | 409.2 GFLOPS | 538.5 GFLOPS | 603.2 GFLOPS | 594.4 GFLOPS | 608.4 GFLOPS | 620.5 GFLOPS | |
9 (576 bytes) per thread | 408.9 GFLOPS | 540.7 GFLOPS | 605.6 GFLOPS | 583.0 GFLOPS | 604.4 GFLOPS | 617.9 GFLOPS | |
10 (640 bytes) per thread | 408.8 GFLOPS | 540.9 GFLOPS | 605.4 GFLOPS | 594.5 GFLOPS | 614.2 GFLOPS | 616.3 GFLOPS | |
11 (704 bytes) per thread | 409.1 GFLOPS | 553.3 GFLOPS | 614.4 GFLOPS | 603.7 GFLOPS | 606.4 GFLOPS | 614.8 GFLOPS | |
12 (768 bytes) per thread | 409.2 GFLOPS | 540.5 GFLOPS | 605.8 GFLOPS | 599.9 GFLOPS | 608.6 GFLOPS | 620.3 GFLOPS | |
13 (832 bytes) per thread | 409.4 GFLOPS | 550.2 GFLOPS | 614.5 GFLOPS | 594.7 GFLOPS | 606.0 GFLOPS | 608.0 GFLOPS | |
14 (896 bytes) per thread | 408.7 GFLOPS | 538.7 GFLOPS | 606.1 GFLOPS | 594.9 GFLOPS | 608.5 GFLOPS | 618.3 GFLOPS | |
15 (960 bytes) per thread | 409.1 GFLOPS | 551.0 GFLOPS | 614.5 GFLOPS | 594.3 GFLOPS | 615.0 GFLOPS | 607.6 GFLOPS | |
16 (1024 bytes) per thread | 408.8 GFLOPS | 541.0 GFLOPS | 605.3 GFLOPS | 594.7 GFLOPS | 608.9 GFLOPS | 621.0 GFLOPS |
M2, two at a time:

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|---|
1 (128 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 246.8 GFLOPS | 314.4 GFLOPS | 396.7 GFLOPS | 472.1 GFLOPS | |
2 (256 bytes) per thread | 204.6 GFLOPS | 329.3 GFLOPS | 493.8 GFLOPS | 632.7 GFLOPS | 696.4 GFLOPS | 677.2 GFLOPS | |
3 (384 bytes) per thread | 306.5 GFLOPS | 493.9 GFLOPS | 626.3 GFLOPS | 680.0 GFLOPS | 710.9 GFLOPS | 702.7 GFLOPS | |
4 (512 bytes) per thread | 409.0 GFLOPS | 657.8 GFLOPS | 701.5 GFLOPS | 702.0 GFLOPS | 712.3 GFLOPS | 712.9 GFLOPS | |
5 (640 bytes) per thread | 409.2 GFLOPS | 658.7 GFLOPS | 702.3 GFLOPS | 708.4 GFLOPS | 712.3 GFLOPS | 714.1 GFLOPS | |
6 (768 bytes) per thread | 409.2 GFLOPS | 658.1 GFLOPS | 702.8 GFLOPS | 701.5 GFLOPS | 712.4 GFLOPS | 698.1 GFLOPS | |
7 (896 bytes) per thread | 409.3 GFLOPS | 658.4 GFLOPS | 702.8 GFLOPS | 710.9 GFLOPS | 712.2 GFLOPS | 709.4 GFLOPS | |
8 (1024 bytes) per thread | 408.9 GFLOPS | 658.3 GFLOPS | 702.6 GFLOPS | 700.4 GFLOPS | 712.4 GFLOPS | 712.6 GFLOPS | |
9 (1152 bytes) per thread | 409.1 GFLOPS | 658.4 GFLOPS | 702.8 GFLOPS | 701.3 GFLOPS | 712.2 GFLOPS | 698.1 GFLOPS | |
10 (1280 bytes) per thread | 409.1 GFLOPS | 658.5 GFLOPS | 702.7 GFLOPS | 701.4 GFLOPS | 712.3 GFLOPS | 697.7 GFLOPS | |
11 (1408 bytes) per thread | 409.4 GFLOPS | 658.6 GFLOPS | 702.5 GFLOPS | 702.0 GFLOPS | 712.1 GFLOPS | 698.6 GFLOPS | |
12 (1536 bytes) per thread | 409.2 GFLOPS | 658.4 GFLOPS | 702.7 GFLOPS | 704.1 GFLOPS | 712.1 GFLOPS | 697.9 GFLOPS | |
13 (1664 bytes) per thread | 409.2 GFLOPS | 657.2 GFLOPS | 702.6 GFLOPS | 701.9 GFLOPS | 712.3 GFLOPS | 698.7 GFLOPS | |
14 (1792 bytes) per thread | 409.1 GFLOPS | 658.3 GFLOPS | 702.2 GFLOPS | 711.5 GFLOPS | 712.0 GFLOPS | 710.3 GFLOPS | |
15 (1920 bytes) per thread | 409.0 GFLOPS | 657.4 GFLOPS | 702.4 GFLOPS | 701.7 GFLOPS | 712.3 GFLOPS | 714.4 GFLOPS | |
16 (2048 bytes) per thread | 409.0 GFLOPS | 658.2 GFLOPS | 702.7 GFLOPS | 707.3 GFLOPS | 707.0 GFLOPS | 715.4 GFLOPS |
M2, four at a time:

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|---|
1 (256 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 493.7 GFLOPS | 403.3 GFLOPS | 502.0 GFLOPS | 608.1 GFLOPS | |
2 (512 bytes) per thread | 409.1 GFLOPS | 657.9 GFLOPS | 702.7 GFLOPS | 516.3 GFLOPS | 636.5 GFLOPS | 629.4 GFLOPS | |
3 (768 bytes) per thread | 409.2 GFLOPS | 658.1 GFLOPS | 702.8 GFLOPS | 510.1 GFLOPS | 652.8 GFLOPS | 642.5 GFLOPS | |
4 (1024 bytes) per thread | 409.3 GFLOPS | 658.3 GFLOPS | 702.3 GFLOPS | 504.4 GFLOPS | 652.8 GFLOPS | 644.4 GFLOPS | |
5 (1280 bytes) per thread | 409.1 GFLOPS | 658.4 GFLOPS | 702.6 GFLOPS | 515.9 GFLOPS | 653.2 GFLOPS | 648.9 GFLOPS | |
6 (1536 bytes) per thread | 409.0 GFLOPS | 658.4 GFLOPS | 702.6 GFLOPS | 516.0 GFLOPS | 652.1 GFLOPS | 642.9 GFLOPS | |
7 (1792 bytes) per thread | 409.2 GFLOPS | 658.2 GFLOPS | 702.5 GFLOPS | 510.2 GFLOPS | 466.7 GFLOPS | 643.1 GFLOPS | |
8 (2048 bytes) per thread | 409.1 GFLOPS | 658.1 GFLOPS | 702.2 GFLOPS | 516.1 GFLOPS | 651.8 GFLOPS | 643.0 GFLOPS | |
9 (2304 bytes) per thread | 409.3 GFLOPS | 657.7 GFLOPS | 702.2 GFLOPS | 501.7 GFLOPS | 619.4 GFLOPS | 646.4 GFLOPS | |
10 (2560 bytes) per thread | 409.2 GFLOPS | 658.7 GFLOPS | 702.8 GFLOPS | 516.2 GFLOPS | 652.1 GFLOPS | 635.1 GFLOPS | |
11 (2816 bytes) per thread | 409.3 GFLOPS | 650.2 GFLOPS | 702.6 GFLOPS | 504.3 GFLOPS | 652.9 GFLOPS | 638.9 GFLOPS | |
12 (3072 bytes) per thread | 409.0 GFLOPS | 658.4 GFLOPS | 701.7 GFLOPS | 515.3 GFLOPS | 653.2 GFLOPS | 643.3 GFLOPS | |
13 (3328 bytes) per thread | 409.2 GFLOPS | 650.1 GFLOPS | 702.6 GFLOPS | 516.2 GFLOPS | 652.5 GFLOPS | 636.3 GFLOPS | |
14 (3584 bytes) per thread | 409.3 GFLOPS | 649.5 GFLOPS | 703.0 GFLOPS | 516.0 GFLOPS | 652.6 GFLOPS | 627.6 GFLOPS | |
15 (3840 bytes) per thread | 409.4 GFLOPS | 658.4 GFLOPS | 702.8 GFLOPS | 516.2 GFLOPS | 652.6 GFLOPS | 640.9 GFLOPS | |
16 (4096 bytes) per thread | 409.1 GFLOPS | 658.3 GFLOPS | 702.9 GFLOPS | 504.0 GFLOPS | 652.5 GFLOPS | 638.1 GFLOPS |
BF16xBF16=BF16, M2:

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|---|
1 (64 bytes) per thread | 51.2 GFLOPS | 82.3 GFLOPS | 123.5 GFLOPS | 157.6 GFLOPS | 197.1 GFLOPS | 236.2 GFLOPS | |
2 (128 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 246.7 GFLOPS | 314.0 GFLOPS | 348.5 GFLOPS | 337.2 GFLOPS | |
3 (192 bytes) per thread | 153.3 GFLOPS | 246.8 GFLOPS | 358.2 GFLOPS | 380.4 GFLOPS | 431.5 GFLOPS | 426.8 GFLOPS | |
4 (256 bytes) per thread | 204.5 GFLOPS | 329.2 GFLOPS | 473.3 GFLOPS | 464.6 GFLOPS | 516.2 GFLOPS | 532.0 GFLOPS | |
5 (320 bytes) per thread | 255.3 GFLOPS | 410.9 GFLOPS | 558.2 GFLOPS | 528.9 GFLOPS | 570.6 GFLOPS | 572.0 GFLOPS | |
6 (384 bytes) per thread | 306.8 GFLOPS | 452.0 GFLOPS | 599.3 GFLOPS | 572.4 GFLOPS | 605.5 GFLOPS | 607.1 GFLOPS | |
7 (448 bytes) per thread | 357.9 GFLOPS | 494.2 GFLOPS | 601.6 GFLOPS | 579.4 GFLOPS | 601.6 GFLOPS | 613.0 GFLOPS | |
8 (512 bytes) per thread | 409.4 GFLOPS | 538.6 GFLOPS | 602.6 GFLOPS | 594.5 GFLOPS | 617.9 GFLOPS | 616.6 GFLOPS | |
9 (576 bytes) per thread | 409.2 GFLOPS | 540.3 GFLOPS | 606.1 GFLOPS | 600.5 GFLOPS | 604.3 GFLOPS | 605.5 GFLOPS | |
10 (640 bytes) per thread | 408.9 GFLOPS | 539.8 GFLOPS | 605.7 GFLOPS | 594.9 GFLOPS | 608.8 GFLOPS | 611.5 GFLOPS | |
11 (704 bytes) per thread | 408.7 GFLOPS | 553.3 GFLOPS | 614.7 GFLOPS | 595.3 GFLOPS | 606.2 GFLOPS | 618.3 GFLOPS | |
12 (768 bytes) per thread | 409.2 GFLOPS | 540.9 GFLOPS | 605.6 GFLOPS | 598.7 GFLOPS | 611.0 GFLOPS | 608.8 GFLOPS | |
13 (832 bytes) per thread | 409.2 GFLOPS | 550.6 GFLOPS | 614.4 GFLOPS | 599.6 GFLOPS | 611.2 GFLOPS | 608.7 GFLOPS | |
14 (896 bytes) per thread | 409.4 GFLOPS | 540.5 GFLOPS | 606.1 GFLOPS | 594.9 GFLOPS | 608.4 GFLOPS | 612.6 GFLOPS | |
15 (960 bytes) per thread | 408.7 GFLOPS | 551.0 GFLOPS | 614.7 GFLOPS | 593.0 GFLOPS | 607.4 GFLOPS | 607.5 GFLOPS | |
16 (1024 bytes) per thread | 409.0 GFLOPS | 540.6 GFLOPS | 605.6 GFLOPS | 594.6 GFLOPS | 616.6 GFLOPS | 608.4 GFLOPS |
Two at a time:

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|---|
1 (128 bytes) per thread | 102.3 GFLOPS | 164.6 GFLOPS | 246.9 GFLOPS | 315.3 GFLOPS | 392.8 GFLOPS | 468.2 GFLOPS | |
2 (256 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 493.9 GFLOPS | 629.2 GFLOPS | 691.9 GFLOPS | 681.5 GFLOPS | |
3 (384 bytes) per thread | 306.9 GFLOPS | 493.3 GFLOPS | 626.1 GFLOPS | 677.4 GFLOPS | 711.0 GFLOPS | 699.5 GFLOPS | |
4 (512 bytes) per thread | 409.5 GFLOPS | 658.3 GFLOPS | 702.9 GFLOPS | 707.9 GFLOPS | 712.2 GFLOPS | 697.7 GFLOPS | |
5 (640 bytes) per thread | 409.3 GFLOPS | 657.7 GFLOPS | 702.5 GFLOPS | 710.4 GFLOPS | 712.1 GFLOPS | 708.3 GFLOPS | |
6 (768 bytes) per thread | 409.2 GFLOPS | 657.7 GFLOPS | 702.5 GFLOPS | 702.1 GFLOPS | 712.2 GFLOPS | 697.2 GFLOPS | |
7 (896 bytes) per thread | 409.0 GFLOPS | 658.3 GFLOPS | 702.6 GFLOPS | 705.6 GFLOPS | 712.2 GFLOPS | 712.9 GFLOPS | |
8 (1024 bytes) per thread | 409.1 GFLOPS | 657.3 GFLOPS | 702.3 GFLOPS | 701.8 GFLOPS | 712.0 GFLOPS | 697.7 GFLOPS | |
9 (1152 bytes) per thread | 409.1 GFLOPS | 658.4 GFLOPS | 702.7 GFLOPS | 701.5 GFLOPS | 712.1 GFLOPS | 697.5 GFLOPS | |
10 (1280 bytes) per thread | 409.0 GFLOPS | 657.5 GFLOPS | 702.7 GFLOPS | 711.4 GFLOPS | 712.3 GFLOPS | 713.0 GFLOPS | |
11 (1408 bytes) per thread | 409.0 GFLOPS | 658.5 GFLOPS | 702.5 GFLOPS | 701.5 GFLOPS | 712.4 GFLOPS | 714.4 GFLOPS | |
12 (1536 bytes) per thread | 409.8 GFLOPS | 657.8 GFLOPS | 702.9 GFLOPS | 702.0 GFLOPS | 712.2 GFLOPS | 696.9 GFLOPS | |
13 (1664 bytes) per thread | 409.1 GFLOPS | 658.5 GFLOPS | 701.4 GFLOPS | 702.0 GFLOPS | 712.3 GFLOPS | 698.0 GFLOPS | |
14 (1792 bytes) per thread | 409.1 GFLOPS | 657.6 GFLOPS | 702.9 GFLOPS | 701.3 GFLOPS | 712.3 GFLOPS | 709.1 GFLOPS | |
15 (1920 bytes) per thread | 409.2 GFLOPS | 658.5 GFLOPS | 702.8 GFLOPS | 707.6 GFLOPS | 712.0 GFLOPS | 711.8 GFLOPS | |
16 (2048 bytes) per thread | 409.1 GFLOPS | 658.5 GFLOPS | 702.7 GFLOPS | 701.6 GFLOPS | 712.4 GFLOPS | 708.5 GFLOPS |
Four at a time:

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|---|
1 (256 bytes) per thread | 204.5 GFLOPS | 330.6 GFLOPS | 494.1 GFLOPS | 403.3 GFLOPS | 502.1 GFLOPS | 604.4 GFLOPS | |
2 (512 bytes) per thread | 409.2 GFLOPS | 649.7 GFLOPS | 702.5 GFLOPS | 508.3 GFLOPS | 652.9 GFLOPS | 638.7 GFLOPS | |
3 (768 bytes) per thread | 408.8 GFLOPS | 658.3 GFLOPS | 702.6 GFLOPS | 510.1 GFLOPS | 630.0 GFLOPS | 655.2 GFLOPS | |
4 (1024 bytes) per thread | 409.2 GFLOPS | 658.3 GFLOPS | 702.6 GFLOPS | 515.7 GFLOPS | 637.4 GFLOPS | 634.8 GFLOPS | |
5 (1280 bytes) per thread | 409.2 GFLOPS | 658.0 GFLOPS | 702.9 GFLOPS | 507.9 GFLOPS | 651.6 GFLOPS | 643.5 GFLOPS | |
6 (1536 bytes) per thread | 409.3 GFLOPS | 657.4 GFLOPS | 702.7 GFLOPS | 513.9 GFLOPS | 641.4 GFLOPS | 631.0 GFLOPS | |
7 (1792 bytes) per thread | 409.2 GFLOPS | 649.3 GFLOPS | 702.8 GFLOPS | 505.2 GFLOPS | 652.3 GFLOPS | 634.2 GFLOPS | |
8 (2048 bytes) per thread | 409.1 GFLOPS | 657.8 GFLOPS | 702.4 GFLOPS | 516.0 GFLOPS | 629.6 GFLOPS | 655.2 GFLOPS | |
9 (2304 bytes) per thread | 409.2 GFLOPS | 658.0 GFLOPS | 702.3 GFLOPS | 509.5 GFLOPS | 652.2 GFLOPS | 639.5 GFLOPS | |
10 (2560 bytes) per thread | 409.1 GFLOPS | 658.2 GFLOPS | 702.7 GFLOPS | 507.3 GFLOPS | 652.0 GFLOPS | 646.9 GFLOPS | |
11 (2816 bytes) per thread | 409.2 GFLOPS | 657.9 GFLOPS | 702.6 GFLOPS | 508.8 GFLOPS | 651.9 GFLOPS | 637.6 GFLOPS | |
12 (3072 bytes) per thread | 409.0 GFLOPS | 650.2 GFLOPS | 702.6 GFLOPS | 516.0 GFLOPS | 653.0 GFLOPS | 623.4 GFLOPS | |
13 (3328 bytes) per thread | 409.2 GFLOPS | 658.7 GFLOPS | 702.7 GFLOPS | 515.3 GFLOPS | 652.6 GFLOPS | 637.0 GFLOPS | |
14 (3584 bytes) per thread | 409.6 GFLOPS | 657.9 GFLOPS | 702.8 GFLOPS | 537.2 GFLOPS | 622.0 GFLOPS | 648.5 GFLOPS | |
15 (3840 bytes) per thread | 409.4 GFLOPS | 657.7 GFLOPS | 702.8 GFLOPS | 515.8 GFLOPS | 653.2 GFLOPS | 632.6 GFLOPS | |
16 (4096 bytes) per thread | 409.2 GFLOPS | 657.9 GFLOPS | 702.9 GFLOPS | 516.1 GFLOPS | 652.9 GFLOPS | 634.1 GFLOPS |
Conclusions from all that:
Thanks for the data! I suppose that if FP32 becomes enough of a bottleneck that you're considering BF16, it's best to just use the GPU instead of the AMX. I also realized that GPT-4 can help me solve GPU FP64 emulation, so there's less need to use the AMX.
I am curious about the performance of interleaved complex multiplication. The M2 can oversubscribe the AMX without changing maximum FLOPS. Could your benchmarks test a small sequence of instructions that reads the interleaved numbers from memory and tries to achieve maximum FLOPS?* I'll still test Accelerate BLAS, but this would provide a more direct theoretical benchmark. Apple has to have provided some kind of real-world improvement from this ISA change; maybe it's fixing underutilization during complex multiplication.
*My hypothesis: M1 Max should never exceed ~37.5% theoretical FLOPS, while M2 should reach ~75% maximum FLOPS.
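For concreteness, here is the arithmetic behind those two percentages as I imagine it. This is my own back-of-the-envelope model, not a measurement: I'm assuming an interleaved complex multiply-accumulate keeps the real FMA lanes ~75% busy, and that a single M1 core can only occupy half of its AMX cluster's FMA circuits, while an M2 core can oversubscribe the cluster:

```python
# Hedged back-of-the-envelope model for the hypothesis above.
# Both factors are assumptions of mine, not measured values.

complex_efficiency = 0.75        # assumed FMA-lane utilization for interleaved complex FMA

m1_single_core_occupancy = 0.5   # assumed: one M1 core cannot oversubscribe its AMX cluster
m2_single_core_occupancy = 1.0   # assumed: one M2 core can

m1_peak_fraction = complex_efficiency * m1_single_core_occupancy
m2_peak_fraction = complex_efficiency * m2_single_core_occupancy

print(f"M1 Max: {m1_peak_fraction:.1%} of theoretical FLOPS")  # ~37.5%
print(f"M2:     {m2_peak_fraction:.1%} of theoretical FLOPS")  # ~75%
```

If the benchmark lands well away from these fractions, one of the two assumed factors is wrong.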
Also disappointing: AMX vector throughput is less than CPU NEON vector throughput. Perhaps that's why Apple's BLAS library consistently underperforms OpenBLAS by a factor of two: instead of using the NEON units in a multithreaded setting, the CPU cores all fight over the same AMX block, which has lower theoretical FLOPS. The GPU would not have this limitation; its theoretical vector FLOPS actually exceed its theoretical matrix FLOPS.
For my purposes, I have the following FP64 throughputs:
The takeaway: when using any accelerator, your vector FP64 throughput is going to decrease by approximately a factor of 2. The AMX is no better than the GPU in this regard; it would mostly help in the rare case of multiplying two massive FP64 matrices. I recall that the two-stage eigendecomposition algorithm by Dongarra is technically O(n^3) in computational complexity, but only because it's ~n layers of O(n^2) computations. There would be little opportunity to multiply two massive matrices, even with the bulge-chasing stage. This principle probably applies to the rest of linear algebra as well, which is why OpenBLAS is faster than Accelerate for LU decomposition, or anything besides GEMM.
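To make the "~n layers of O(n^2)" point concrete, a trivial illustrative count (my sketch, not Dongarra's actual flop count): the total work is cubic, yet it arrives as many small quadratic steps, none of which is a large matrix-matrix product that the AMX could saturate.

```python
# Illustrative model (assumptions mine): the reduction performs roughly
# n sweeps, each doing on the order of n^2 flops, so the total is O(n^3)
# even though no individual step is a big GEMM.

def layered_flops(n, c=1.0):
    """Total work from n layers of c * n^2 flops each (= c * n^3)."""
    return sum(c * n * n for _ in range(n))

n = 64
print(layered_flops(n), n ** 3)  # same total, but spread over n small steps
```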
> Apple has to have provided some kind of real-world improvement from this ISA change.
It looks like four-at-a-time gets (up to) double the throughput when any broadcast mode other than mode 0 is used (provided you're not bottlenecked on Z accumulators). This suggests another bottleneck in the equations: bandwidth out of the (seemingly combined) X/Y register file. Mode 0 requires two loads from the register file per iteration, whereas the other modes need two loads on the first iteration but can then get away with only one load per iteration for subsequent iterations.
As a concrete example, vecfp F32xF32=F32 four-at-a-time mode 0:

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|---|
1 (256 bytes) per thread | 102.5 GFLOPS | 164.8 GFLOPS | 247.1 GFLOPS | 286.1 GFLOPS | 264.1 GFLOPS | 298.3 GFLOPS | |
2 (512 bytes) per thread | 204.9 GFLOPS | 329.5 GFLOPS | 351.2 GFLOPS | 206.3 GFLOPS | 326.6 GFLOPS | 324.6 GFLOPS | |
3 (768 bytes) per thread | 205.0 GFLOPS | 329.3 GFLOPS | 351.4 GFLOPS | 211.5 GFLOPS | 273.0 GFLOPS | 327.9 GFLOPS | |
4 (1024 bytes) per thread | 205.0 GFLOPS | 327.1 GFLOPS | 351.5 GFLOPS | 205.9 GFLOPS | 323.3 GFLOPS | 324.2 GFLOPS | |
5 (1280 bytes) per thread | 204.9 GFLOPS | 329.5 GFLOPS | 351.3 GFLOPS | 206.1 GFLOPS | 326.7 GFLOPS | 316.7 GFLOPS | |
6 (1536 bytes) per thread | 204.9 GFLOPS | 329.5 GFLOPS | 351.5 GFLOPS | 208.1 GFLOPS | 326.5 GFLOPS | 325.3 GFLOPS | |
7 (1792 bytes) per thread | 204.9 GFLOPS | 328.6 GFLOPS | 351.5 GFLOPS | 208.1 GFLOPS | 326.5 GFLOPS | 325.1 GFLOPS | |
8 (2048 bytes) per thread | 205.0 GFLOPS | 327.2 GFLOPS | 351.5 GFLOPS | 206.1 GFLOPS | 320.9 GFLOPS | 324.5 GFLOPS | |
9 (2304 bytes) per thread | 205.0 GFLOPS | 329.5 GFLOPS | 351.5 GFLOPS | 209.2 GFLOPS | 318.4 GFLOPS | 325.3 GFLOPS | |
10 (2560 bytes) per thread | 205.0 GFLOPS | 329.5 GFLOPS | 351.5 GFLOPS | 205.4 GFLOPS | 322.5 GFLOPS | 325.1 GFLOPS | |
11 (2816 bytes) per thread | 205.0 GFLOPS | 329.4 GFLOPS | 351.4 GFLOPS | 206.7 GFLOPS | 326.6 GFLOPS | 326.9 GFLOPS | |
12 (3072 bytes) per thread | 204.9 GFLOPS | 327.2 GFLOPS | 351.4 GFLOPS | 208.1 GFLOPS | 323.8 GFLOPS | 327.9 GFLOPS | |
13 (3328 bytes) per thread | 204.9 GFLOPS | 329.4 GFLOPS | 351.5 GFLOPS | 205.6 GFLOPS | 326.6 GFLOPS | 326.6 GFLOPS | |
14 (3584 bytes) per thread | 205.0 GFLOPS | 327.2 GFLOPS | 351.5 GFLOPS | 205.8 GFLOPS | 326.6 GFLOPS | 324.9 GFLOPS | |
15 (3840 bytes) per thread | 205.0 GFLOPS | 329.4 GFLOPS | 351.3 GFLOPS | 206.4 GFLOPS | 325.6 GFLOPS | 323.4 GFLOPS | |
16 (4096 bytes) per thread | 205.0 GFLOPS | 329.4 GFLOPS | 351.4 GFLOPS | 206.9 GFLOPS | 326.5 GFLOPS | 325.6 GFLOPS |
Versus any other broadcast mode:

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|---|
1 (256 bytes) per thread | 102.5 GFLOPS | 164.7 GFLOPS | 247.1 GFLOPS | 286.8 GFLOPS | 357.7 GFLOPS | 368.6 GFLOPS | |
2 (512 bytes) per thread | 205.0 GFLOPS | 329.5 GFLOPS | 494.3 GFLOPS | 464.8 GFLOPS | 502.7 GFLOPS | 540.1 GFLOPS | |
3 (768 bytes) per thread | 307.4 GFLOPS | 410.9 GFLOPS | 528.5 GFLOPS | 505.9 GFLOPS | 530.3 GFLOPS | 549.7 GFLOPS | |
4 (1024 bytes) per thread | 409.9 GFLOPS | 505.2 GFLOPS | 548.0 GFLOPS | 541.9 GFLOPS | 551.1 GFLOPS | 559.2 GFLOPS | |
5 (1280 bytes) per thread | 409.7 GFLOPS | 505.3 GFLOPS | 547.7 GFLOPS | 541.9 GFLOPS | 554.1 GFLOPS | 553.6 GFLOPS | |
6 (1536 bytes) per thread | 409.8 GFLOPS | 505.3 GFLOPS | 547.9 GFLOPS | 542.0 GFLOPS | 550.2 GFLOPS | 559.3 GFLOPS | |
7 (1792 bytes) per thread | 409.9 GFLOPS | 505.0 GFLOPS | 547.8 GFLOPS | 541.9 GFLOPS | 550.4 GFLOPS | 559.5 GFLOPS | |
8 (2048 bytes) per thread | 409.8 GFLOPS | 505.3 GFLOPS | 547.8 GFLOPS | 542.1 GFLOPS | 550.5 GFLOPS | 559.4 GFLOPS | |
9 (2304 bytes) per thread | 409.9 GFLOPS | 505.5 GFLOPS | 547.1 GFLOPS | 541.8 GFLOPS | 550.5 GFLOPS | 554.9 GFLOPS | |
10 (2560 bytes) per thread | 409.9 GFLOPS | 505.3 GFLOPS | 548.1 GFLOPS | 541.8 GFLOPS | 550.4 GFLOPS | 559.3 GFLOPS | |
11 (2816 bytes) per thread | 409.9 GFLOPS | 505.4 GFLOPS | 547.8 GFLOPS | 540.9 GFLOPS | 545.5 GFLOPS | 557.9 GFLOPS | |
12 (3072 bytes) per thread | 409.9 GFLOPS | 505.2 GFLOPS | 548.1 GFLOPS | 542.0 GFLOPS | 550.4 GFLOPS | 559.1 GFLOPS | |
13 (3328 bytes) per thread | 409.9 GFLOPS | 505.4 GFLOPS | 547.8 GFLOPS | 542.6 GFLOPS | 550.4 GFLOPS | 549.7 GFLOPS | |
14 (3584 bytes) per thread | 409.8 GFLOPS | 505.4 GFLOPS | 547.9 GFLOPS | 545.1 GFLOPS | 550.4 GFLOPS | 559.0 GFLOPS | |
15 (3840 bytes) per thread | 410.0 GFLOPS | 505.3 GFLOPS | 547.8 GFLOPS | 544.8 GFLOPS | 550.4 GFLOPS | 558.7 GFLOPS | |
16 (4096 bytes) per thread | 409.8 GFLOPS | 505.2 GFLOPS | 547.9 GFLOPS | 541.9 GFLOPS | 550.5 GFLOPS | 555.1 GFLOPS |
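The mode-0 vs broadcast gap in the two tables above can be sketched with a toy load-bandwidth model. This is my simplification of the register-file explanation, with an assumed capacity of one X/Y load serviced per cycle: mode 0 needs two loads every iteration, while the other modes amortize to one load per iteration after the first, so mode 0 sustains half the issue rate.

```python
# Toy model (assumptions mine): the combined X/Y register file services
# `loads_per_cycle` reads, and each vecfp iteration needs `loads_per_iter`
# reads. Sustainable issue rate is capped by that load bandwidth.

def sustained_fraction(loads_per_iter, loads_per_cycle=1.0):
    """Fraction of peak issue rate sustainable under the load-port cap."""
    return min(1.0, loads_per_cycle / loads_per_iter)

mode0 = sustained_fraction(2)      # mode 0: two X/Y loads per iteration
broadcast = sustained_fraction(1)  # other modes: one load per iteration after warm-up

print(mode0, broadcast)  # mode 0 sustains half the broadcast-mode rate
```

A 2x ratio is roughly what the single-thread columns show (~205 vs ~410 GFLOPS), provided enough Z accumulators are in play that Z isn't the bottleneck instead.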
After analyzing the die shots and speculating on performance, I came across a major change to the AMX architecture. Would you mind reading through the README of amx-benchmarks and helping me test the hypothesis? You don't need to rent an M2 from the cloud; I can test on my A15.
permalink for the hypothesis in question