From the AMD Matrix Calculator, some BF16 ops get up to 1024 FLOPs/CU/Cycle, e.g.:
```
$ ./matrix_calculator.py --architecture CDNA2 --instruction v_mfma_f32_32x32x8bf16_1k --detail-instruction
Architecture: CDNA2
Instruction: V_MFMA_F32_32X32X8BF16_1K
Encoding: VOP3P-MAI
VOP3P Opcode: 0x66
VOP3P-MAI Opcode: 0x26
Matrix Dimensions:
    M: 32
    N: 32
    K: 8
    blocks: 1
Execution statistics:
    FLOPs: 16384
    Execution cycles: 64
    **FLOPs/CU/cycle: 1024**
    Can co-execute with VALU: True
    VALU co-execution cycles possible: 60
```
Therefore, our peak calculations, which currently assume a maximum of 512, are wrong. This can be verified experimentally with a simple (unoptimized) kernel that spams these ops, e.g.:
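A minimal sketch of such a kernel (not from the original report), assuming a HIP build for gfx90a (CDNA2, e.g. `hipcc --offload-arch=gfx90a`) and the older-ROCm form of the `__builtin_amdgcn_mfma_f32_32x32x8bf16_1k` builtin, where the A/B operands are packed as 4 x 16-bit values; newer toolchains may expect `__bf16` vectors instead:

```cpp
#include <hip/hip_runtime.h>

// Assumed operand types for the MFMA builtin on older ROCm toolchains:
// A/B as 4 x 16-bit packed bf16 bit patterns, accumulator as 16 floats.
typedef short   bf16x4   __attribute__((ext_vector_type(4)));
typedef float   floatx16 __attribute__((ext_vector_type(16)));

__global__ void spam_mfma(floatx16* out)
{
    bf16x4 a = {0, 0, 0, 0};
    bf16x4 b = {0, 0, 0, 0};
    floatx16 c = {};

    // Issue v_mfma_f32_32x32x8bf16_1k back to back. Each call is a
    // M=32, N=32, K=8 MAC, i.e. 2*32*32*8 = 16384 FLOPs per wavefront,
    // matching the calculator output above.
    for (int i = 0; i < (1 << 16); ++i) {
        c = __builtin_amdgcn_mfma_f32_32x32x8bf16_1k(a, b, c, 0, 0, 0);
    }

    // Store the accumulator so the loop is not optimized away.
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;
}
```

Timing this kernel across many wavefronts and dividing the issued FLOPs by elapsed cycles per CU should land near 1024 FLOPs/CU/cycle rather than 512 if the calculator's number is correct.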
After this fix, we get:
Looks good to me