ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware
https://rocm.docs.amd.com/projects/omniperf/en/latest/
MIT License

fix max BF16 flop rate on CDNA2 #155

Closed: skyreflectedinmirrors closed 1 year ago

skyreflectedinmirrors commented 1 year ago

According to the AMD Matrix Instruction Calculator, some BF16 ops reach up to 1024 FLOPs/CU/cycle on CDNA2, e.g.:

$ ./matrix_calculator.py --architecture CDNA2 --instruction v_mfma_f32_32x32x8bf16_1k --detail-instruction
Architecture: CDNA2
Instruction: V_MFMA_F32_32X32X8BF16_1K
    Encoding: VOP3P-MAI
    VOP3P Opcode: 0x66
    VOP3P-MAI Opcode: 0x26
    Matrix Dimensions:
        M: 32
        N: 32
        K: 8
        blocks: 1
    Execution statistics:
        FLOPs: 16384
        Execution cycles: 64
        **FLOPs/CU/cycle: 1024**
        Can co-execute with VALU: True
        VALU co-execution cycles possible: 60
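
The 1024 figure in the output above can be reproduced by hand. The following is a minimal sketch, not the calculator's actual code: the helper names are hypothetical, and it assumes a CDNA2 compute unit has 4 matrix cores (one per SIMD), which the output does not state explicitly.

```python
# Assumption (not shown in the calculator output): one matrix core per SIMD,
# 4 SIMDs per CDNA2 compute unit.
MATRIX_CORES_PER_CU = 4


def mfma_flops(m, n, k, blocks=1):
    """FLOPs for one MFMA instruction: each multiply-add counts as 2 FLOPs."""
    return 2 * m * n * k * blocks


def flops_per_cu_per_cycle(m, n, k, blocks, cycles,
                           cores_per_cu=MATRIX_CORES_PER_CU):
    """Per-CU rate when every matrix core issues this instruction back to back."""
    return mfma_flops(m, n, k, blocks) * cores_per_cu // cycles


# v_mfma_f32_32x32x8bf16_1k: M=32, N=32, K=8, blocks=1, 64 execution cycles
flops = mfma_flops(32, 32, 8, 1)
rate = flops_per_cu_per_cycle(32, 32, 8, 1, cycles=64)
print(flops, rate)  # prints "16384 1024"
```

Under that assumption, 16384 FLOPs over 64 cycles gives 256 FLOPs/cycle per matrix core, and 4 cores give 1024 FLOPs/CU/cycle.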

Therefore, our peak calculations, which currently assume a maximum of 512 FLOPs/CU/cycle, are wrong. This can be verified experimentally with a simple (unoptimized) kernel that spams these ops, e.g.:

image

After this fix, we get:

image
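
To illustrate how the per-CU rate feeds into a device-level peak, here is a rough sketch. The function name is hypothetical, and the CU count and clock (110 CUs at 1700 MHz, roughly one MI250X GCD) are assumptions for illustration only, not values taken from this PR or from the profiler's code.

```python
def peak_tflops(flops_per_cu_per_cycle, num_cus, clock_mhz):
    """Theoretical peak in TFLOP/s: per-CU rate x CU count x clock frequency."""
    return flops_per_cu_per_cycle * num_cus * clock_mhz * 1e6 / 1e12


# Assumed device parameters (illustrative only)
NUM_CUS = 110
CLOCK_MHZ = 1700

old_peak = peak_tflops(512, NUM_CUS, CLOCK_MHZ)   # rate assumed before the fix
new_peak = peak_tflops(1024, NUM_CUS, CLOCK_MHZ)  # rate reported by the calculator

print(f"old peak: {old_peak:.1f} TFLOP/s")
print(f"new peak: {new_peak:.1f} TFLOP/s")
print(f"ratio:    {new_peak / old_peak:.1f}x")
```

With the old value the reported peak is half of what the hardware can reach, so a kernel that saturates these instructions could appear to exceed 100% of peak.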

coleramos425 commented 1 year ago

Looks good to me