fix max BF16 flop rate on CDNA2

According to the AMD Matrix Instruction Calculator, some BF16 MFMA instructions reach 1024 FLOPs/CU/cycle on CDNA2, e.g.:

$ ./matrix_calculator.py --architecture CDNA2 --instruction v_mfma_f32_32x32x8bf16_1k --detail-instruction
Architecture: CDNA2
Instruction: V_MFMA_F32_32X32X8BF16_1K
    Encoding: VOP3P-MAI
    VOP3P Opcode: 0x66
    VOP3P-MAI Opcode: 0x26
    Matrix Dimensions:
        M: 32
        N: 32
        K: 8
        blocks: 1
    Execution statistics:
        FLOPs: 16384
        Execution cycles: 64
        **FLOPs/CU/cycle: 1024**
        Can co-execute with VALU: True
        VALU co-execution cycles possible: 60
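
The 1024 figure can be reproduced by hand from the matrix dimensions and cycle count in the output above. The following is a minimal sketch, not code from this repository: the helper names are hypothetical, and it assumes 2 FLOPs per multiply-accumulate and 4 matrix cores (one per SIMD) per CU on CDNA2.

```python
# Hypothetical helpers for illustration; not part of rocprofiler-compute
# or the matrix calculator.

def mfma_flops(m: int, n: int, k: int, blocks: int = 1) -> int:
    """FLOPs for one MFMA instruction, counting each multiply-accumulate as 2 FLOPs."""
    return 2 * m * n * k * blocks


def flops_per_cu_per_cycle(flops: int, cycles: int, matrix_cores_per_cu: int = 4) -> int:
    """Per-CU rate, assuming every matrix core in the CU issues the same instruction."""
    # Assumption: 4 matrix cores per CU (one per SIMD) on CDNA2.
    return flops * matrix_cores_per_cu // cycles


# Values taken from the v_mfma_f32_32x32x8bf16_1k output above.
flops = mfma_flops(m=32, n=32, k=8, blocks=1)
rate = flops_per_cu_per_cycle(flops, cycles=64)
print(flops, rate)  # 16384 1024
```

Under these assumptions, a single matrix core sustains 256 FLOPs/cycle for this instruction, and the per-CU rate is four times that.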

Therefore, our peak calculations, which here assume a maximum of 512 FLOPs/CU/cycle, are wrong. This can be verified experimentally with a simple (unoptimized) kernel that spams these ops, e.g.:

After this fix, we get:

ROCm/rocprofiler-compute #155