Closed: nishshah0 closed this issue 8 months ago
MFMA FLOPs (BF16) leverages SQ_INSTS_VALU_MFMA_MOPS_BF16. Its value is computed using
AVG((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (EndNs - BeginNs))
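As a minimal sketch (not the actual Omniperf implementation), the equation above boils down to a FLOPs-over-nanoseconds ratio, which yields GFLOPS directly; the dispatch values below are hypothetical:

```python
# Sketch of the MFMA FLOPs (BF16) equation; names mirror the counters above.
# The 512 FLOPs/instruction factor is the one quoted from the MI-200 config.
def mfma_bf16_gflops(mfma_mops_bf16, begin_ns, end_ns, flops_per_instr=512):
    # FLOPs divided by nanoseconds gives GFLOPS directly (1e9 FLOPs / 1e9 ns).
    return (mfma_mops_bf16 * flops_per_instr) / (end_ns - begin_ns)

# Hypothetical dispatch: 100 MFMA BF16 instructions over 10_000 ns.
print(mfma_bf16_gflops(100, 0, 10_000))  # 5.12 (GFLOPS)
```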
@arghdos may be able to provide a more detailed description of the counter.
The 512 BF16 FLOPs/instruction value in the MI-200 equation appears to be incorrect, at least according to https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/, but we'll need to double-check.
That said, @shaw586, assuming you did a single BF16 operation is probably incorrect.
You need to take the value of MFMA-BF16 in the MFMA Arithmetic Instr Mix and multiply that by the number of waves launched to get the total number of BF16 operations.
Then take the total number of BF16 ops, multiply by the FLOP/op count (seemingly 1024 on MI-200) to get total FLOPs, and normalize by time.
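The suggested hand calculation can be sketched as follows; all inputs are hypothetical placeholders, and the 1024 FLOPs/op factor is the value tentatively quoted above:

```python
# Sketch of the hand calculation described above; inputs are placeholders,
# not values taken from any real profile.
def total_bf16_gflops(mfma_bf16_per_wave, waves, duration_ns, flops_per_op=1024):
    total_ops = mfma_bf16_per_wave * waves   # total BF16 MFMA ops across all waves
    total_flops = total_ops * flops_per_op   # ops -> FLOPs (1024 FLOPs/op assumed)
    return total_flops / duration_ns         # FLOPs per ns == GFLOPS

# e.g. 2 ops/wave, 4 waves, 14880 ns dispatch:
print(round(total_bf16_gflops(2, 4, 14880), 2))  # 0.55
```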
@coleramos425 -- I noticed a separate issue where the values in the above section can only be normalized by the # of waves:
Can you open something to track?
Here are the instruction mix statistics:
Here are the wavefront statistics:
What metric do I multiply MFMA-BF16 with from wavefront statistics?
2 MFMA-BF16 ops/wave * 4 wavefronts * 1024 FLOPs/BF16 op = 8192 BF16 FLOPs
8192 / 14880 ns = 0.55 GFLOPS. The Speed of Light panel shows 4.4 GFLOPS, which seems exactly 8x higher. How is that metric in Speed of Light calculated?
Not sure what's happening here. If you can share the pmc_perf.csv file that's being generated in workloads/<project name>/mi200, perhaps @coleramos425 can take a look.
Attaching the pmc_perf.csv.
This is the command I used to run:
omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0
This dispatch has a duration of 14880 ns and a SQ_INSTS_VALU_MFMA_MOPS_BF16 of 128.
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (EndNs - BeginNs) = 128 * 512 / 14880 = 65536 / 14880 = 4.4 GFLOPS
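A quick numeric check of the arithmetic quoted above (just the raw numbers, not the metric's actual implementation):

```python
# Verify: 128 MFMA BF16 instructions * 512 FLOPs/instr over a 14880 ns dispatch.
counter = 128        # SQ_INSTS_VALU_MFMA_MOPS_BF16
duration_ns = 14880  # EndNs - BeginNs
gflops = counter * 512 / duration_ns  # FLOPs per ns == GFLOPS
print(round(gflops, 1))  # 4.4
```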
@xiaomin-amd claims the arithmetic is correct. I'll forward this along and let him weigh in.
Please dump the matching disassembly for analysis.
Can someone please educate me on how to dump the disassembly? I am using Omniperf with rocblas-bench.
Also, it's really surprising to see an 8x8x8 GEMM executing 64K FLOPs. That's 64x higher than what I am specifying!
I am sorry, that README specifies how I can generate an objdump of a compiled application where I have the executable. But how do I use this with rocblas-bench?
It sort of depends on how rocblas-bench is set up. If the kernels are compiled into rocblas-bench, you should be able to do a roc-obj -d /path/to/rocblas/bench to get the ISA for all the kernels there. It is possible they're loaded at runtime from another library, though, in which case you can use the same command on that library, or set the HIP environment variable export GPU_DUMP_CODE_OBJECT=1 to dump the kernels you use at runtime.
I followed these steps:
export GPU_DUMP_CODE_OBJECT=1
./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0
This resulted in the generation of a bunch of object files.
When I open them with the command
roc-obj -d _code_object0000.o
I get the following errors:
Error: No kernel section found
error: no executables specified
ping! Any further directions?
Not sure how much help I can provide outside of pointing you to the README for roc-obj: https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/obj_tooling.md
Looks to me like -d is supposed to take an executable, no?
@shaw586 please update the thread if you are still actively attempting to generate a dump. Otherwise, I will archive this issue.
Thank you.
Yes, I am trying to generate the dump. I've been bouncing around different places trying to get some help, but haven't had any luck yet. Currently waiting on the rocBLAS team to respond. If you can help find someone who knows how to dump the assembly while using rocBLAS, I would appreciate it!
Here is the assembly dump of the kernel being executed, TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx90a_m8_k8_n8_objdump.zip
Closing due to inactivity. Please re-open if you still have questions.
I have been using Omniperf to analyze some applications. I ran a simple 8x8x8 GEMM in BF16 data format using the following command line:
omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0
After running
omniperf analyze -p gemm_m8_k8_n8
I get the following output. The highlighted metric MFMA FLOPs (BF16) does not make sense. I expect 8x8x8x2 = 1024 FLOPs.
The kernel takes 14.8 us, see below.
So I expect 1024 / (14.8 * 1e-6) = 69.2 million FLOPS ~ 0.069 GFLOPS.
But I see 4.4 GFLOPS. How is this calculated?
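The expectation above can be reproduced with a quick sketch, assuming the conventional 2*m*n*k FLOP count for a GEMM and the 14.8 us kernel duration quoted above:

```python
# Expected GFLOPS for an 8x8x8 GEMM, assuming the usual 2*m*n*k FLOP count.
m = n = k = 8
flops = 2 * m * n * k             # 1024 FLOPs
duration_s = 14.8e-6              # kernel duration quoted above
gflops = flops / duration_s / 1e9
print(round(gflops, 3))  # 0.069, far below the reported 4.4
```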