ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware
https://rocm.docs.amd.com/projects/omniperf/en/latest/
MIT License

omniperf analyze statistics does not match understanding #66

Closed. nishshah0 closed this issue 8 months ago.

nishshah0 commented 1 year ago

I have been using omniperf to analyze some applications. I ran a simple 8x8x8 GEMM in the BF16 data format using the following command line: omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0

After running omniperf analyze -p gemm_m8_k8_n8 I get the following output (screenshot attached).

The highlighted metric, MFMA FLOPs (BF16), does not make sense. I expect 8 x 8 x 8 x 2 = 1024 FLOPs.

The kernel takes 14.8 us (see the screenshot below).

So I expect 1024 / (14.8 * 1e-6) = 69.2 million FLOP/s, roughly 0.069 GFLOPS.

But I see 4.4 Gflops. How is this calculated?

coleramos425 commented 1 year ago

MFMA FLOPs (BF16) leverages SQ_INSTS_VALU_MFMA_MOPS_BF16. Its value is computed using

AVG((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (EndNs - BeginNs))

@arghdos may be able to provide a more detailed description of the counter
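
For reference, here is a minimal Python sketch of that metric, assuming the counter value and the BeginNs/EndNs timestamps for a single dispatch (the function name is illustrative, not part of omniperf):

def mfma_bf16_gflops(mops_bf16, begin_ns, end_ns, flops_per_instr=512):
    # MFMA FLOPs (BF16) = (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (EndNs - BeginNs)
    # FLOPs per nanosecond is numerically the same as GFLOP/s.
    return mops_bf16 * flops_per_instr / (end_ns - begin_ns)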

skyreflectedinmirrors commented 1 year ago

The 512 BF16 flops/instruction value in the MI-200 equation appears to be incorrect, at least according to https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/, but we'll need to double-check.

That said, @shaw586, assuming you did a single BF16 operation is probably incorrect. You need to take the value of MFMA-BF16 in the MFMA Arithmetic Instr Mix and multiply that by the number of waves launched to get the total number of BF16 operations. Then take the total number of BF16 ops, multiply by the FLOP/OP count (seemingly 1024 on MI-200) to get total FLOPs, and normalize by time.
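
A rough sketch of that bookkeeping, assuming per-wave counts from the MFMA Arithmetic Instr Mix panel and the 1024 FLOP/op figure mentioned above (the names are illustrative only):

def mfma_bf16_achieved_gflops(ops_per_wave, waves_launched, duration_ns, flops_per_op=1024):
    # Total BF16 MFMA ops = per-wave MFMA-BF16 count * number of waves launched.
    total_ops = ops_per_wave * waves_launched
    # Total FLOPs = ops * FLOPs per op; dividing FLOPs by ns gives GFLOP/s.
    return total_ops * flops_per_op / duration_ns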

@coleramos425 -- I noticed a separate issue where the values in the above section can only be normalized by the # of waves:

https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_analyze/configs/gfx90a/1000_compute-unit-instruction-mix.yaml#L171

Can you open something to track?

nishshah0 commented 1 year ago

This is the instruction mix statistics (screenshot attached).

This is the wavefront statistics (screenshot attached).

What metric from the wavefront statistics do I multiply MFMA-BF16 by?

skyreflectedinmirrors commented 1 year ago

2 MFMA-BF16 ops/wave * 4 wavefronts * 1024 FLOPs/BF16 op = 8192 BF16 FLOPs

nishshah0 commented 1 year ago

8192 / 14880 = 0.55 GFLOPS. The speed of light shows 4.4 GFLOPS, which seems exactly 8x higher. How is that metric in the speed of light calculated?
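
A quick check of that ratio with the numbers quoted in this thread:

# 2 MFMA-BF16 ops/wave * 4 waves * 1024 FLOPs/op over a 14880 ns dispatch
instr_mix_estimate = 2 * 4 * 1024 / 14880        # ~0.55 GFLOP/s
reported_sol_value = 4.4                         # GFLOP/s shown in the speed-of-light table
print(reported_sol_value / instr_mix_estimate)   # ~8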

skyreflectedinmirrors commented 1 year ago

How is that metric in the speed of light calculated?

https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_analyze/configs/gfx90a/0200_system-speed-of-light.yaml#L45

Not sure what's happening here. If you can share the pmc_perf.csv file that's generated in workloads/<project name>/mi200, perhaps @coleramos425 can take a look.

nishshah0 commented 1 year ago

Attaching the pmc_perf.csv file: pmc_perf.csv

This is the command I used: omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0

coleramos425 commented 1 year ago

This dispatch has a duration of 14880 ns (i.e. EndNs - BeginNs) and a SQ_INSTS_VALU_MFMA_MOPS_BF16 of 128.

(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (EndNs - BeginNs) = (128 * 512) / 14880 = 65536 / 14880 = 4.4 Gflops
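
For anyone who wants to reproduce this from the attached file, a sketch using pandas, assuming the per-dispatch columns in pmc_perf.csv are named after the counters and timestamps used in the formula (SQ_INSTS_VALU_MFMA_MOPS_BF16, BeginNs, EndNs) and the workloads/<project name>/mi200 layout mentioned above:

import pandas as pd

# Load the per-dispatch counter dump written by omniperf profile.
df = pd.read_csv("workloads/gemm_m8_k8_n8/mi200/pmc_perf.csv")
row = df.iloc[0]  # this workload has a single gemm dispatch
duration_ns = row["EndNs"] - row["BeginNs"]
gflops = row["SQ_INSTS_VALU_MFMA_MOPS_BF16"] * 512 / duration_ns
print(duration_ns, gflops)  # expect 14880 and ~4.4 for the file above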

@xiaomin-amd claims the arithmetic is correct. I'll forward this to let him weigh in.

xiaomin-amd commented 1 year ago

Please dump the matching disassembly for analysis.

nishshah0 commented 1 year ago

Can someone please educate me on how to dump the disassembly? I am using omniperf with rocblas-bench.

Also, it's really surprising to see an 8x8x8 gemm executing 64k FLOPs. That's 64x higher than what I am specifying!

skyreflectedinmirrors commented 1 year ago

https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/obj_tooling.md

nishshah0 commented 1 year ago

I am sorry, that readme describes how to generate an objdump of a compiled application where I have the executable. But how do I use this with rocblas-bench?

skyreflectedinmirrors commented 1 year ago

It sorta depends on how rocblas-bench is set up. If the kernels are compiled into rocblas-bench, you should be able to do a roc-obj -d /path/to/rocblas-bench to get the ISA for all the kernels there.

It is possible, though, that they're loaded at runtime from another library, in which case you can run the same command on that library, or set the HIP environment variable export GPU_DUMP_CODE_OBJECT=1 to dump the kernels you use at runtime.

nishshah0 commented 1 year ago

I followed these steps:

export GPU_DUMP_CODE_OBJECT=1
./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1 --device 0

This resulted in the generation of a bunch of object files (screenshot attached).

When I open them with the command roc-obj -d _code_object0000.o

I get the following error:

Error: No kernel section found
error: no executables specified

nishshah0 commented 1 year ago

ping! Any further directions?

coleramos425 commented 1 year ago

Not sure how much help I can provide outside of pointing you to the README for roc-obj https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/obj_tooling.md

Looks to me like -d is supposed to take an executable, no?

coleramos425 commented 1 year ago

@shaw586 please update the thread if you are still actively attempting to generate a dump. Otherwise, I will archive this issue.

Thank you.

nishshah0 commented 1 year ago

Yes, I am trying to generate the dump. I've been bouncing between different places to try to get some help, but haven't had any luck yet. Currently waiting on the rocBLAS team to respond. If you can help find someone who knows how to dump assembly while using rocBLAS, I would appreciate it!

nishshah0 commented 1 year ago

Here is the assembly dump of the kernel being executed: TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx90a_m8_k8_n8_objdump.zip

coleramos425 commented 8 months ago

Closing due to inactivity. Please re-open if you still have questions.