ROCm / AMDMIGraphX

AMD's graph optimization engine.
https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/
MIT License

MIGRAPHX_TRACE_MLIR should dump out the constructed MLIR before benchmarking / quick tuning #2332

Open krzysz00 opened 11 months ago

krzysz00 commented 11 months ago

Requirement:

It's hard to debug rocMLIR issues right now because the trace variable is only effective during the final run, but not during the benchmarking/quick tuning stage, which is when we might want a look at the generated MLIR so that we can, for example, work out why shufflenet is crashing.

What we have today

When there is an alleged issue with rocMLIR, we just run with MIGRAPHX_TRACE_MLIR=1 and get:


Benchmarking gpu::mlir_op: 36 configs
Fastest solution: 64,64,8,16,64,8,1,1

# bunch of migraphx module dumps

mlir_main:pointwise4:y1.0 = @param:y1.0 -> half_type, {1, 768, 768}, {589824, 768, 1}, target_id=0
mlir_main:pointwise4:x1 = @param:x1 -> half_type, {1, 384, 768}, {294912, 768, 1}, target_id=0
mlir_main:pointwise4:@2 = multibroadcast[out_lens={1, 768, 768},out_dyn_dims={}](mlir_main:pointwise4:y1.0) -> half_type, {1, 768, 768}, {589824, 768, 1}, target_id=0
mlir_main:pointwise4:y0 = @param:y0 -> half_type, {1, 12, 384, 64}, {294912, 24576, 64, 1}, target_id=0
mlir_main:pointwise4:@4 = transpose[permutation={0, 2, 1, 3}](mlir_main:pointwise4:y0) -> half_type, {1, 384, 12, 64}, {294912, 64, 24576, 1}, target_id=0
mlir_main:pointwise4:@5 = contiguous(mlir_main:pointwise4:@4) -> half_type, {1, 384, 12, 64}, {294912, 768, 64, 1}, target_id=0
mlir_main:pointwise4:@6 = reshape[dims={0, 0, 768}](mlir_main:pointwise4:@5) -> half_type, {1, 384, 768}, {294912, 768, 1}, target_id=0
mlir_main:pointwise4:@7 = contiguous(mlir_main:pointwise4:@2) -> half_type, {1, 768, 768}, {589824, 768, 1}, target_id=0
mlir_main:pointwise4:@8 = dot(mlir_main:pointwise4:@6,mlir_main:pointwise4:@7) -> half_type, {1, 384, 768}, {294912, 768, 1}, target_id=0
mlir_main:pointwise4:@9 = add(mlir_main:pointwise4:@8,mlir_main:pointwise4:x1) -> half_type, {1, 384, 768}, {294912, 768, 1}, target_id=0
mlir_main:pointwise4:@10 = @return(mlir_main:pointwise4:@9), target_id=0

# bunch of MLIR modules that correspond to above

module {
  func.func @mlir_transpose_reshape_dot_add(%arg0: tensor<1x384x768xf16>, %arg1: tensor<1x12x384x64xf16>, %arg2: tensor<1x768x768xf16>) -> tensor<1x384x768xf16> attributes {arch = "gfx90a:sramecc+:xnack-", kernel = "mixr", num_cu = 110 : i64} {
    %0 = migraphx.multibroadcast(%arg2) {out_dyn_dims = [], out_lens = [1, 768, 768]} : (tensor<1x768x768xf16>) -> tensor<1x768x768xf16>
    %1 = migraphx.transpose(%arg1) {permutation = [0, 2, 1, 3]} : (tensor<1x12x384x64xf16>) -> tensor<1x384x12x64xf16>
    %2 = migraphx.reshape(%1) {dims = [0, 0, 768]} : (tensor<1x384x12x64xf16>) -> tensor<1x384x768xf16>
    %3 = migraphx.dot(%2, %0) : (tensor<1x384x768xf16>, tensor<1x768x768xf16>) -> tensor<1x384x768xf16>
    %4 = migraphx.add(%3, %arg0) : (tensor<1x384x768xf16>, tensor<1x384x768xf16>) -> tensor<1x384x768xf16>
    return %4 : tensor<1x384x768xf16>
  }
}

Problems

P1) The benchmarking step is where the first compilation happens, and if there is a failure, we can't get a dump of the module before it is sent to MLIR.

P2)

Benchmarking gpu::mlir_op: 36 configs
Fastest solution: 64,64,8,16,64,8,1,1

If there is a runtime problem, we can't figure out which migraphx / MLIR module caused it, because once compilation is done, compiled_result just contains the code_object, without any reference back to what produced that code object.

Definition of Done

manupak commented 10 months ago

Updated the ticket @jerryyin @krzysz00

pfultz2 commented 10 months ago

Fix P2: We need another dump with MIGRAPHX_TRACE_MLIR=1 before it is benchmarked, along with the solution.

I don't think we can dump the MLIR after it's been compiled. The benchmarking only has the binaries.

What we can do is print out the solution key in the MLIR trace dumps, and then during benchmarking, with something like MIGRAPHX_TRACE_BENCHMARK=1, show the solution key it is trying to run. You can then cross-reference it to get the MLIR program.

manupak commented 10 months ago

I don't think we can dump the MLIR after it's been compiled. The benchmarking only has the binaries.

I don't understand why this is not possible. MIGraphX just needs to carry a reference back to the migraphx module until after the benchmarking, probably in compiled_result.

What we can do is print out the solution key in the MLIR trace dumps, and then during benchmarking, with something like MIGRAPHX_TRACE_BENCHMARK=1, show the solution key it is trying to run. You can then cross-reference it to get the MLIR program.

You are saying something like:

Benchmarking gpu::mlir_op: 36 configs
Problem : gfx90a:sramecc+:xnack-    110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 384 -n 2304 -k 768
Solution : 128,256,4,64,128,8,1,1
Solution : 128,256,2,64,128,8,1,1

Right?

Well, this is not good enough, because it is a one-to-many mapping from that problem key (+ solution) to MLIR programs.

Anyhow (correct me @jerryyin if I am wrong), I think we are trying to get to a point where the MIGraphX team can raise bugs (tickets) saying which MLIR program hangs/fails/is slow, rather than rocMLIR trying to debug the model failure starting from MIGraphX.

So,

Then you can cross reference it to get the MLIR program

If you think the above is doable by the MIGraphX team, I personally don't have issues with it; I just don't see how one could do that when multiple (different) MLIR programs can have the same problem key.

umangyadav commented 2 months ago

When you set MIGRAPHX_TRACE_BENCHMARKING=3, it prints the MLIR module before benchmarking:

https://github.com/ROCm/AMDMIGraphX/blob/3e65032266c013e840fd08a2f41623a822e1c538/src/targets/gpu/compile_ops.cpp#L207

https://github.com/ROCm/AMDMIGraphX/blob/3e65032266c013e840fd08a2f41623a822e1c538/src/targets/gpu/compile_ops.cpp#L86

https://github.com/ROCm/AMDMIGraphX/blob/3e65032266c013e840fd08a2f41623a822e1c538/src/targets/gpu/jit/mlir.cpp#L125

https://github.com/ROCm/AMDMIGraphX/blob/3e65032266c013e840fd08a2f41623a822e1c538/src/targets/gpu/jit/mlir.cpp#L207

umangyadav commented 2 months ago

For the split-k fusion logic part, it is not printing right now, but that can be added easily: https://github.com/ROCm/AMDMIGraphX/blob/3e65032266c013e840fd08a2f41623a822e1c538/src/targets/gpu/jit/mlir.cpp#L137

umangyadav commented 2 months ago

Fix P1: We need a dump with MIGRAPHX_TRACE_MLIR=1 of the migraphx module and MLIR module before it gets compiled for the first time. Fix P2: We need another dump with MIGRAPHX_TRACE_MLIR=1 before it is benchmarked, along with the solution.

We already have Fix P1, I think. We have a partial fix for P2.