ROCm / AMDMIGraphX

AMD's graph optimization engine.
https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/
MIT License

MI300: run_high_level_pipeline: Invalid MLIR created #3117

Open shivadbhavsar opened 3 months ago

shivadbhavsar commented 3 months ago

Error observed in TIMM torch benchmark model: eca_halonext26ts

Full message:

terminate called after throwing an instance of 'migraphx::version_2_10_0::exception'
  what():  /workspace/tm_benchmarks/AMDMIGraphX/src/targets/gpu/mlir.cpp:729: run_high_level_pipeline: Invalid MLIR created: Error: !migraphx.shaped type can't be laid out in memory when the stride 144 at index 1 does not evenly divide the previous stride 96
Error: failed to legalize operation 'func.func' that was explicitly marked illegal
Note: see current operation: 
"func.func"() <{function_type = (!migraphx.shaped<128x128x8x8xf16, 8192x64x8x1>, !migraphx.shaped<128x640x1x1x12x12xf16, 92160x144x96x8x12x1>) -> !migraphx.shaped<1024x1x64x144xf16, 9216x9216x144x1>, sym_name = "mlir_reshape_transpose_reshape_transpose_reshape_transpose_slice_transpose_dot"}> ({
^bb0(%arg0: !migraphx.shaped<128x128x8x8xf16, 8192x64x8x1>, %arg1: !migraphx.shaped<128x640x1x1x12x12xf16, 92160x144x96x8x12x1>):
  %0 = "migraphx.mlir.as.logical.shape"(%arg0) : (!migraphx.shaped<128x128x8x8xf16, 8192x64x8x1>) -> tensor<128x128x8x8xf16>
  %1 = "migraphx.mlir.as.logical.shape"(%arg1) : (!migraphx.shaped<128x640x1x1x12x12xf16, 92160x144x96x8x12x1>) -> tensor<128x640x1x1x12x12xf16>
  %2 = "tosa.reshape"(%0) <{new_shape = array<i64: 1024, 16, 1, 8, 1, 8>}> : (tensor<128x128x8x8xf16>) -> tensor<1024x16x1x8x1x8xf16>
  %3 = "tosa.const"() <{value = dense<[0, 1, 3, 5, 2, 4]> : tensor<6xi64>}> : () -> tensor<6xi64>
  %4 = "tosa.transpose"(%2, %3) : (tensor<1024x16x1x8x1x8xf16>, tensor<6xi64>) -> tensor<1024x16x8x8x1x1xf16>
  %5 = "tosa.reshape"(%4) <{new_shape = array<i64: 1024, 16, 64, 1>}> : (tensor<1024x16x8x8x1x1xf16>) -> tensor<1024x16x64x1xf16>
  %6 = "tosa.const"() <{value = dense<[0, 3, 2, 1]> : tensor<4xi64>}> : () -> tensor<4xi64>
  %7 = "tosa.transpose"(%5, %6) : (tensor<1024x16x64x1xf16>, tensor<4xi64>) -> tensor<1024x1x64x16xf16>
  %8 = "tosa.reshape"(%1) <{new_shape = array<i64: 1024, 80, 1, 144>}> : (tensor<128x640x1x1x12x12xf16>) -> tensor<1024x80x1x144xf16>
  %9 = "tosa.const"() <{value = dense<[0, 2, 3, 1]> : tensor<4xi64>}> : () -> tensor<4xi64>
  %10 = "tosa.transpose"(%8, %9) : (tensor<1024x80x1x144xf16>, tensor<4xi64>) -> tensor<1024x1x144x80xf16>
  %11 = "tosa.slice"(%10) <{size = array<i64: 1024, 1, 144, 16>, start = array<i64: 0, 0, 0, 0>}> : (tensor<1024x1x144x80xf16>) -> tensor<1024x1x144x16xf16>
  %12 = "tosa.const"() <{value = dense<[0, 1, 3, 2]> : tensor<4xi64>}> : () -> tensor<4xi64>
  %13 = "tosa.transpose"(%11, %12) : (tensor<1024x1x144x16xf16>, tensor<4xi64>) -> tensor<1024x1x16x144xf16>
  %14 = "tosa.reshape"(%7) <{new_shape = array<i64: 1024, 64, 16>}> : (tensor<1024x1x64x16xf16>) -> tensor<1024x64x16xf16>
  %15 = "tosa.reshape"(%13) <{new_shape = array<i64: 1024, 16, 144>}> : (tensor<1024x1x16x144xf16>) -> tensor<1024x16x144xf16>
  %16 = "tosa.matmul"(%14, %15) : (tensor<1024x64x16xf16>, tensor<1024x16x144xf16>) -> tensor<1024x64x144xf16>
  %17 = "tosa.reshape"(%16) <{new_shape = array<i64: 1024, 1, 64, 144>}> : (tensor<1024x64x144xf16>) -> tensor<1024x1x64x144xf16>
  %18 = "migraphx.mlir.as.underlying.shape"(%17) : (tensor<1024x1x64x144xf16>) -> !migraphx.shaped<1024x1x64x144xf16, 9216x9216x144x1>
  "func.return"(%18) : (!migraphx.shaped<1024x1x64x144xf16, 9216x9216x144x1>) -> ()
}) {arch = "gfx942:sramecc+:xnack-", enable_splitk_for_tuning = true, kernel = "mixr", num_cu = 304 : i64} : () -> ()

Aborted (core dumped)
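For context, the operand the verifier rejects is the second function argument: shape 128x640x1x1x12x12 with strides 92160x144x96x8x12x1. The complaint is that no nesting of the dimensions can produce these strides, because 144 is not a multiple of 96. Below is a toy sketch of that kind of pairwise divisibility check on the sorted strides; it is illustrative only, not the actual rocMLIR verifier, and all names are made up.

// Toy model of the nested-layout check the error message refers to.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// A permuted/sliced packed layout nests dimensions inside one another, so
// every stride, taken in decreasing order, is a multiple of the next one.
bool nested_layout(std::vector<int64_t> strides)
{
    std::sort(strides.begin(), strides.end(), std::greater<>{});
    for(std::size_t i = 1; i < strides.size(); ++i)
    {
        if(strides[i] != 0 and strides[i - 1] % strides[i] != 0)
            return false;
    }
    return true;
}

int main()
{
    // Strides of the rejected operand: 144 % 96 != 0, so no such layout exists.
    std::cout << std::boolalpha << nested_layout({92160, 144, 96, 8, 12, 1})
              << "\n"; // prints false
}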

To Repro:

migraphx-driver compile /mnt/nas_share/migraphx/models/torch_benchmarks/eca_halonext26ts/eca_halonext26ts.mxr

bpickrel commented 2 months ago

This appears to be rooted in an MIGraphX-only bug in the pad operator. The following test case in test_ref fails (ignore the gold value; the test throws an exception before the comparison is reached):


TEST_CASE(pad_test)
{
    migraphx::program p;
    auto* mm = p.get_main_module();
    // Rank-3 literal of ones with lengths {8192, 8, 23}
    migraphx::shape s{migraphx::shape::float_type, {8192, 8, 23}};
    std::vector<float> ones(s.elements(), 1.0f);
    auto l0 = mm->add_literal(migraphx::literal{s, ones});

    // pads = {begins..., ends...}: pad the last axis with one trailing element
    mm->add_instruction(migraphx::make_op("pad", {{"pads", {0, 0, 0, 0, 0, 1}}}), l0);
    p.compile(migraphx::make_target("ref"));
    auto result = p.eval({}).back();
    std::vector<float> results_vector(16);
    result.visit([&](auto output) { results_vector.assign(output.begin(), output.end()); });
    // Gold value left over from the original test_ref pad test; the exception
    // is thrown before this comparison is reached.
    std::vector<float> gold{0, 0, 0, 0, 0, 1, 2, 0, 0, 3, 4, 0, 0, 0, 0, 0};
    EXPECT(migraphx::verify::verify_rms_range(results_vector, gold));
}

It fails with the following message:

terminate called after throwing an instance of 'migraphx::version_1::exception'
  what():  /workspace/pt2_benchmarks/AMDMIGraphX/src/include/migraphx/operation.hpp:366: compute_op: Not computable: pad
Aborted (core dumped)
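For reference on the test above: MIGraphX's pad operator appears to follow the ONNX-style pads convention, read for a rank-3 input as {d0_begin, d1_begin, d2_begin, d0_end, d1_end, d2_end}, so {0, 0, 0, 0, 0, 1} appends one element on the last axis and the padded shape should be {8192, 8, 24}. A tiny standalone check of that arithmetic (the helper name is made up, not a MIGraphX API):

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: padded lengths from ONNX-style
// pads = {d0_begin, ..., dN_begin, d0_end, ..., dN_end}.
std::vector<std::size_t> padded_lens(const std::vector<std::size_t>& lens,
                                     const std::vector<std::size_t>& pads)
{
    std::vector<std::size_t> out(lens.size());
    for(std::size_t i = 0; i < lens.size(); ++i)
        out[i] = pads[i] + lens[i] + pads[i + lens.size()];
    return out;
}

int main()
{
    // {8192, 8, 23} padded with {0, 0, 0, 0, 0, 1} -> {8192, 8, 24}
    assert((padded_lens({8192, 8, 23}, {0, 0, 0, 0, 0, 1}) ==
            std::vector<std::size_t>{8192, 8, 24}));
}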
bpickrel commented 1 month ago

The replication code above is not a valid reproducer: that error occurred on server hyd-7c-ZT09-02 and does not reproduce elsewhere. It appears to be unrelated to this issue.

bpickrel commented 1 month ago

Note to self: the following reduced pass list in target::get_passes() (AMDMIGraphX/src/targets/gpu/target.cpp) reproduces the failure. All three of these passes must be present to trigger the error:

return
    {
        enable_pass(mlir_enabled(), fuse_mlir{&ctx}),
        lowering{&ctx, options.offload_copy},
        compile_ops{&ctx, options.exhaustive_tune},
    };
bpickrel commented 1 month ago

Note to self: adding the following to struct softmax_compiler : compiler

        // normalize a negative softmax axis to a non-negative index
        if(axis < 0)
            axis += inputs.front().ndim();

solves one bug, but failures continue when we revert to the full pass list (see above).

Addendum: a similar fix is needed in src/fuse_reduce.cpp at line 424.
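For reference, the change above is the usual negative-axis normalization: an axis given as -1 on a rank-4 input maps to 3. A minimal standalone illustration of the same arithmetic (hypothetical function, not MIGraphX code):

#include <cassert>
#include <cstddef>

// Map a possibly-negative axis into [0, ndim), mirroring the
// `if(axis < 0) axis += ndim;` pattern in the patch above.
std::size_t normalize_axis(int axis, std::size_t ndim)
{
    if(axis < 0)
        axis += static_cast<int>(ndim);
    assert(axis >= 0 and static_cast<std::size_t>(axis) < ndim);
    return static_cast<std::size_t>(axis);
}

int main()
{
    assert(normalize_axis(-1, 4) == 3); // softmax over the last axis
    assert(normalize_axis(2, 4) == 2);  // non-negative axes pass through
}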

bpickrel commented 1 month ago

I haven't verified that this fix will also work on an MI300; I'll get on that now.