ROCm / rocMLIR

129 stars 40 forks source link

Convolution hang/crash #1407

Closed pfultz2 closed 9 months ago

pfultz2 commented 9 months ago

This is trace from the crash that includes the mlir program and tuning problem and solution keys:

Problem: gfx90a:sramecc+:xnack- 110     conv -F 1 -f GNCHW -I HWNGC -O NGCHW -n 1 -c 512 -H 256 -W 256 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1
Benchmarking solution: 128,256,4,64,128,4,1,1
module {
  func.func @mlir_transpose_convolution_add(%arg0: !migraphx.shaped<1x512x256x256xf32, 0x1x0x0>, %arg1: !migraphx.shaped<256x256x1x512xf32, 131072x512x512x1>, %arg2: !migraphx.shaped<512x512x3x3xf32, 4608x9x3x1>) -> !migraphx.shaped<1x512x256x256xf32, 33554432x65536x256x1> attributes {arch = "", kernel = "mixr", num_cu = 0 : i64} {
    %0 = migraphx.transpose %arg1 {permutation = [2, 3, 0, 1]} : <256x256x1x512xf32, 131072x512x512x1> -> <1x512x256x256xf32, 512x1x131072x512>
    %1 = migraphx.convolution %0, %arg2 {dilation = [1, 1], group = 1 : i64, padding = [1, 1, 1, 1], padding_mode = 0 : i64, stride = [1, 1]} : <1x512x256x256xf32, 512x1x131072x512>, <512x512x3x3xf32, 4608x9x3x1> -> <1x512x256x256xf32, 33554432x65536x256x1>
    %2 = migraphx.add %1, %arg0 : <1x512x256x256xf32, 33554432x65536x256x1>, <1x512x256x256xf32, 0x1x0x0> -> <1x512x256x256xf32, 33554432x65536x256x1>
    return %2 : !migraphx.shaped<1x512x256x256xf32, 33554432x65536x256x1>
  }
}

DoD

krzysz00 commented 9 months ago

I confirm reproducability of a runtime memory fault on a gfx90a system.

(I also observe this is a fairly weird convolution where the input layout's something like HWNC but we've got a KCYX convolution, so we're hitting some very rarely exercised bits of code ... but why runtime memory fault, I haven't a clue)

krzysz00 commented 9 months ago

Note to the future, disabling collapseContiguousMerges() (that is, the following diff

diff --git a/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp b/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp
index 9344904be788..948ddc456760 100644
--- a/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp
+++ b/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp
@@ -815,6 +815,7 @@ int64_t mlir::rock::getMaxVectorizationForDatatype(

 ArrayAttr mlir::rock::collapseContiguousMerges(ArrayAttr transforms,
                                                ArrayRef<int64_t> outputShape) {
+  return transforms;
   ContiguousMergesMap contigousMerges =
       findContiguousGroups(transforms, outputShape);
   SmallVector<Attribute> newTransformMaps;

is a resolution. I wouldn't use it as The Solution unless debugging fails massively, but at least we've got a small set of functions to scrutinize for incorrect behavior under edge cases.

krzysz00 commented 9 months ago

Further update (patch is cooking, pending adding a test for this case):

The general shape of the buggy scenario looks like H, W, N = Merge{256, 256, 1}(gemmN, K = gemmM; permute (HKWN) to (NKHW); where the "hey, are we dividing things up only to multiply them back together again" code gets too optimistic about roping in unit dimensions that aren't actually in the middle of the contiguous region, leading to that first Merge being incorrectly rewritten to Merge{1, 1, 65536}, which then launches array accesses into the stratosphere. The correct rewrite, post-patch, is going to Merge{1, 65536, 1}, which does what we were wanting this whole time.