apache / tvm

Open deep learning compiler stack for CPU, GPU, and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

Inline Reduction #902

Closed derisavi-huawei closed 5 years ago

derisavi-huawei commented 6 years ago

Consider the following code. As it is, it crashes during lower. If I comment out the line s[C].compute_inline(), lower no longer crashes.

import tvm
dtype = "float32"
def gemm_algorithm(M, N, P):
    k = tvm.reduce_axis((0, N), 'k')
    # The input is given in a 4D layout
    L3_A = tvm.placeholder(((M+15)//16, (N+15)//16, 16, 16), dtype, name = 'L3_A')
    L3_B = tvm.placeholder(((N+15)//16, (P+15)//16, 16, 16), dtype, name = 'L3_B')
    # We convert them to a proper 2D tensor so that we can define matmul on it
    A = tvm.compute((M, N), lambda y, x: L3_A[y//16, x//16, y%16, x%16], name="A")
    B = tvm.compute((N, P), lambda y, x: L3_B[y//16, x//16, y%16, x%16], name="B")
    C = tvm.compute(
        (M, P),
        lambda y, x: tvm.sum(A[y, k] * B[k, x], axis = k),
        name = 'C')
    # We finally convert the 2D layout back to 4D layout
    L3_C = tvm.compute(((M+15)//16, (P+15)//16, 16, 16),
                       lambda yo, xo, yi, xi: C[yo*16+yi, xo*16+xi],
                       name = "L3_C")
    return L3_A, L3_B, L3_C, A, B, C

def gemm():
    # The size of the input matrices
    M = 4096
    N = 256
    P = 8192
    L3_A, L3_B, L3_C, A, B, C = gemm_algorithm(M, N, P)

    # Default schedule
    s = tvm.create_schedule(L3_C.op)

    L2_A = s.cache_read(L3_A, "local", [A])
    L2_B = s.cache_read(L3_B, "local", [B])

    L2_C = s.cache_write(L3_C, "local")

    s[A].compute_inline()
    s[B].compute_inline()
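    # inlining the reduction stage C below is what triggers the crash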
    s[C].compute_inline()

    print(tvm.lower(s, [L3_A, L3_B, L3_C], simple_mode=True))

gemm()

The message I get during the crash is:

terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)

My first attempt at debugging: I tracked down the line that causes the core dump. It's https://github.com/dmlc/tvm/blob/1ae02f1e03510b742383fa70eeb72fdca8eaf397/src/schedule/message_passing.cc#L470, where the variable iv doesn't exist in bound_state. If I skip that piece of code, I get another crash at https://github.com/dmlc/tvm/blob/1ae02f1e03510b742383fa70eeb72fdca8eaf397/src/schedule/message_passing.cc#L479, where iv doesn't exist in value_map. I'm completely unfamiliar with this code, so any help is appreciated. @tqchen

derisavi-huawei commented 6 years ago

Here is one thought (I'm not 100% sure I'm correct): the main problem here is that we are trying to inline a reduction computation, which builds a computation tree that has a reduction in a non-root node. Does this make sense?

tqchen commented 6 years ago

OK, yes, you are right: a reduction cannot be inlined currently. It is fine to have a reduction at a non-root node, but we need to use compute_at at the innermost level to do so.
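
A minimal sketch of what that could look like for the reproducer above (the cache_read/cache_write and inline calls are omitted for brevity, and the choice of the innermost axis xi as the anchor is an assumption):

# sketch: anchor the reduction stage C inside L3_C's loop nest
# instead of inlining it
s = tvm.create_schedule(L3_C.op)
yo, xo, yi, xi = s[L3_C].op.axis
# compute C (including its reduction over k) at the innermost spatial axis
s[C].compute_at(s[L3_C], xi)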

derisavi-huawei commented 6 years ago

I didn't understand "we need to use compute_at at the innermost level to do so." Can you give an example? Do you intend to work on fully supporting reductions in non-root nodes? I have found that very beneficial for representing complicated data layouts.

tqchen commented 6 years ago

I checked your example again and can see what you mean about it being helpful to do so. The main problem, though, is that inlining the reduction makes the reduction stage unschedulable, which is not the intended behavior.

If there is a benefit to introducing such a notation, we can try to form a proposal and do it. The current limitation mainly has to do with the current prioritization of the project.

derisavi-huawei commented 6 years ago

The example above is a piece of a real project I'm working on. I'm currently working around the limitation by doing the inlining manually, which makes both the algorithm and the schedule more complicated and error-prone. In fact, during the manual inlining, I made a few mistakes that I found hard to debug and fix. If there is anything I can help with, please let me know.

tqchen commented 6 years ago

Can you give a concrete description of what you expect when you inline the reduction?

derisavi-huawei commented 6 years ago

I don't know what you mean by "can we schedule the reduction?", but here is how I manually inlined the reduction and the IR I got. The Python code above generates the following IR (without inlining C):

// attr [L3_A.local] storage_scope = "local"
allocate L3_A.local[float32 * 256 * 16 * 16 * 16]
// attr [L3_B.local] storage_scope = "local"
allocate L3_B.local[float32 * 16 * 512 * 16 * 16]
// attr [C] storage_scope = "global"
allocate C[float32 * 4096 * 8192]
// attr [L3_C.local] storage_scope = "local"
allocate L3_C.local[float32 * 256 * 512 * 16 * 16]
produce L3_A.local {
  for (ax0, 0, 256) {
    for (ax1, 0, 16) {
      for (ax2, 0, 16) {
        for (ax3, 0, 16) {
          L3_A.local[((((((ax0*16) + ax1)*16) + ax2)*16) + ax3)] = L3_A[((((((ax0*16) + ax1)*16) + ax2)*16) + ax3)]
        }
      }
    }
  }
}
produce L3_B.local {
  for (ax0, 0, 16) {
    for (ax1, 0, 512) {
      for (ax2, 0, 16) {
        for (ax3, 0, 16) {
          L3_B.local[((((((ax0*512) + ax1)*16) + ax2)*16) + ax3)] = L3_B[((((((ax0*512) + ax1)*16) + ax2)*16) + ax3)]
        }
      }
    }
  }
}
produce C {
  for (y, 0, 4096) {
    for (x, 0, 8192) {
      C[((y*8192) + x)] = 0.000000f
      for (k, 0, 256) {
        C[((y*8192) + x)] = (C[((y*8192) + x)] + (L3_A.local[(((((y/16)*256) + (((k/16)*16) + (y % 16)))*16) + (k % 16))]*L3_B.local[(((((x/16)*256) + (x % 16)) + ((k/16)*131072)) + ((k % 16)*16))]))
      }
    }
  }
}
produce L3_C.local {
  for (yo.c, 0, 256) {
    for (xo.c, 0, 512) {
      for (yi.c, 0, 16) {
        for (xi.c, 0, 16) {
          L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] = C[(((((yo.c*8192) + xo.c) + (yi.c*512))*16) + xi.c)]
        }
      }
    }
  }
}
produce L3_C {
  for (yo, 0, 256) {
    for (xo, 0, 512) {
      for (yi, 0, 16) {
        for (xi, 0, 16) {
          L3_C[((((((yo*512) + xo)*16) + yi)*16) + xi)] = L3_C.local[((((((yo*512) + xo)*16) + yi)*16) + xi)]
        }
      }
    }
  }
}

Now consider the following Python code, in which I have manually inlined C into L3_C in the gemm_algorithm function (only the diff is included here):

    # remove the assignment to C
    # Note how the simple matmul algorithm has become more complicated. You can imagine how much worse other,
    # more complicated algorithms can get, conflicting with Halide's and TVM's philosophy of keeping the
    # algorithm as close as possible to its original mathematical definition
    L3_C = tvm.compute(((M+15)//16, (P+15)//16, 16, 16),
                       lambda yo, xo, yi, xi: tvm.sum(A[yo*16+yi, k]*B[k, xo*16+xi], axis=k),
                       name = "L3_C")

The resulting lowered code is (again, only the diff is included here):

...
// allocate for buffer C is removed
...
// produce C is removed
// produce L3_C.local is replaced by the following
produce L3_C.local {
  for (yo.c, 0, 256) {
    for (xo.c, 0, 512) {
      for (yi.c, 0, 16) {
        for (xi.c, 0, 16) {
          L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] = 0.000000f
          for (k, 0, 256) {
            L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] = (L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] + (L3_A.local[(((((yo.c*256) + yi.c) + ((k/16)*16))*16) + (k % 16))]*L3_B.local[((((xo.c*256) + xi.c) + ((k/16)*131072)) + ((k % 16)*16))]))
          }
        }
      }
    }
  }
}
...
derisavi-huawei commented 6 years ago

Something I forgot to mention: in the Python code above, my schedule with and without inlining is trivial. In my real problem, I need to define a fairly complicated schedule, which becomes even more complicated when I do the manual inlining, because I need to modify the schedule accordingly.

tqchen commented 6 years ago

I get what you mean; this can indeed be helpful. What you want is more than the semantics of "inline". By default, inline will place the computation at the inline location, but we will only be able to schedule the final spatial axes. What you want is to "inherit" the reduction axis into the target stage as well.

What you want is along the direction of "dataflow rewriting", which we already have in cases like rfactor and cache_write (you can now reorder and split the axes of a tensor before you cache_write, and the cached data layout will change accordingly).
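
For reference, a minimal sketch of rfactor on the reproducer's reduction (the split factor of 16 is an arbitrary assumption):

# split the reduction axis of C and lift the inner part into a new,
# independently schedulable stage via rfactor
k = C.op.reduce_axis[0]
ko, ki = s[C].split(k, factor=16)  # factor 16 is an arbitrary choice
CF = s.rfactor(C, ki)              # CF computes partial sums over ki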

As a community project, we want to evolve the direction according to the needs and use cases of community members, so we definitely welcome proposals for new schedule primitives of this kind.

tqchen commented 5 years ago

Closing for now due to inactivity.