Closed: derisavi-huawei closed this issue 5 years ago.
Here is one thought (I'm not 100% sure I'm correct): the main problem here is that we are trying to inline a reduction computation, which builds a computation tree that has a reduction in a non-root node. Does this make sense?
OK, yes, you are right: reductions cannot be inlined currently. It is fine to have a reduction at a non-root node, but we need to use compute_at at the innermost level to do so.
I didn't understand "but we need to use compute_at at the innermost level to do so." Can you give an example? Do you intend to work on fully supporting reductions in non-root nodes? I found it very beneficial when it comes to representing complicated data layouts.
I checked your example again and can see why it is helpful to do so. The main problem here, though, is that inlining the reduction makes the reduction stage unschedulable, which is not intended behavior.
If there is a benefit to introducing such a notation, we can try to form a proposal and do it. The current limitation mainly has to do with the current prioritization of the project.
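A minimal sketch of the compute_at workaround mentioned above (TVM 0.x-era API; the names and shapes are illustrative, not the code from this thread):

import tvm

n = 1024
A = tvm.placeholder((n, n), name="A")
k = tvm.reduce_axis((0, n), name="k")
C = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="C")  # reduction stage
D = tvm.compute((n,), lambda i: C[i] + 1, name="D")                  # consumer

s = tvm.create_schedule(D.op)
# s[C].compute_inline()              # not supported: C contains a reduce axis
s[C].compute_at(s[D], D.op.axis[0])  # instead, anchor C at D's innermost spatial axis
print(tvm.lower(s, [A, D], simple_mode=True))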
The example above is a piece of a real project I'm working on. I'm currently working around the limitation by doing the inlining manually which makes both the algorithm and the schedule more complicated and error-prone. In fact, during the manual inlining, I made a few mistakes that I found hard to debug and fix. If there is anything I can help with, please let me know.
Can you give a concrete description of what you expect when you inline the reduction?
I don't know what you mean by "can we schedule the reduction?". But here is how I manually inlined the reduction and got an IR.
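(The Python snippet itself did not survive the copy into this thread. Below is a hypothetical reconstruction, a sketch only: the shapes M=4096, K=256, P=8192, the blocked layouts, and the cache_read/cache_write schedule are inferred from the IR that follows, and the diff later in the thread suggests the author's actual version accessed the blocked tensors through helpers named A and B.)

import tvm

M, K, P = 4096, 256, 8192
# Blocked inputs inferred from the IR: L3_A[yo, ko, yi, ki], L3_B[ko, xo, ki, xi]
L3_A = tvm.placeholder((M//16, K//16, 16, 16), name="L3_A")
L3_B = tvm.placeholder((K//16, P//16, 16, 16), name="L3_B")
k = tvm.reduce_axis((0, K), name="k")
C = tvm.compute((M, P),
                lambda y, x: tvm.sum(L3_A[y//16, k//16, y % 16, k % 16] *
                                     L3_B[k//16, x//16, k % 16, x % 16],
                                     axis=k),
                name="C")
L3_C = tvm.compute((M//16, P//16, 16, 16),
                   lambda yo, xo, yi, xi: C[yo*16 + yi, xo*16 + xi],
                   name="L3_C")

s = tvm.create_schedule(L3_C.op)
s.cache_read(L3_A, "local", [C])
s.cache_read(L3_B, "local", [C])
s.cache_write(L3_C, "local")
# s[C].compute_inline()   # the manual-inlining question discussed in this thread
print(tvm.lower(s, [L3_A, L3_B, L3_C], simple_mode=True))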
The above Python code generates the following IR (without inlining C):
// attr [L3_A.local] storage_scope = "local"
allocate L3_A.local[float32 * 256 * 16 * 16 * 16]
// attr [L3_B.local] storage_scope = "local"
allocate L3_B.local[float32 * 16 * 512 * 16 * 16]
// attr [C] storage_scope = "global"
allocate C[float32 * 4096 * 8192]
// attr [L3_C.local] storage_scope = "local"
allocate L3_C.local[float32 * 256 * 512 * 16 * 16]
produce L3_A.local {
  for (ax0, 0, 256) {
    for (ax1, 0, 16) {
      for (ax2, 0, 16) {
        for (ax3, 0, 16) {
          L3_A.local[((((((ax0*16) + ax1)*16) + ax2)*16) + ax3)] = L3_A[((((((ax0*16) + ax1)*16) + ax2)*16) + ax3)]
        }
      }
    }
  }
}
produce L3_B.local {
  for (ax0, 0, 16) {
    for (ax1, 0, 512) {
      for (ax2, 0, 16) {
        for (ax3, 0, 16) {
          L3_B.local[((((((ax0*512) + ax1)*16) + ax2)*16) + ax3)] = L3_B[((((((ax0*512) + ax1)*16) + ax2)*16) + ax3)]
        }
      }
    }
  }
}
produce C {
  for (y, 0, 4096) {
    for (x, 0, 8192) {
      C[((y*8192) + x)] = 0.000000f
      for (k, 0, 256) {
        C[((y*8192) + x)] = (C[((y*8192) + x)] + (L3_A.local[(((((y/16)*256) + (((k/16)*16) + (y % 16)))*16) + (k % 16))]*L3_B.local[(((((x/16)*256) + (x % 16)) + ((k/16)*131072)) + ((k % 16)*16))]))
      }
    }
  }
}
produce L3_C.local {
  for (yo.c, 0, 256) {
    for (xo.c, 0, 512) {
      for (yi.c, 0, 16) {
        for (xi.c, 0, 16) {
          L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] = C[(((((yo.c*8192) + xo.c) + (yi.c*512))*16) + xi.c)]
        }
      }
    }
  }
}
produce L3_C {
  for (yo, 0, 256) {
    for (xo, 0, 512) {
      for (yi, 0, 16) {
        for (xi, 0, 16) {
          L3_C[((((((yo*512) + xo)*16) + yi)*16) + xi)] = L3_C.local[((((((yo*512) + xo)*16) + yi)*16) + xi)]
        }
      }
    }
  }
}
Now consider the following Python code, in which I have manually inlined C into L3_C in the gemm_algorithm function (I have included only the diff here):
# remove the assignment to C
# Note how the simple algorithm for matmul has become more complicated. You can
# imagine what other, more complicated algorithms turn into, conflicting with
# Halide and TVM's philosophy of keeping the algorithm as close as possible to
# the original mathematical definition.
L3_C = tvm.compute(((M+15)//16, (P+15)//16, 16, 16),
                   lambda yo, xo, yi, xi: tvm.sum(A[yo*16+yi, k]*B[k, xo*16+xi], axis=k),
                   name="L3_C")
The resulting lowered code is (again, I'll include only the diff here):
...
// allocate for buffer C is removed
...
// produce C is removed
// produce L3_C.local is replaced by the following
produce L3_C.local {
  for (yo.c, 0, 256) {
    for (xo.c, 0, 512) {
      for (yi.c, 0, 16) {
        for (xi.c, 0, 16) {
          L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] = 0.000000f
          for (k, 0, 256) {
            L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] = (L3_C.local[((((((yo.c*512) + xo.c)*16) + yi.c)*16) + xi.c)] + (L3_A.local[(((((yo.c*256) + yi.c) + ((k/16)*16))*16) + (k % 16))]*L3_B.local[((((xo.c*256) + xi.c) + ((k/16)*131072)) + ((k % 16)*16))]))
          }
        }
      }
    }
  }
}
...
Something I forgot to mention: in the Python code above, my schedule with and without inlining is trivial. In my real problem, I need to define a fairly complicated schedule, which becomes even more complicated when I do the manual inlining, because I need to modify the schedule accordingly.
I get what you mean. This can indeed be helpful. What you want is more than the semantics of "inline". By default, inline will place the computation at the inlined location, but we will only be able to schedule the final spatial axes. What you want is to also "inherit" the reduction axis into the target stage.
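As a rough sketch of the distinction (purely hypothetical; no such primitive exists in TVM):

s = tvm.create_schedule(L3_C.op)
# Plain inline would fold C's Reduce node into L3_C's body, but only
# L3_C's spatial axes (yo, xo, yi, xi) would remain schedulable.
# "Inheriting" the reduction would instead mean something like:
#   s[C].inline_into(s[L3_C])             # hypothetical primitive
#   k = s[L3_C].op.reduce_axis[0]         # L3_C gains C's reduce axis k
#   ko, ki = s[L3_C].split(k, factor=16)  # k is then schedulable on L3_C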
What you want is along the direction of "dataflow rewriting", which we already have in cases like rfactor and cache_write (you can now reorder and split the axes of a tensor before you cache_write, and the cached data layout will change accordingly).
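For reference, a minimal sketch of the existing rfactor rewrite (the standard TVM reduction pattern; names and shapes are illustrative), which already changes the dataflow in a comparable way:

import tvm

n = 1024
A = tvm.placeholder((n,), name="A")
k = tvm.reduce_axis((0, n), name="k")
B = tvm.compute((1,), lambda i: tvm.sum(A[k], axis=k), name="B")

s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ki)  # new intermediate stage; B becomes a sum of BF over ko
print(tvm.lower(s, [A, B], simple_mode=True))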
As a community project, we want to evolve the direction according to the needs and use-cases of community members, so we definitely welcome proposals for new schedule primitives of this kind.
Closing for now due to inactivity.
Consider the following code. As it is, it crashes during lower. If I comment out the line s[C].compute_inline(), lower doesn't crash anymore. The message I get during the crash is:
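(The code and the crash message did not survive the copy here; below is a minimal reproducer sketch of the same failure mode, assuming the crash comes from calling compute_inline on a reduction stage. Names and shapes are illustrative.)

import tvm

n = 64
A = tvm.placeholder((n, n), name="A")
k = tvm.reduce_axis((0, n), name="k")
C = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="C")
D = tvm.compute((n,), lambda i: C[i] * 2.0, name="D")

s = tvm.create_schedule(D.op)
s[C].compute_inline()                   # inlining a reduction stage
tvm.lower(s, [A, D], simple_mode=True)  # crashes here, per this thread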
My first attempt at debugging: I was able to get to the line that causes the core dump. It's https://github.com/dmlc/tvm/blob/1ae02f1e03510b742383fa70eeb72fdca8eaf397/src/schedule/message_passing.cc#L470 (the variable iv doesn't exist in bound_state). If I skip that piece of code, I get another crash at this line: https://github.com/dmlc/tvm/blob/1ae02f1e03510b742383fa70eeb72fdca8eaf397/src/schedule/message_passing.cc#L479 (the variable iv doesn't exist in value_map). I'm completely unfamiliar with this code, so any help is appreciated. @tqchen