cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

No reuse dimension found in the body for tensor input #146

Open hecmay opened 4 years ago

hecmay commented 4 years ago

I am trying to reuse the input image of a conv2d layer in the LeNet example. The reuse_at primitive works fine with a placeholder input (i.e., input_image in the first conv2d). However, when the max-pooled result is passed to the second conv2d layer, no reuse pattern is found for it.

import heterocl as hcl
import hlib
import numpy as np

batch_size = 1000
qtype1 = hcl.Fixed(16, 14)
qtype2 = hcl.Fixed(16, 14)

def build_lenet(input_image, weight_conv1, weight_conv2,
                weight_fc1, weight_fc2, lenet):
    # first conv
    conv1 = hlib.nn.conv2d_nchw(input_image, weight_conv1, "conv1")
    tanh1 = hlib.nn.tanh(conv1, "tanh1")
    pool1 = hlib.nn.max_pool(tanh1, kernel=(2,2), stride=(2,2), name="pool1")
    # second conv
    conv2 = hlib.nn.conv2d_nchw(pool1, weight_conv2, name="conv2")
    tanh2 = hlib.nn.tanh(conv2, "tanh2")
    pool2 = hlib.nn.max_pool(tanh2, kernel=(2,2), stride=(2,2))
    # first fc
    flat = hlib.nn.flatten(pool2)
    fc1 = hlib.nn.dense(flat, weight_fc1)
    tanh3 = hlib.nn.tanh(fc1, "tanh3")
    # second fc
    fc2 = hlib.nn.dense(tanh3, weight_fc2)
    # loss
    return hlib.nn.softmax(lenet, fc2)

input_image = hcl.placeholder((batch_size, 1, 28, 28), "input_image")
weight_conv1 = hcl.placeholder((20, 1, 5, 5), "weight_conv1", qtype1)
weight_conv2 = hcl.placeholder((50, 20, 5, 5), "weight_conv2", qtype1)
weight_fc1 = hcl.placeholder((500, 800), "weight_fc1", qtype1)
weight_fc2 = hcl.placeholder((10, 500), "weight_fc2", qtype1)
lenet = hcl.placeholder((batch_size, 10), "lenet")
s = hcl.create_schedule([input_image, weight_conv1, weight_conv2,
    weight_fc1, weight_fc2, lenet], build_lenet)

s[build_lenet.conv1].compute_at(s[build_lenet.tanh1], build_lenet.tanh1.axis[3])
s.reuse_at(input_image, s[build_lenet.conv1], build_lenet.conv1.axis[0])
s.reuse_at(build_lenet.pool1._op, s[build_lenet.conv2], build_lenet.conv2.axis[1])
print(hcl.lower(s))

The error message is as follows:

check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
heterocl.tvm._ffi.base.TVMError: [14:12:42] src/pass/generate_reuse_buffer.cc:245: No reuse is found in axis nn
seanlatias commented 4 years ago

I think the problem here is the axis. According to the error message, it seems like the first reuse_at is incorrect: there is no reuse across the 0th dimension (i.e., the batch dimension), which makes sense.
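
For reference, reuse is normally found along a spatial axis where the stencil window slides between consecutive iterations. A minimal sketch of such a case (illustrative names only, not part of the LeNet example):

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")
F = hcl.placeholder((3, 3), "F")

def conv(A, F):
    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    return hcl.compute((8, 8),
        lambda y, x: hcl.sum(A[y+r, x+c] * F[r, c], axis=[r, c]), "B")

s = hcl.create_schedule([A, F], conv)
# reuse along the width axis: consecutive x iterations share two of the three input columns
s.reuse_at(A, s[conv.B], conv.B.axis[1])
print(hcl.lower(s))

Along axis 0 of conv1 (the batch dimension), consecutive iterations read disjoint images, so there is nothing to reuse.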

hecmay commented 4 years ago

That makes sense. Actually, reuse_at does not error out with a placeholder input (i.e., the first primitive) even when it is asked to find a reuse pattern at the batch level. The error message comes from the second reuse_at primitive, where the input is an intermediate tensor. Both do work at the height or width level, as sketched below.
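
Both of the following should lower without the reuse error (a sketch against the LeNet schedule above; assumes axis[3] of each conv stage is the width axis of the NCHW output):

s.reuse_at(input_image, s[build_lenet.conv1], build_lenet.conv1.axis[3])
s.reuse_at(build_lenet.pool1._op, s[build_lenet.conv2], build_lenet.conv2.axis[3])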

zhangzhiru commented 4 years ago

I suggest we leave this issue open so we know we are missing support for non-unit-stride stencils. We should also document all the existing limitations of each customization primitive.

zhangzhiru commented 4 years ago

Also, do we report an error message for reuse_at() when there is no reuse opportunity?

seanlatias commented 4 years ago

My previous answer was wrong, so I deleted it. This issue is caused precisely by the lack of reuse opportunities, not by a non-unit stride. This limitation is already documented in our online documentation. You can see it here.

zhangzhiru commented 4 years ago

Good to know. But does the compiler emit a proper error when reuse_at does not apply? Also, is there a fundamental challenge that prevents us from supporting non-unit strides?

seanlatias commented 4 years ago

For the first question, as you can see from the error message in the first post, it clearly states that axis nn has no reuse opportunity. For other types of limitations, the compiler emits different messages. For the second question, the answer is no; we just need more engineering effort.

hecmay commented 4 years ago

Another limitation: reuse_at does not take effect when combined with the compute_at primitive. For example, consider the following snippet.

s[conv2].compute_at(s[tanh2], tanh2.axis[3])  #  combine CONV with tanh  
s.reuse_at(pool1._op, s[conv2], conv2.axis[2]) # linebuffer at index y 

Here I want to merge the conv2d stage conv2 into the activation stage tanh2 with compute_at, and then reuse the max-pooled input from the previous stage (i.e., pool1, the max-pooled output of conv1). The generated IR is not as expected (the reuse buffer is allocated but never used, and no error message is thrown):

// attr [pool1.reuse] storage_scope = "global"
allocate pool1.reuse[int32 * 1]
// attr [tanh2] storage_scope = "global"
allocate tanh2[int32 * 1000 * 50 * 8 * 8]
produce tanh2 {
  // attr [0] extern_scope = 0
  for "app_name"="tanh" (args, 0, 1000) {
    for (args0, 0, 50) {
      for (args1, 0, 8) {
        for (args2, 0, 8) {
          // attr [conv2] storage_scope = "global"
          allocate conv2[int32 * 1 * 1 * 1 * 1]
          produce conv2 {
            // attr [0] extern_scope = 0
            // attr [reducer2] storage_scope = "global"
            allocate reducer2[float32 * 1]
            produce reducer2 {
              // attr [0] extern_scope = 0
              reducer2[0] = 0.000000f
            }
            for (ra5, 0, 20) {
              for (ra6, 0, 5) {
                for (ra7, 0, 5) {
                  reducer2[0] = (float32((int48(pool1[((((args2 + ra7) + ((args1 + ra6)*12)) + (ra5*144)) + (args*2880))])*fixed48_14(weight_conv2[(((ra7 + (ra6*5)) + (ra5*25)) + (args0*500))]))) + reducer2[0])
                }
              }
            }
            conv2[0] = int32(reducer2[0])
          }
          tanh2[(((args2 + (args1*8)) + (args0*64)) + (args*3200))] = int32(tanh(float64(conv2[0])))
        }
      }
    }
  }
}

If I apply reuse_at first and then compute_at, the program crashes with a segfault; that ordering is sketched below.
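
A sketch of the crashing order, using the same two calls as above, just swapped:

s.reuse_at(pool1._op, s[conv2], conv2.axis[2])  # create the reuse buffer first
s[conv2].compute_at(s[tanh2], tanh2.axis[3])    # then merging the stages segfaults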