cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0
326 stars 92 forks source link

[Schedule] support split #26

Closed Huyuwei closed 6 years ago

Huyuwei commented 6 years ago

This PR adds split schedule primitive. It has two modes:

Transform mode example

python code:

a = hcl.placeholder((10, 20), name="a")
b = hcl.placeholder((10, 20), name="b")
c = hcl.compute(a.shape, lambda i, j: a[i, j] + b[i, j], name="c")
s = hcl.create_schedule(c)
s[c].split(c.axis[1], factor=3, mode="transform")
ir = hcl.lower(s, [a, b, c])

generated IR:

produce c {
  // attr [0] extern_scope = 0
  for (i, 0, 10) {
    for (j.outer, 0, 7) {
      for (j.inner, 0, 3) {
        c[(((j.outer*3) + j.inner) + (i*20))] = int32((int34(a[(((j.outer*3) + j.inner) + (i*20))]) + int34(b[(((j.outer*3) + j.inner) + (i*20))])))
      }
    }
  }
}

Annotate mode example

python code:

s = hcl.create_schedule(c)
s[c].split(c.axis[1], factor=3, mode="annotate")
ir = hcl.lower(s, [a, b, c])

generated IR:

produce c {
  // attr [0] extern_scope = 0
  for (i, 0, 10) {
    for "split_factor"=3 (j, 0, 20) {
      c[(j + (i*20))] = int32((int34(a[(j + (i*20))]) + int34(b[(j + (i*20))])))
    }
  }
}

Do tiling with split and reorder

Now we can use split and reorder to do tiling:

a = hcl.placeholder((10, 20), name="a")
b = hcl.placeholder((10, 20), name="b")
c = hcl.compute(a.shape, lambda i, j: a[i, j] + b[i, j], name="c")
s = hcl.create_schedule(c)
yo, yi = s[c].split(c.axis[0], factor=3, mode="transform")
xo, xi = s[c].split(c.axis[1], factor=3, mode="transform")
s[c].reorder(yo, xo, yi, xi)
ir = hcl.lower(s, [a, b, c])

generated IR:

produce c {
  // attr [0] extern_scope = 0
  for (i.outer, 0, 4) {
    for (j.outer, 0, 7) {
      for (i.inner, 0, 3) {
        for (j.inner, 0, 3) {
          c[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))] = int32((int34(a[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))]) + int34(b[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))])))
        }
      }
    }
  }
}

Can add a wrapper to directly support tiling as a schedule primitive.

Huyuwei commented 6 years ago

@seanlatias There is an issue now: tvm's reorder schedule in some cases doesn't work for ExternOp. The solution is that I only keep the relation node and commented out all other code. Is it ok to do this? Is the schedule interface for ComputeOp already broken?

zhangzhiru commented 6 years ago

@Huyuwei can you provide more detail on what errors did you run into with ExternOp?

seanlatias commented 6 years ago

For your question, I cannot understand. Can you give me an example that causes the error? What do you mean by "TVM's reorder schedule"? Is that different from our reorder schedule? We are no longer using TVM's ComputeOp anymore so it does not matter if it is broken or not. We will finally remove all ComputeOp related stuff.

Huyuwei commented 6 years ago

The error happens here https://github.com/cornell-zhang/heterocl/blob/master/tvm/src/schedule/schedule_lang.cc#L295

temp2.emplace_back(self->iter_var_exprs[pos[i]]);

Using the tiling example above, after two splits, the size of pos is 4 (i.outer, i.inner, j.outer, j.inner), but the size of iter_var_exprs is 2 (i, j), so segmentation fault happens.

seanlatias commented 6 years ago

This is because you did not add the corresponding iter_var_exprs when performing splitting. This is where you should add those.

https://github.com/cornell-zhang/heterocl/blob/bfff12d373f6150a6f72c1b2c02ccf97222c6f25/tvm/src/schedule/schedule_lang.cc#L62

You can check my previous commits (such as Fuse and Reorder) to see how I added them.

seanlatias commented 6 years ago

The Fuse example can be seen here.

https://github.com/cornell-zhang/heterocl/blob/bfff12d373f6150a6f72c1b2c02ccf97222c6f25/tvm/src/schedule/schedule_lang.cc#L259

Huyuwei commented 6 years ago

@seanlatias Thanks, the bug caused by iter_var_exprs is fixed now.

Condition statement (if likely (...)) will be inserted when the split factor is not a divisor of the loop bound. I will let you know when I fix this part.

seanlatias commented 6 years ago

We don't need the "likely" for now. Just adding an IF statement is enough.

Huyuwei commented 6 years ago

Now IF stmt is inserted. Above split and reorder example will generate ir:

produce c {
  // attr [0] extern_scope = 0
  for (i.outer, 0, 4) {
    for (j.outer, 0, 7) {
      for (i.inner, 0, 3) {
        for (j.inner, 0, 3) {
          if (((i.outer*3) < (10 - i.inner))) {
            if (((j.outer*3) < (20 - j.inner))) {
              c[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))] = int32((int34(a[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))]) + int34(b[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))])))
            }
          }
        }
      }
    }
  }
}

See https://github.com/Huyuwei/heterocl/blob/schedule/heterocl/tests/test_schedule.py#L70-L86

Huyuwei commented 6 years ago

@seanlatias The issue of IF stmt position is fixed. Now consider the code:

  def popcount(A, B): # each element in A is a 32-bit integer
    with hcl.for_(0, A.shape[0], name="x") as x:
      with hcl.for_(0, A.shape[1], name="y") as y:
        B[x, y] = 0
        with hcl.for_(0, 32) as i:
          B[x, y] += A[x, y][i]

  A = hcl.placeholder((10, 20))
  B = hcl.placeholder(A.shape)
  with hcl.stage() as C:
    popcount(A, B)

  s = hcl.create_schedule(C)
  s[C].split(C.x, factor=3)
  ir = hcl.lower(s, [A, B])

It generates IR:

// attr [stage2] storage_scope = "global"
allocate stage2[int32 * 1]
produce stage2 {
  // attr [0] extern_scope = 0
  for (x.outer, 0, 4) {
    for (x.inner, 0, 3) {
      if (((x.outer*3) < (10 - x.inner))) {
        for (y, 0, 20) {
          placeholder1[(y + (((x.outer*3) + x.inner)*20))] = 0
          for (i, 0, 32) {
            placeholder1[(y + (((x.outer*3) + x.inner)*20))] = int32((int34(placeholder1[(y + (((x.outer*3) + x.inner)*20))]) + int34(placeholder0[(y + (((x.outer*3) + x.inner)*20))][i])))
          }
        }
      }
    }
  }
}

This is covered in test case: https://github.com/Huyuwei/heterocl/blob/schedule/heterocl/tests/test_imperative_code.py#L36-L45

seanlatias commented 6 years ago

@Huyuwei Can you briefly describe what you have changed in this update? Also, what's the purpose for "CollectIfCondition"?

Huyuwei commented 6 years ago

Now each IF stmt is associated with a For loop. When we reorder several For loops, associated IF stmt will also be reordered.

In the following code, condition_a is associated with i, and condition_b is associated with j.

for (i, 0, 10):
  if (condition_a):
    for (j, 0, 20):
      if(condition_b):
        body ...

If we reorder i and j, we will get:

for (j, 0, 20):
  if (condition_b):
    for (i, 0, 10):
      if(condition_a):
        body ...

CollectIfConditions is to build a map with loop_var as the key and condition as the value.

seanlatias commented 6 years ago

I don't think this is correct. You cannot associate each IF condition with a FOR loop. Only the IF conditions generated by the SPLIT schedule will be associated with a FOR loop. Again, we need to think the most complicated case, where the users can have lots of imperative statements inside a FOR loop. It is normal to write an IF condition inside the FOR loop. With your current algorithm, the conditions written by the users will be forced to move with loop reordering. Although it is not always safe to interchange loops, it is not our duty to check.

You also remove the check for whether the reorder can be performed or not. You need to add that back.

Huyuwei commented 6 years ago

@seanlatias Two updates:

  1. Now CollectIfConditions collect only IF stmt generated by the split schedule

  2. IF stmt is ensured to be below both inner and outer for loops after reorder.

Consider the code:

a = hcl.placeholder((10, 20), name="a")
b = hcl.placeholder((10, 20), name="b")
c = hcl.compute(a.shape, lambda i, j: a[i, j] + b[i, j], name="c")
s = hcl.create_schedule(c)
yo, yi = s[c].split(c.axis[0], factor=3, mode="transform")
xo, xi = s[c].split(c.axis[1], factor=3, mode="transform")
s[c].reorder(yi, xi, yo, xo)
ir = hcl.lower(s, [a, b, c])

The IR is:

produce c {
  // attr [0] extern_scope = 0
  for (i.inner, 0, 3) {
    for (j.inner, 0, 3) {
      for (i.outer, 0, 4) {
        if (((i.outer*3) < (10 - i.inner))) {
          for (j.outer, 0, 7) {
            if (((j.outer*3) < (20 - j.inner))) {
              c[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))] = int32((int34(a[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))]) + int34(b[(((j.outer*3) + j.inner) + (((i.outer*3) + i.inner)*20))])))
            }
          }
        }
      }
    }
  }
}
Huyuwei commented 6 years ago

@seanlatias I just realized that in reorder, the axes can be non-consecutive. Axes not in the reorder list will stay the same position. See the test case: https://github.com/cornell-zhang/heterocl/blob/80cb5f434622c63ec1d3687dd1424090559f7f8e/heterocl/tests/test_schedule.py#L53-L61