Hello, I found that when the input network has multiple convolution layers with pad operations, the memory allocated for the internal tensors is expanded to be larger than necessary, and the expansion accumulates and propagates from the output stage to the input stage.
As an example, assume a network has 3 convolution layers back to back. The python script is as follows:
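(A minimal sketch of such a script, reconstructed from the lowered IR below: a 1x1 conv followed by two 3x3 convs with padding 1 on a 1x64x56x56 input. The exact topi.nn.conv2d arguments may differ across TVM versions.)

import tvm
import topi

# input feature map (NCHW) and the three convolution weights (OIHW)
data = tvm.placeholder((1, 64, 56, 56), name="data")
w1 = tvm.placeholder((64, 64, 1, 1), name="w1")  # 1x1 conv, no padding
w2 = tvm.placeholder((64, 64, 3, 3), name="w2")  # 3x3 conv, padding 1
w3 = tvm.placeholder((64, 64, 3, 3), name="w3")  # 3x3 conv, padding 1

# three convolution layers back to back
conv1 = topi.nn.conv2d(data, w1, strides=1, padding=0, layout="NCHW")
conv2 = topi.nn.conv2d(conv1, w2, strides=1, padding=1, layout="NCHW")
conv3 = topi.nn.conv2d(conv2, w3, strides=1, padding=1, layout="NCHW")

# default schedule; print the lowered IR to inspect the pad_temp allocation
s = tvm.create_schedule(conv3.op)
print(tvm.lower(s, [data, w1, w2, w3, conv3], simple_mode=True))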
The output shows that a buffer "pad_temp" with the size [1*64*60*60] (= 230400 elements) is allocated and reused for the 3 layers. However, there is some wastage: the ideal size of "pad_temp" would be [1*64*58*58] (= 215296 elements).
// attr [pad_temp] storage_scope = "global"
allocate pad_temp[float32 * 230400]
produce pad_temp {
  for (i1, 0, 64) {
    for (i2, 0, 60) {
      for (i3, 0, 60) {
        if (likely((2 <= i2))) {
          if (likely((i2 < 58))) {
            if (likely((2 <= i3))) {
              if (likely((i3 < 58))) {
                pad_temp[((((i1*60) + i2)*60) + i3)] = placeholder[(((((i1*56) + i2)*56) + i3) + -114)]
              }
            }
          }
        }
      }
    }
  }
}
produce compute {
  for (ff, 0, 64) {
    for (yy, 0, 60) {
      for (xx, 0, 60) {
        compute[(((((ff*56) + yy)*56) + xx) + -114)] = 0.000000f
        for (rc, 0, 64) {
          if (likely((2 <= yy))) {
            if (likely((yy < 58))) {
              if (likely((2 <= xx))) {
                if (likely((xx < 58))) {
                  compute[(((((ff*56) + yy)*56) + xx) + -114)] = (compute[(((((ff*56) + yy)*56) + xx) + -114)] + (pad_temp[(((yy*60) + xx) + (rc*3600))]*placeholder[((ff*64) + rc)]))
                }
              }
            }
          }
        }
      }
    }
  }
}
produce pad_temp {
  for (i1, 0, 64) {
    for (i2, 0, 60) {
      for (i3, 0, 60) {
        if (likely((1 <= i2))) {
          if (likely((i2 < 59))) {
            if (likely((1 <= i3))) {
              if (likely((i3 < 59))) {
                pad_temp[((((i1*60) + i2)*60) + i3)] = tvm_if_then_else(((((2 <= i2) && (i2 < 58)) && (2 <= i3)) && (i3 < 58)), compute[(((((i1*56) + i2)*56) + i3) + -114)], 0.000000f)
              }
            }
          }
        }
      }
    }
  }
}
produce compute {
  for (ff, 0, 64) {
    for (yy, 0, 58) {
      for (xx, 0, 58) {
        compute[(((((ff*56) + yy)*56) + xx) + -57)] = 0.000000f
        for (rc, 0, 64) {
          for (ry, 0, 3) {
            for (rx, 0, 3) {
              if (likely((1 <= yy))) {
                if (likely((yy < 57))) {
                  if (likely((1 <= xx))) {
                    if (likely((xx < 57))) {
                      compute[(((((ff*56) + yy)*56) + xx) + -57)] = (compute[(((((ff*56) + yy)*56) + xx) + -57)] + (pad_temp[(((((yy*60) + xx) + (rc*3600)) + (ry*60)) + rx)]*placeholder[((((((ff*64) + rc)*3) + ry)*3) + rx)]))
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
produce pad_temp {
  for (i1, 0, 64) {
    for (i2, 0, 58) {
      for (i3, 0, 58) {
        pad_temp[((((i1*58) + i2)*58) + i3)] = tvm_if_then_else(((((1 <= i2) && (i2 < 57)) && (1 <= i3)) && (i3 < 57)), compute[(((((i1*56) + i2)*56) + i3) + -57)], 0.000000f)
      }
    }
  }
}
produce compute {
  for (ff, 0, 64) {
    for (yy, 0, 56) {
      for (xx, 0, 56) {
        compute[((((ff*56) + yy)*56) + xx)] = 0.000000f
        for (rc, 0, 64) {
          for (ry, 0, 3) {
            for (rx, 0, 3) {
              compute[((((ff*56) + yy)*56) + xx)] = (compute[((((ff*56) + yy)*56) + xx)] + (pad_temp[(((((yy*58) + xx) + (rc*3364)) + (ry*58)) + rx)]*placeholder[((((((ff*64) + rc)*3) + ry)*3) + rx)]))
            }
          }
        }
      }
    }
  }
}
With further inspection, we found that this is because, in the "InferBound" pass, when it infers the tensor domain in the function "PropBoundToInputs" (in bound.cc), it only uses the domains of the iteration variables and does not consider the constraints in the tvm_if_then_else condition. Therefore, the inferred bound is larger than needed. For example (pseudo TVM Halide IR):
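(A minimal 1-D sketch of the pattern, assuming A is a padded stage and B is the tensor whose bound is being inferred:)

produce A {
  for (i, 0, 59) {
    A[i] = tvm_if_then_else(((1 <= i) && (i < 58)), B[(i - 1)], 0.000000f)
  }
}

Here B is read at index b = (i - 1).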
b's range would be inferred as [-1, 57] (hence rebased to [0, 58]) by TVM, but b really is [0, 56] if the if_then_else condition is considered.
Since InferBound traverses the graph backwards, from the output stage to the input stage, the tensor allocated for the first stage (and possibly reused by the following stages) ends up larger than necessary, which is what the example shows.
This becomes a more concerning problem when the input network has a large number of layers with pad operations, because it introduces quite some memory wastage. And this type of network is really common (e.g., ResNet, VGG, etc.).
Kindly let me know whether my understanding is correct. I also wonder whether this problem has ever been encountered/observed when applying TVM to DNNs, and if so, how it was solved.
Currently, we have made a fix locally that intersects the condition of the if_then_else with the iteration variable domain when inferring the output tensor's bound. But I would really like to hear about your solutions.
Thanks