apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] Relay program with broadcasting operators and `dense` cannot be compiled #11704

Open wzh99 opened 2 years ago

wzh99 commented 2 years ago

Expected behavior

The following Relay program should be successfully compiled:

def @main(%x0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */) -> Tensor[(1, 1, 2), float32] {
  %0 = maximum(%x0, %x1) /* ty=Tensor[(1, 2), float32] */;
  %1 = nn.dense(%x2, %w, units=None) /* ty=Tensor[(1, 2), float32] */;
  %2 = expand_dims(%0, axis=1) /* ty=Tensor[(1, 1, 2), float32] */;
  %3 = multiply(%0, %1) /* ty=Tensor[(1, 2), float32] */;
  add(%2, %3) /* ty=Tensor[(1, 1, 2), float32] */
}

Actual behavior

An error is reported during compilation:

Traceback (most recent call last):
  File "/Users/wzh/tvm-bug/bug_dense_bcast.py", line 16, in <module>
    lib = relay.build(mod, target='llvm')
  File "/Users/wzh/tvm-dev/python/tvm/relay/build_module.py", line 416, in build
    graph_json, runtime_mod, params = bld_mod.build(
  File "/Users/wzh/tvm-dev/python/tvm/relay/build_module.py", line 154, in build
    self._build(mod, raw_targets, executor, runtime, workspace_memory_pools, mod_name)
  File "/Users/wzh/tvm-dev/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
ValueError: Traceback (most recent call last):
  [bt] (8) 9   libtvm.dylib                        0x000000011400e502 tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*) + 4578
  [bt] (7) 8   libtvm.dylib                        0x0000000113ffd163 tvm::relay::tec::TECompilerImpl::Lower(tvm::relay::tec::CCacheKey const&, tvm::runtime::String) + 131
  [bt] (6) 7   libtvm.dylib                        0x0000000113ffd02e tvm::relay::tec::TECompilerImpl::Lower(tvm::relay::tec::CCacheKey const&, std::__1::function<tvm::runtime::String (tvm::runtime::String)>) + 110
  [bt] (5) 6   libtvm.dylib                        0x00000001140018e4 tvm::relay::tec::TECompilerImpl::LowerInternal(tvm::relay::tec::CCacheKey const&, std::__1::function<tvm::runtime::String (tvm::runtime::String)>) + 2596
  [bt] (4) 5   libtvm.dylib                        0x000000011401884d tvm::relay::tec::PrimFuncFor(tvm::relay::Function const&, tvm::Target const&, std::__1::function<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >)>) + 141
  [bt] (3) 4   libtvm.dylib                        0x000000011401a626 tvm::relay::tec::ScheduleBuilder::Create(tvm::relay::Function const&, std::__1::function<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >)>) + 7414
  [bt] (2) 3   libtvm.dylib                        0x00000001140f249a tvm::relay::OpImplementation::Schedule(tvm::Attrs const&, tvm::runtime::Array<tvm::te::Tensor, void> const&, tvm::Target const&) + 202
  [bt] (1) 2   libtvm.dylib                        0x00000001142c0b0d tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<TVMFuncCreateFromCFunc::$_2> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) + 109
  [bt] (0) 1   libtvm.dylib                        0x00000001142db328 tvm::runtime::Backtrace() + 24
  File "/Users/wzh/tvm-dev/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/Users/wzh/tvm-dev/python/tvm/relay/op/strategy/generic.py", line 51, in wrapper
    return topi_schedule(outs)
  File "/Users/wzh/tvm-dev/python/tvm/autotvm/task/topi_integration.py", line 242, in wrapper
    return topi_schedule(cfg, outs, *args, **kwargs)
  File "/Users/wzh/tvm-dev/python/tvm/topi/x86/dense.py", line 279, in schedule_dense_pack
    traverse_inline(s, outs[0].op, _callback)
  File "/Users/wzh/tvm-dev/python/tvm/topi/utils.py", line 81, in traverse_inline
    _traverse(final_op)
  File "/Users/wzh/tvm-dev/python/tvm/topi/utils.py", line 78, in _traverse
    _traverse(tensor.op)
  File "/Users/wzh/tvm-dev/python/tvm/topi/utils.py", line 78, in _traverse
    _traverse(tensor.op)
  File "/Users/wzh/tvm-dev/python/tvm/topi/utils.py", line 79, in _traverse
    callback(op)
  File "/Users/wzh/tvm-dev/python/tvm/topi/x86/dense.py", line 277, in _callback
    _schedule_dense_pack_template(cfg, s, op.output(0), outs[0])
  File "/Users/wzh/tvm-dev/python/tvm/topi/x86/dense.py", line 70, in _schedule_dense_pack_template
    y, x = s[O].op.axis
ValueError: too many values to unpack (expected 2)

Environment

macOS 12.4. Compiled using Clang 13.1.6 with LLVM support. TVM commit 8341e33.

Steps to reproduce

from tvm import relay, transform, IRModule

# Build the Relay graph: one dense surrounded by several broadcasting operators.
x0 = relay.var('x0', shape=(1, 2))
x1 = relay.var('x1', shape=(1, 2))
x2 = relay.var('x2', shape=(1, 2))
w = relay.var('w', shape=(2, 2))
y0 = relay.maximum(x0, x1)
y1 = relay.nn.dense(x2, w)
y2 = relay.expand_dims(y0, 1)
y3 = y0 * y1
y4 = y2 + y3
mod = IRModule.from_expr(y4)
mod = relay.transform.InferType()(mod)
print(mod)
# Compilation at opt_level=1 fails with the ValueError shown above.
with transform.PassContext(opt_level=1):
    lib = relay.build(mod, target='llvm')

wzh99 commented 2 years ago

This seems to be a minimal test case that reproduces this bug. If I further remove any expression nodes, the program compiles successfully. maximum, multiply and add in this test case can be replaced by other broadcasting operators and the bug still occurs.
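
For example, a variant along these lines (a sketch based on the reproduction script above; the replacement operators are chosen arbitrarily) should, per this observation, hit the same ValueError:

from tvm import relay, transform, IRModule

# Same structure as the reproduction script above, with maximum/multiply/add
# replaced by other broadcasting operators (minimum/add/subtract here).
x0 = relay.var('x0', shape=(1, 2))
x1 = relay.var('x1', shape=(1, 2))
x2 = relay.var('x2', shape=(1, 2))
w = relay.var('w', shape=(2, 2))
y0 = relay.minimum(x0, x1)
y1 = relay.nn.dense(x2, w)
y2 = relay.expand_dims(y0, 1)
y3 = relay.add(y0, y1)
y4 = relay.subtract(y2, y3)
mod = relay.transform.InferType()(IRModule.from_expr(y4))
with transform.PassContext(opt_level=1):
    lib = relay.build(mod, target='llvm')  # expected to fail in the same way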

wzh99 commented 2 years ago

@ganler I have reported all the bugs that I have discovered so far. Sorry for bombarding you with several bug issues, and thanks for your attention to the ones I have previously reported. Which bug(s) do you think we should fix first?

ganler commented 2 years ago

@wzh99 Thanks for improving TVM's robustness. Realistic cases should definitely be prioritized, as fixing them is more beneficial to the community. While the other bugs might be uncommon (adversarial) or avoidable, this one seems realistic: I found it rather hard to work around without violating the program's semantics.

I also spent some time looking into it. The layout mismatch happens during execution, between wrap_compute_dense(topi.x86.dense_pack) and wrap_topi_schedule(topi.x86.schedule_dense_pack).

https://github.com/apache/tvm/blob/705993e485a8c4b8a94a9c4d6e770c170b6fe1bc/python/tvm/relay/op/strategy/x86.py#L497

Note that this is a callback; its invocation happens on the C++ side, so the problem should be related to the transition between strategies.
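
For context, the strategy at that line pairs the compute with the schedule roughly as follows (a paraphrased, abbreviated sketch of python/tvm/relay/op/strategy/x86.py, not a verbatim excerpt):

# Paraphrased sketch of the x86 dense strategy registration; other
# implementations (e.g. dense_nopack and library-backed ones) are omitted.
@dense_strategy.register("cpu")
def dense_strategy_cpu(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_dense(topi.x86.dense_pack),           # compute: produces a rank-2 dense output
        wrap_topi_schedule(topi.x86.schedule_dense_pack),  # schedule: called with the outputs of the whole fused group
        name="dense_pack.x86",
        plevel=10,
    )
    return strategy

The compute side always yields a rank-2 tensor, while the schedule callback receives the outputs of the entire fused primitive function, which (per the analysis in the following comments) has rank 3 here; that is the mismatch the traceback above ends up hitting.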

wzh99 commented 2 years ago

@ganler Thanks for your investigation into this bug. I also looked into the compilation process of this test case and would like to share my observations as well.

The direct cause of this bug is here: https://github.com/apache/tvm/blob/ec918644ef01df81354bcf958f686e2b8863dac4/python/tvm/topi/x86/dense.py#L69-L70 The TOPI implementation of dense_pack.x86 assumes that s[O].op.axis has exactly two elements. However, for this test case, s[O].op.axis has three elements, so the unpacking y, x = s[O].op.axis raises a ValueError.

A quick fix for this bug is to modify the condition of the if-statement as follows:

# Only apply the extra schedule for the output stage when O is rank 2.
if C != O and len(s[O].op.axis) == 2:
    y, x = s[O].op.axis
    ...

The then-branch only performs additional transformations on the schedule. Without these transformations, the schedule is still valid (perhaps with some performance degradation), and at least there will be no ValueError.

However, I do not think the problem is completely resolved by this. In the TOPI implementation of dense_pack.x86, I found with my Python debugger that the rank of the symbolic tensor O is 3. I wonder why a rank-3 tensor is passed to the TOPI implementation of dense_pack.x86, which always outputs a rank-2 tensor.

Here is my analysis. This test case is compiled at optimization level 1, and one important optimization at this level is operator fusion. I think this rank mismatch is possibly caused by operator fusion. I printed out the fused Relay program below:

def @main(%x0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */) -> Tensor[(1, 1, 2), float32] {
  %4 = fn (%p0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p3: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, Primitive=1) -> Tensor[(1, 1, 2), float32] {
    %0 = maximum(%p0, %p1) /* ty=Tensor[(1, 2), float32] */;
    %1 = nn.dense(%p2, %p3, units=None) /* ty=Tensor[(1, 2), float32] */;
    %2 = expand_dims(%0, axis=1) /* ty=Tensor[(1, 1, 2), float32] */;
    %3 = multiply(%0, %1) /* ty=Tensor[(1, 2), float32] */;
    add(%2, %3) /* ty=Tensor[(1, 1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(1, 2), float32], Tensor[(1, 2), float32], Tensor[(2, 2), float32]) -> Tensor[(1, 1, 2), float32] */;
  %4(%x0, %x1, %x2, %w) /* ty=Tensor[(1, 1, 2), float32] */
}

It seems that the whole graph is fused into a single group, and the output of this group is a rank-3 tensor. Since a group is a single scheduling unit, I guess that is why dense_pack.x86 receives a rank-3 tensor as O.
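
For reference, a sketch of one way to reproduce the fused module for inspection (assuming the mod built in the reproduction script above; the exact pass pipeline inside relay.build may differ slightly):

from tvm import relay, transform

# Run only type inference and operator fusion, then print the fused module.
seq = transform.Sequential([
    relay.transform.InferType(),
    relay.transform.FuseOps(fuse_opt_level=1),
])
with transform.PassContext(opt_level=1):
    print(seq(mod))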

I also tried a simpler program with a dense and a broadcasting operator:

def @main(%x1: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] */, %x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */) -> Tensor[(1, 1, 2), float32] {
  %0 = nn.dense(%x2, %w, units=None) /* ty=Tensor[(1, 2), float32] */;
  add(%x1, %0) /* ty=Tensor[(1, 1, 2), float32] */
}

The fused version is:

def @main(%x1: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] */, %x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */) -> Tensor[(1, 1, 2), float32] {
  %0 = fn (%p01: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p11: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, Primitive=1) -> Tensor[(1, 2), float32] {
    nn.dense(%p01, %p11, units=None) /* ty=Tensor[(1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(2, 2), float32]) -> Tensor[(1, 2), float32] */;
  %1 = %0(%x2, %w) /* ty=Tensor[(1, 2), float32] */;
  %2 = fn (%p0: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] */, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1) -> Tensor[(1, 1, 2), float32] {
    add(%p0, %p1) /* ty=Tensor[(1, 1, 2), float32] */
  } /* ty=fn (Tensor[(1, 1, 2), float32], Tensor[(1, 2), float32]) -> Tensor[(1, 1, 2), float32] */;
  %2(%x1, %1) /* ty=Tensor[(1, 1, 2), float32] */
}

In this case, the dense and the broadcasting add are NOT fused together. I have no idea why they are fused in the original program; there is probably a bug in the fusion implementation. However, I am not familiar with its details. Perhaps I need some more time, and possibly help from those who know the implementation well, to fully understand this problem.

wzh99 commented 2 years ago

I have found a more surprising case of operator fusion. I exchanged the two inputs of both multiply and add in the original test case:

def @main(%x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, %x0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */) -> Tensor[(1, 1, 2), float32] {
  %0 = nn.dense(%x2, %w, units=None) /* ty=Tensor[(1, 2), float32] */;
  %1 = maximum(%x0, %x1) /* ty=Tensor[(1, 2), float32] */;
  %2 = multiply(%0, %1) /* ty=Tensor[(1, 2), float32] */;
  %3 = expand_dims(%1, axis=1) /* ty=Tensor[(1, 1, 2), float32] */;
  add(%2, %3) /* ty=Tensor[(1, 1, 2), float32] */
}

This program can be successfully compiled! No error is reported. I also checked the fused version of this program:

def @main(%x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, %x0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */) -> Tensor[(1, 1, 2), float32] {
  %1 = fn (%p01: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p11: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1) -> Tensor[(1, 2), float32] {
    maximum(%p01, %p11) /* ty=Tensor[(1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(1, 2), float32]) -> Tensor[(1, 2), float32] */;
  %3 = %1(%x0, %x1) /* ty=Tensor[(1, 2), float32] */;
  %4 = fn (%p02: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p12: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, %p2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1) -> Tensor[(1, 2), float32] {
    %2 = nn.dense(%p02, %p12, units=None) /* ty=Tensor[(1, 2), float32] */;
    multiply(%2, %p2) /* ty=Tensor[(1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(2, 2), float32], Tensor[(1, 2), float32]) -> Tensor[(1, 2), float32] */;
  %5 = %4(%x2, %w, %3) /* ty=Tensor[(1, 2), float32] */;
  %6 = fn (%p0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1) -> Tensor[(1, 1, 2), float32] {
    %0 = expand_dims(%p0, axis=1) /* ty=Tensor[(1, 1, 2), float32] */;
    add(%p1, %0) /* ty=Tensor[(1, 1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(1, 2), float32]) -> Tensor[(1, 1, 2), float32] */;
  %6(%3, %5) /* ty=Tensor[(1, 1, 2), float32] */
}

This time the nodes are not fused into a single group but into three. Note that I only exchanged the operands of two nodes, yet the results of operator fusion are very different.