The script below reproduces the issue. You may need to run it several times, because some of the decisions sampled by sample_perfect_tile happen to hide the problem.
from typing import Tuple

import tvm
from tvm import meta_schedule, te, topi

in_type = "float16"
out_type = "float16"
BS = 100
MM = 32
NN = 32
KK = 32


def create_batch_matmul(
    b: int = BS, m: int = MM, n: int = NN, k: int = KK, in_dtype: str = in_type, out_dtype: str = out_type
) -> Tuple[te.Tensor, te.Tensor, te.Tensor]:
    # A has shape (b, m, k) and B has shape (b, n, k).
    A = te.placeholder((b, m, k), name="A", dtype=in_dtype)
    B = te.placeholder((b, n, k), name="B", dtype=in_dtype)
    C = topi.nn.batch_matmul(A, B)
    return (A, B, C)


# Use the tensor-core schedule rules and postprocessors that trigger the failure.
space = meta_schedule.space_generator.PostOrderApply(
    sch_rules="cuda-tensorcore",
    postprocs="cuda-tensorcore",
)
database = meta_schedule.tune_tir(
    mod=te.create_prim_func(create_batch_matmul()),
    target=tvm.target.Target("cuda -arch=sm_89 -max_shared_memory_per_block=49152 -max_threads_per_block=1024"),
    max_trials_global=200,
    space=space,
    work_dir="./test_batch_matmul/",
)
The error log looks like the following.
3: tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}::operator()(tvm::tir::Stmt const&) const
at /hostShare/tools/tvm_all/tvm-dev/src/tir/ir/stmt_functor.cc:210
2: tvm::tir::ThreadBindingUnifier::VisitStmt_(tvm::tir::ForNode const*)
at /hostShare/tools/tvm_all/tvm-dev/src/tir/transforms/unify_thread_binding.cc:60
1: tvm::tir::ThreadBindingUnifier::VisitStmt_(tvm::tir::ForNode const*)
at /hostShare/tools/tvm_all/tvm-dev/src/tir/transforms/unify_thread_binding.cc:63
0: tvm::tir::Stmt tvm::tir::ThreadBindingUnifier::UnifyThreadBindingImpl<tvm::tir::ForNode>(tvm::tir::ForNode const*, tvm::tir::Var const&, tvm::tir::IterVar const&, tvm::Range const&)
at /hostShare/tools/tvm_all/tvm-dev/src/tir/transforms/unify_thread_binding.cc:112
File "/hostShare/tools/tvm_all/tvm-dev/src/support/parallel_for.cc", line 139
RuntimeError: parallel_for_dynamic error with [22:30:41] /hostShare/tools/tvm_all/tvm-dev/src/tir/transforms/unify_thread_binding.cc:112: Check failed: (ana.CanProveEqual(dom->extent, new_iter_var->dom->extent)) is false: ValueError: All loops that are bound to `threadIdx.y` should have the same extent. However, there are two loops with extent 12 and 4, which are not equal
The root cause is that the batch loop is treated the same way as the other two spatial loops, M and N: it is decomposed in the SSSRRSRS fashion. However, MultiLevelTilingTensorCoreNode::TransformIntermediateOutputLayout assumes that the two innermost S levels contain only segments of the M and N loops. As a result, the subsequent AddWriteReuseTensorCore adds a "wmma.accumulator" cache-write block, fuses loop vars that belong only to the M and N loops, and binds the fused loop to "threadIdx.y". But the earlier (and first) fused loop bound to "threadIdx.y" also contains a segment of the batch loop, so the two extents disagree.
The fix is simple: exclude the outer batch loop from the sample_perfect_tile process and fuse it into "blockIdx.y".
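To make the mismatch concrete, the following Python sketch models only the arithmetic of the fused "threadIdx.y" extents. It does not call TVM. The helper `fused_extent` and the tile factors are hypothetical, picked so that the products echo the 12 and 4 in the error message above; they are not read from a real trace.

```python
from math import prod


def fused_extent(tile_factors, loops):
    """Extent of a loop obtained by fusing one tile segment from each named loop."""
    return prod(tile_factors[name] for name in loops)


# Hypothetical tile factors at the level that gets bound to threadIdx.y.
# Before the fix, the batch loop is tiled like M and N, so it has a segment here too.
thread_y_factors = {"batch": 3, "m": 2, "n": 2}

# The main block fuses the batch, M and N segments before binding to threadIdx.y.
main_block_extent = fused_extent(thread_y_factors, ["batch", "m", "n"])

# The "wmma.accumulator" cache-write block fuses only the M and N segments.
cache_write_extent = fused_extent(thread_y_factors, ["m", "n"])

print(main_block_extent, cache_write_extent)  # the two extents differ

# After the fix, the batch loop is not tiled and goes entirely to blockIdx.y,
# so both blocks fuse the same M and N segments.
fixed_main_block_extent = fused_extent(thread_y_factors, ["m", "n"])
print(fixed_main_block_extent == cache_write_extent)
```

In this model the two extents only agree when the batch segment at that level happens to be 1, which would be consistent with the failure showing up only on some runs.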
I also tried a more complex strategy that decomposes the batch loop into SSS, binding the three segments to "blockIdx.y", "blockIdx.x" and "threadIdx.y" respectively, so that the two innermost S levels contain no batch-loop segment. However, this strategy performs worse on several typical workloads. I think that is because AddWriteReuseTensorCore reorders the inner loop vars across the batch-loop segment, which reduces data locality.
I have also pasted below the trace from before the fix (including the postproc trace). You can replay it line by line and print each loop extent to verify the inconsistency.
Hi @vinx13, could you help merge this PR? I know relax + dlight is becoming prevalent for LLM workloads, but many 'old' workloads like ours still rely heavily on relay + metaschedule.