apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[TIR] parallel primitive #14906

Open uslumt opened 1 year ago

uslumt commented 1 year ago

Hello,

We are trying to implement multi-core behavior in a TensorIR schedule; however, the following error occurs when using the parallel primitive on a loop dimension.

Expected behavior

Loops should be split over multiple compute units (cores) and executed in parallel.

Actual behavior

Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.

Environment

tvm - 0.9.dev0

Steps to reproduce

    @tvm.script.ir_module
    class Convolution:
        @T.prim_func
        def main(inpt: T.handle, kernl: T.handle, reslt: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})

            input = T.match_buffer(inpt, (10, 3, 128, 128), "float32")
            kernel = T.match_buffer(kernl, (2, 3, 3, 3), "float32")
            result = T.match_buffer(reslt, (10, 2, 124, 124), "float32")
            result_compute = T.match_buffer(reslt, (10, 2, 124, 124), "float32")

            for b, o, h, w in T.grid(10, 2, 124, 124):
                for kc, kh, kw in T.grid(3, 3, 3):
                    with T.block("compute"):
                        b, o, h, w, kc, kh, kw = T.axis.remap("RRRRRRR", [b, o, h, w, kc, kh, kw])
                        result_compute[b, o, h, w] += input[b, kc, h+kh, w+kw] * kernel[o, kc, kh, kw]

            for b, o, h, w in T.grid(10, 2, 124, 124):
                with T.block("result"):
                    vb = T.axis.reduce(10, b)
                    vc_o = T.axis.reduce(2, o)
                    vh = T.axis.reduce(124, h)
                    vw = T.axis.reduce(124, w)
                    result[vb, vc_o, vh, vw] = result_compute[vb, vc_o, vh, vw]

    written_ir = Convolution
    sch = tvm.tir.Schedule(written_ir)

    b_i, o_i, h_i, w_i, kc_i, kh_i, kw_i = sch.get_loops(sch.get_block("compute"))
    sch.parallel(b_i)
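
The error is raised by the sch.parallel call itself. A small driver sketch (hypothetical, assuming only the Convolution module defined above) that surfaces it for inspection:

    import tvm

    sch = tvm.tir.Schedule(Convolution)
    b_i = sch.get_loops(sch.get_block("compute"))[0]
    try:
        sch.parallel(b_i)  # fails: "compute" is neither a complete nor a reduction block
    except Exception as err:  # TVM reports this as a ScheduleError
        print(err)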

Best regards

cc @Hzfengsy @junrushao @quic-sanirudh @shingjan

yzh119 commented 1 year ago

Hi @uslumt, the reason you cannot parallelize the loop is that the compute block is not a reduction block.

A reduction block should have an init statement that indicates how to initialize the output buffer. In your case, to make compute a reduction block, you need to write a T.init inside the block body:

    for b, o, h, w, kc, kh, kw in T.grid(10, 2, 124, 124, 3, 3, 3):
        with T.block("compute"):
            vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("SSSSRRR", [b, o, h, w, kc, kh, kw])
            with T.init():
                result_compute[vb, vo, vh, vw] = T.float32(0)
            result_compute[vb, vo, vh, vw] = result_compute[vb, vo, vh, vw] + input[vb, vkc, vh+vkh, vw+vkw] * kernel[vo, vkc, vkh, vkw]

Another issue with your program is that you annotate all block axes as "reduction", which is wrong: in the compute block, b, o, h, and w should be "spatial" axes. If you are not sure how to differentiate spatial and reduction axes, you can go through this tutorial first: https://mlc.ai/chapter_tensor_program/case_study.html.
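
For reference, a minimal sketch of how the corrected script could look with both fixes applied (spatial/reduction axes and T.init). It simplifies the original by writing into a single output buffer instead of the duplicate result_compute view of the same handle, and renames input to input_ to avoid shadowing the Python builtin:

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class ConvolutionFixed:
        @T.prim_func
        def main(inpt: T.handle, kernl: T.handle, reslt: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            input_ = T.match_buffer(inpt, (10, 3, 128, 128), "float32")
            kernel = T.match_buffer(kernl, (2, 3, 3, 3), "float32")
            result = T.match_buffer(reslt, (10, 2, 124, 124), "float32")

            for b, o, h, w, kc, kh, kw in T.grid(10, 2, 124, 124, 3, 3, 3):
                with T.block("compute"):
                    # b/o/h/w index independent output elements -> spatial ("S");
                    # kc/kh/kw are summed over -> reduction ("R").
                    vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("SSSSRRR", [b, o, h, w, kc, kh, kw])
                    with T.init():
                        result[vb, vo, vh, vw] = T.float32(0)
                    result[vb, vo, vh, vw] = result[vb, vo, vh, vw] + input_[vb, vkc, vh + vkh, vw + vkw] * kernel[vo, vkc, vkh, vkw]

    sch = tvm.tir.Schedule(ConvolutionFixed)
    b_i, o_i, h_i, w_i, kc_i, kh_i, kw_i = sch.get_loops(sch.get_block("compute"))
    sch.parallel(b_i)  # legal now: b_i binds only to a spatial axis of a reduction block
    print(sch.mod.script())

With this change the outer batch loop can be distributed across cores, while the reduction loops stay serial inside each parallel worker.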