apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[TIR] parallel primitive #14906

Open uslumt opened 1 year ago

uslumt commented 1 year ago

Hello,

We are trying to implement multi-core behavior in a TensorIR schedule; however, the following error occurs when using the parallel primitive on a loop dimension.

Expected behavior

Loops should be split over multiple compute units (cores) and executed in parallel.

Actual behavior

Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.

Environment

tvm - 0.9.dev0

Steps to reproduce

    @tvm.script.ir_module
    class Convolution:
        @T.prim_func
        def main(inpt: T.handle, kernl: T.handle, reslt: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})

            input = T.match_buffer(inpt, (10, 3, 128, 128), "float32")
            kernel = T.match_buffer(kernl, (2, 3, 3, 3), "float32")
            result = T.match_buffer(reslt, (10, 2, 124, 124), "float32")
            result_compute = T.match_buffer(reslt, (10, 2, 124, 124), "float32")

            for b, o, h, w in T.grid(10, 2, 124, 124):
                for kc, kh, kw in T.grid(3, 3, 3):
                    with T.block("compute"):
                        b, o, h, w, kc, kh, kw = T.axis.remap("RRRRRRR", [b, o, h, w, kc, kh, kw])
                        result_compute[b, o, h, w] += input[b, kc, h+kh, w+kw] * kernel[o, kc, kh, kw]

            for b, o, h, w in T.grid(10, 2, 124, 124):
                with T.block("result"):
                    vb = T.axis.reduce(10, b)
                    vc_o = T.axis.reduce(2, o)
                    vh = T.axis.reduce(124, h)
                    vw = T.axis.reduce(124, w)
                    result[vb, vc_o, vh, vw] = result_compute[vb, vc_o, vh, vw]

    written_ir = Convolution
    sch = tvm.tir.Schedule(written_ir)

    b_i, o_i, h_i, w_i, kc_i, kh_i, kw_i = sch.get_loops(sch.get_block("compute"))
    sch.parallel(b_i)
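
The error is raised by the sch.parallel call itself. A small driver sketch (hypothetical, assuming only the Convolution module defined above) that surfaces it for inspection:

    import tvm

    sch = tvm.tir.Schedule(Convolution)
    b_i = sch.get_loops(sch.get_block("compute"))[0]
    try:
        sch.parallel(b_i)  # fails: "compute" is neither a complete nor a reduction block
    except Exception as err:  # TVM reports this as a ScheduleError
        print(err)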

Best regards

cc @Hzfengsy @junrushao @quic-sanirudh @shingjan

yzh119 commented 1 year ago

Hi @uslumt, the reason you cannot parallelize the loop is that the compute block is not a reduction block.

A reduction block should have an init statement that indicates how to initialize the output buffer. In your case, to make compute a reduction block, you need to write a T.init inside the block body:

    for b, o, h, w, kc, kh, kw in T.grid(10, 2, 124, 124, 3, 3, 3):
        with T.block("compute"):
            vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("SSSSRRR", [b, o, h, w, kc, kh, kw])
            with T.init():
                result_compute[vb, vo, vh, vw] = T.float32(0)
            result_compute[vb, vo, vh, vw] = result_compute[vb, vo, vh, vw] + input[vb, vkc, vh+vkh, vw+vkw] * kernel[vo, vkc, vkh, vkw]

Another issue with your program is that you annotate all block axes as "reduction", which is wrong: in the compute block, b, o, h, and w should be "spatial" axes. If you are not sure how to differentiate spatial and reduction axes, you can go through this tutorial first: https://mlc.ai/chapter_tensor_program/case_study.html.
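
For reference, a minimal sketch of how the corrected script could look with both fixes applied (spatial/reduction axes and T.init). It simplifies the original by writing into a single output buffer instead of the duplicate result_compute view of the same handle, and renames input to input_ to avoid shadowing the Python builtin:

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class ConvolutionFixed:
        @T.prim_func
        def main(inpt: T.handle, kernl: T.handle, reslt: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            input_ = T.match_buffer(inpt, (10, 3, 128, 128), "float32")
            kernel = T.match_buffer(kernl, (2, 3, 3, 3), "float32")
            result = T.match_buffer(reslt, (10, 2, 124, 124), "float32")

            for b, o, h, w, kc, kh, kw in T.grid(10, 2, 124, 124, 3, 3, 3):
                with T.block("compute"):
                    # b/o/h/w index independent output elements -> spatial ("S");
                    # kc/kh/kw are summed over -> reduction ("R").
                    vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("SSSSRRR", [b, o, h, w, kc, kh, kw])
                    with T.init():
                        result[vb, vo, vh, vw] = T.float32(0)
                    result[vb, vo, vh, vw] = result[vb, vo, vh, vw] + input_[vb, vkc, vh + vkh, vw + vkw] * kernel[vo, vkc, vkh, vkw]

    sch = tvm.tir.Schedule(ConvolutionFixed)
    b_i, o_i, h_i, w_i, kc_i, kh_i, kw_i = sch.get_loops(sch.get_block("compute"))
    sch.parallel(b_i)  # legal now: b_i binds only to a spatial axis of a reduction block
    print(sch.mod.script())

With this change the outer batch loop can be distributed across cores, while the reduction loops stay serial inside each parallel worker.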