apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] Unroll loop runs slowly #14756

Open. KuiliangL opened this issue 1 year ago

KuiliangL commented 1 year ago

Hello, these days I constructed some PrimFuncs and found that, in some situations, a loop with the unroll ForKind is much slower to build than loops with the other ForKinds. I tested the For statement with different ForKinds and a large loop extent. I am curious why the unrolled loop becomes so slow when the extent is large.
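
For reference, the integer kind values used in the reproduction scripts below correspond to the tir.ForKind enum; this quick check is not from the original report, just a reminder of the mapping:

from tvm import tir

# ForKind is an IntEnum, so the kind can be passed to tir.For as a plain integer.
print(int(tir.ForKind.SERIAL))      # 0
print(int(tir.ForKind.PARALLEL))    # 1
print(int(tir.ForKind.VECTORIZED))  # 2
print(int(tir.ForKind.UNROLLED))    # 3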

Expected behavior

Loops with different ForKinds should take roughly the same time to build.

Actual behavior

The unrolled loop was much slower than loops with the other ForKinds.

Environment

TVM 0.10dev0, git commit ee319d9d23c80091da9c4fb764b1e6d49d462714

Steps to reproduce

import tvm
from tvm import tir
import time

# Loop bounds: min = 32450, extent = 15000000 (a large extent).
c1 = tir.const(32450, 'uint32')
c2 = tir.const(15000000, 'uint32')
v1 = tir.Var('v1', 'uint32')

# The fourth argument is the ForKind: 3 = UNROLLED, 1 = PARALLEL, 0 = SERIAL.
for1 = tir.For(v1, c1, c2, 3, tir.Evaluate(1))  # unrolled
f1 = tir.PrimFunc([], for1)
for2 = tir.For(v1, c1, c2, 1, tir.Evaluate(1))  # parallel
f2 = tir.PrimFunc([], for2)
for3 = tir.For(v1, c1, c2, 0, tir.Evaluate(1))  # serial
f3 = tir.PrimFunc([], for3)

# Time how long tvm.build takes for each ForKind.
t0 = time.time()
tvm.build(f2)
print(time.time() - t0)
t1 = time.time()
tvm.build(f3)
print(time.time() - t1)
t2 = time.time()
tvm.build(f1)
print(time.time() - t2)
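
To see what explicit unrolling does to the loop body, the same kind of loop with a tiny extent can be pushed through tir.transform.UnrollLoop and printed; this matters here because the loops in the report have extents in the millions. This is a minimal sketch, not from the original report, assuming the pass expands loops marked UNROLLED:

import tvm
from tvm import tir

# A loop of extent 4 marked UNROLLED (ForKind value 3 above).
v = tir.Var('v', 'uint32')
small_for = tir.For(v, tir.const(0, 'uint32'), tir.const(4, 'uint32'),
                    tir.ForKind.UNROLLED, tir.Evaluate(1))
small_mod = tvm.IRModule({"main": tir.PrimFunc([], small_for)})

# The printed module should show the marked loop replaced by its unrolled body
# (one statement per iteration).
print(tir.transform.UnrollLoop()(small_mod))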


KuiliangL commented 1 year ago

The same slowness shows up if we apply the UnrollLoop pass directly:


import tvm
from tvm import tir

v1 = tir.Var('v1', 'uint32')
c1 = tir.const(905892654, 'uint32')   # loop min
c2 = tir.const(174155511, 'uint32')
f1 = tir.Cast('float32', c2)
ceil1 = tir.ceil(f1)
c3 = tir.Cast('uint32', ceil1)        # loop extent, roughly 1.7e8
c4 = tir.const(35, 'uint32')          # unused

v2 = tir.Var('v2', 'float32')
c5 = tir.const(0.187, 'float32')
let1 = tir.LetStmt(v2, c5, tir.Evaluate(0))
for1 = tir.For(v1, c1, c3, 3, let1)   # kind 3 = ForKind.UNROLLED
prim = tir.PrimFunc({}, for1)
b1 = tir.IntImm('bool', 4582)
if1 = tir.IfThenElse(tir.EQ(tir.And(b1, b1), tir.Sub(b1, b1)), for1, tir.Evaluate(tir.ret(c2)))
p2 = tir.PrimFunc({}, if1)
mod = tvm.IRModule({"main": p2})

# Applying the UnrollLoop pass on its own is already slow.
mod = tir.transform.UnrollLoop()(mod)
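
In case it helps anyone reproducing this, the UnrollLoop pass takes its options from the "tir.UnrollLoop" PassContext config (auto_max_step, auto_max_depth, auto_max_extent, explicit_unroll). A minimal sketch of passing them, continuing from the snippet above and assuming these option names are unchanged in this TVM version:

# Run the pass with explicit options instead of the defaults.
with tvm.transform.PassContext(config={"tir.UnrollLoop": {"auto_max_step": 16,
                                                          "explicit_unroll": True}}):
    mod = tir.transform.UnrollLoop()(mod)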