liqiangxl opened this issue 3 weeks ago
Some exploratory observations: added a redundant reshape before ln bwd.
```python
# reshape T1 to a 3D tensor, multiply by 1, then reshape back to 2D
# this reshape still allows the scheduler to project the buffer to inputs,
# but the current heuristics disable the projection, which leads to lower performance
G0 = fd.define_scalar(256, dtype=DataType.Int)
C0 = fd.ops.div(T0.size(1), G0)
V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
T1 = fd.ops.reshape(T1, new_shape=V1)
S1 = fd.define_scalar(1.0, dtype=DataType.Float)
T1 = fd.ops.mul(T1, S1)
T1 = fd.ops.reshape(T1, new_shape=V2)
```
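For context, a minimal, self-contained sketch of how such a fragment could sit inside a full `FusionDefinition`. This is not the actual ln bwd fusion used in the experiment; the input shapes, dtypes, and the reduction/broadcast body after the reshapes are assumptions added only to make it runnable:

```python
import torch
from nvfuser import FusionDefinition, DataType

with FusionDefinition() as fd:
    # assumed inputs: two contiguous 2D fp16 tensors (e.g. activation and grad)
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Half)
    T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Half)

    # redundant reshape from the snippet above: 2D -> 3D, multiply by 1, 3D -> 2D
    G0 = fd.define_scalar(256, dtype=DataType.Int)
    C0 = fd.ops.div(T0.size(1), G0)
    V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
    V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
    T1 = fd.ops.reshape(T1, new_shape=V1)
    S1 = fd.define_scalar(1.0, dtype=DataType.Float)
    T1 = fd.ops.mul(T1, S1)
    T1 = fd.ops.reshape(T1, new_shape=V2)

    # stand-in for the ln bwd body: an inner reduction plus a pointwise use of the
    # pre-reduction tensor, so T4 becomes a persistent-buffer candidate
    T2 = fd.ops.cast(T1, dtype=DataType.Float)
    T3 = fd.ops.cast(T0, dtype=DataType.Float)
    T4 = fd.ops.mul(T2, T3)
    T5 = fd.ops.sum(T4, [1])
    T6 = fd.ops.broadcast_in_dim(T5, shape=V2, broadcast_dims=[0])
    T7 = fd.ops.sub(T4, T6)
    fd.add_output(T7)

t0 = torch.randn(2048, 8192, device="cuda", dtype=torch.float16)
t1 = torch.randn(2048, 8192, device="cuda", dtype=torch.float16)
out, = fd.execute([t0, t1])
```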
Tested performance on top of #3223
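For reference, a rough way to time such a fusion; this is only an illustrative harness, not the benchmark setup from #3223, and it reuses the `fd`, `t0`, `t1` from the sketch above:

```python
import torch

def time_fusion(fd, inputs, iters=100):
    # warm-up runs so compilation and caching are excluded from the measurement
    for _ in range(3):
        fd.execute(inputs)
    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fd.execute(inputs)
    stop.record()
    torch.cuda.synchronize()
    return start.elapsed_time(stop) / iters  # average ms per iteration

# e.g. time_fusion(fd, [t0, t1]) with the fd/t0/t1 defined in the sketch above
```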
Results indicate that: (1) we should project to inputs to achieve higher performance, as long as the view ops don't interfere with the reductions; (2) if we can't project to inputs, using smem persistent is still faster than register persistent (in the plot, green markers are faster than yellow ones).
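The comment doesn't spell out why projection wins, but one common reason in persistent scheduling is buffer size: inputs are typically fp16 while intermediates are fp32, so projecting the persistent buffer back to an input roughly halves the bytes that must stay live per row. A back-of-envelope calculation (the hidden size and dtypes here are assumptions, not from the measurement):

```python
hidden = 8192                       # assumed hidden size per row
fp16_bytes, fp32_bytes = 2, 4

# projected to inputs: hold the fp16 input row live and recompute intermediates
projected_bytes = hidden * fp16_bytes
# not projected: hold the fp32 intermediate row live instead
unprojected_bytes = hidden * fp32_bytes

print(projected_bytes, unprojected_bytes)  # 16384 vs 32768 bytes per buffer per row
```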
(1) After inner persistent buffers are stored in shared memory, there are still bank conflicts if the persistent buffer is NOT projected to inputs (see the bank-conflict sketch after this list), due to two reasons:
(2) Can we project to inputs when there are view ops?
(3) If we can't project to inputs, is using smem persistent still faster than register persistent?
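For reference on point (1): a bank conflict means multiple lanes of a warp read different 4-byte words that map to the same one of the 32 shared-memory banks. The toy calculation below is not tied to nvFuser's actual smem layout; it only shows how the access stride alone sets the worst-case replay factor:

```python
from collections import Counter

def worst_case_conflict(elem_bytes, stride_elems, warp=32, banks=32, bank_bytes=4):
    # byte address touched by each lane, assuming lane i reads element i * stride
    addrs = [lane * stride_elems * elem_bytes for lane in range(warp)]
    bank_of = [(a // bank_bytes) % banks for a in addrs]
    # replay factor = the largest number of lanes that landed in one bank
    return max(Counter(bank_of).values())

print(worst_case_conflict(elem_bytes=4, stride_elems=1))   # 1  -> conflict free
print(worst_case_conflict(elem_bytes=4, stride_elems=32))  # 32 -> 32-way conflict
```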