NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Tracks performance issues related to the inner-outer persistent scheduler #3272

Open · liqiangxl opened this issue 3 weeks ago

liqiangxl commented 3 weeks ago

(1) Even after inner persistent buffers are stored in shared memory, there are still bank conflicts when the persistent buffer is NOT projected to inputs, for two reasons:

(a) We are missing a cacheBefore to ensure a vectorized write to shared memory.
(b) Even with vectorized reads and writes, if the inputs are vectorized by 8, the innermost dim of the shared-memory buffer is 8 elements; for fp32 the buffer access can only be vectorized by 4 (16 bytes), which leaves a 2-way bank conflict (see the sketch below).
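
To make (b) concrete, here is a small standalone Python sketch (not nvFuser code) that counts bank conflicts under the usual shared-memory model: 32 four-byte banks, with 16-byte vector accesses served 8 threads per phase. The layout assumption is that each thread's innermost 8 fp32 elements are contiguous in shared memory.

    # Sketch: why an fp32 persistent buffer with innermost dim 8, written with
    # 4-wide (16-byte) vector accesses, ends up with a 2-way bank conflict.
    # Assumptions: 32 banks of 4 bytes each, conflicts counted among the 8
    # threads served in one 128-byte phase, each thread's 8 elements contiguous.
    from collections import Counter

    BYTES_PER_BANK = 4
    NUM_BANKS = 32
    INNER_DIM = 8    # innermost fp32 elements per thread
    ELEM_BYTES = 4   # fp32
    VEC_WIDTH = 4    # 16-byte vector access

    def banks_touched(thread_id, vec_idx):
        """Banks touched by one 16-byte vector access of a given thread."""
        base = thread_id * INNER_DIM * ELEM_BYTES + vec_idx * VEC_WIDTH * ELEM_BYTES
        return {((base + i * ELEM_BYTES) // BYTES_PER_BANK) % NUM_BANKS
                for i in range(VEC_WIDTH)}

    hits = Counter()
    for t in range(8):                  # the 8 threads of one phase
        for b in banks_touched(t, 0):   # their first vector access
            hits[b] += 1

    print("max threads per bank:", max(hits.values()))  # -> 2, i.e. a 2-way conflict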

(2) Can we project to inputs when there are view ops? (3) If we can't project to inputs, is smem persistent still faster than register persistent?

liqiangxl commented 3 weeks ago

Some exploratory observations: added a redundant reshape before ln bwd (layer norm backward).

    # reshape T1 to a 3D tensor, multiply by 1, then reshape back to 2D
    # this reshape still allows the scheduler to project the persistent buffer to inputs,
    # but the current heuristics disable the projection, which leads to lower performance
    G0 = fd.define_scalar(256, dtype=DataType.Int)
    C0 = fd.ops.div(T0.size(1), G0)
    V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
    V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
    T1 = fd.ops.reshape(T1, new_shape=V1)
    S1 = fd.define_scalar(1.0, dtype=DataType.Float)
    T1 = fd.ops.mul(T1, S1)
    T1 = fd.ops.reshape(T1, new_shape=V2)
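
For context, a minimal sketch of how a snippet like the one above could be wrapped in a FusionDefinition and run with the nvFuser Python frontend. The tensor shapes and the trailing reduction are placeholders, not the actual ln bwd fusion used for the measurements below.

    import torch
    from nvfuser import FusionDefinition, DataType

    # placeholder shapes; the real benchmark uses the ln bwd problem sizes
    t0 = torch.randn(8192, 2048, device="cuda")
    t1 = torch.randn(8192, 2048, device="cuda")

    with FusionDefinition() as fd:
        T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True],
                              dtype=DataType.Float)
        T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True],
                              dtype=DataType.Float)

        # the redundant reshape from the snippet above: 2D -> 3D, multiply by 1, back to 2D
        G0 = fd.define_scalar(256, dtype=DataType.Int)
        C0 = fd.ops.div(T0.size(1), G0)
        V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
        V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
        T1r = fd.ops.reshape(T1, new_shape=V1)
        S1 = fd.define_scalar(1.0, dtype=DataType.Float)
        T1r = fd.ops.mul(T1r, S1)
        T1r = fd.ops.reshape(T1r, new_shape=V2)

        # stand-in for the layer-norm-backward math that follows in the real fusion
        T2 = fd.ops.mul(T0, T1r)
        T3 = fd.ops.sum(T2, [1])
        fd.add_output(T3)

    out = fd.execute([t0, t1])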

Tested performance on top of #3223. [performance comparison figure attached]

The results indicate that: (1) We should project to inputs to achieve higher performance, as long as the view ops don't interfere with the reductions. (2) If we can't project to inputs, smem persistent is still faster than register persistent (green markers are faster than yellow markers).