NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Tracks performance issues related to the inner-outer persistent scheduler #3272

Open · liqiangxl opened this issue 3 weeks ago

liqiangxl commented 3 weeks ago

(1) Even after inner persistent buffers are stored in shared memory, there are still bank conflicts when the persistent buffer is NOT projected to inputs, for two reasons:

(a) We are missing a cacheBefore to ensure a vectorized write to shared memory.
(b) Even with vectorized reads and writes, if the inputs are vectorized by 8, the innermost dim of the shared-memory buffer is 8 elements; for fp32 the buffer access can only be vectorized by 4 (16 bytes), which leaves a 2-way bank conflict (see the sketch below).
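
To make (b) concrete, here is a small standalone Python sketch (not nvFuser code) that counts bank conflicts under the usual shared-memory model: 32 four-byte banks, with 16-byte vector accesses served 8 threads per phase. The layout assumption is that each thread's innermost 8 fp32 elements are contiguous in shared memory.

    # Sketch: why an fp32 persistent buffer with innermost dim 8, written with
    # 4-wide (16-byte) vector accesses, ends up with a 2-way bank conflict.
    # Assumptions: 32 banks of 4 bytes each, conflicts counted among the 8
    # threads served in one 128-byte phase, each thread's 8 elements contiguous.
    from collections import Counter

    BYTES_PER_BANK = 4
    NUM_BANKS = 32
    INNER_DIM = 8    # innermost fp32 elements per thread
    ELEM_BYTES = 4   # fp32
    VEC_WIDTH = 4    # 16-byte vector access

    def banks_touched(thread_id, vec_idx):
        """Banks touched by one 16-byte vector access of a given thread."""
        base = thread_id * INNER_DIM * ELEM_BYTES + vec_idx * VEC_WIDTH * ELEM_BYTES
        return {((base + i * ELEM_BYTES) // BYTES_PER_BANK) % NUM_BANKS
                for i in range(VEC_WIDTH)}

    hits = Counter()
    for t in range(8):                  # the 8 threads of one phase
        for b in banks_touched(t, 0):   # their first vector access
            hits[b] += 1

    print("max threads per bank:", max(hits.values()))  # -> 2, i.e. a 2-way conflict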

(2) Can we project to inputs when there are view ops? (3) If we can't project to inputs, is smem persistent still faster than register persistent?

liqiangxl commented 3 weeks ago

Some exploratory observations: added a redundant reshape before ln bwd (layer norm backward).

    # reshape T1 to a 3D tensor, multiply by 1, then reshape back to 2D
    # this reshape still allows the scheduler to project the persistent buffer to inputs,
    # but the current heuristics disable the projection, which leads to lower performance
    G0 = fd.define_scalar(256, dtype=DataType.Int)
    C0 = fd.ops.div(T0.size(1), G0)
    V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
    V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
    T1 = fd.ops.reshape(T1, new_shape=V1)
    S1 = fd.define_scalar(1.0, dtype=DataType.Float)
    T1 = fd.ops.mul(T1, S1)
    T1 = fd.ops.reshape(T1, new_shape=V2)
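
For context, a minimal sketch of how a snippet like the one above could be wrapped in a FusionDefinition and run with the nvFuser Python frontend. The tensor shapes and the trailing reduction are placeholders, not the actual ln bwd fusion used for the measurements below.

    import torch
    from nvfuser import FusionDefinition, DataType

    # placeholder shapes; the real benchmark uses the ln bwd problem sizes
    t0 = torch.randn(8192, 2048, device="cuda")
    t1 = torch.randn(8192, 2048, device="cuda")

    with FusionDefinition() as fd:
        T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True],
                              dtype=DataType.Float)
        T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True],
                              dtype=DataType.Float)

        # the redundant reshape from the snippet above: 2D -> 3D, multiply by 1, back to 2D
        G0 = fd.define_scalar(256, dtype=DataType.Int)
        C0 = fd.ops.div(T0.size(1), G0)
        V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
        V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
        T1r = fd.ops.reshape(T1, new_shape=V1)
        S1 = fd.define_scalar(1.0, dtype=DataType.Float)
        T1r = fd.ops.mul(T1r, S1)
        T1r = fd.ops.reshape(T1r, new_shape=V2)

        # stand-in for the layer-norm-backward math that follows in the real fusion
        T2 = fd.ops.mul(T0, T1r)
        T3 = fd.ops.sum(T2, [1])
        fd.add_output(T3)

    out = fd.execute([t0, t1])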

Tested performance on top of #3223. [performance comparison figure attached]

The results indicate that: (1) We should project to inputs to achieve higher performance, as long as the view ops don't interfere with the reductions. (2) If we can't project to inputs, smem persistent is still faster than register persistent (green markers are faster than yellow markers).