NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

revise max allowed vectorization factor #2519

Open liqiangxl opened 2 months ago

liqiangxl commented 2 months ago

getVectorizationFactor returns the minimum vectorization factor over all vectorizable_inputs_outputs. In #2146, the 3 inter-segment fp32 tensors written out by the 1st outer reduction kernel limit the max vectorization factor to 4. The scheduler should be able to use a vectorization factor of 8 for the other fp16 inputs and outputs. For fp32 outputs (and inputs, if any exist), we can reduce the factor to 4 and do the read/write twice.
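To make the arithmetic concrete, here is a minimal standalone sketch; it is not nvFuser code. `IoTensor`, `maxFactorFor`, `minFactor`, `perTensorFactors`, and `accessesNeeded` are hypothetical names, and the sketch assumes a 16-byte (128-bit) limit per vectorized access while ignoring alignment and extent divisibility, which the real analysis would also have to respect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a vectorizable input/output tensor.
struct IoTensor {
  std::string name;
  int64_t dtype_size_bytes;  // 2 for fp16, 4 for fp32
};

// Assumption: one vectorized access moves at most 16 bytes (128 bits).
constexpr int64_t kMaxVectorBytes = 16;

// Largest factor a single tensor allows, based on dtype size only.
int64_t maxFactorFor(const IoTensor& tv) {
  assert(tv.dtype_size_bytes > 0);
  return kMaxVectorBytes / tv.dtype_size_bytes;
}

// Behaviour described in the issue: one factor, the minimum over all tensors.
// For an empty list this returns the byte limit, which is only a placeholder.
int64_t minFactor(const std::vector<IoTensor>& tvs) {
  int64_t factor = kMaxVectorBytes;
  for (const auto& tv : tvs) {
    factor = std::min(factor, maxFactorFor(tv));
  }
  return factor;
}

// Proposed direction: keep a max factor per tensor instead of a single min.
std::unordered_map<std::string, int64_t> perTensorFactors(
    const std::vector<IoTensor>& tvs) {
  std::unordered_map<std::string, int64_t> factors;
  for (const auto& tv : tvs) {
    factors[tv.name] = maxFactorFor(tv);
  }
  return factors;
}

// Number of vectorized reads/writes a tensor needs when the scheduler works
// on scheduler_factor elements at a time but the tensor only supports
// tv_factor elements per access (ceiling division).
int64_t accessesNeeded(int64_t scheduler_factor, int64_t tv_factor) {
  assert(tv_factor > 0);
  return (scheduler_factor + tv_factor - 1) / tv_factor;
}
```

With two fp16 tensors and one fp32 tensor, `minFactor` yields 4, while `perTensorFactors` yields 8 for the fp16 tensors and 4 for the fp32 tensor. With a scheduler factor of 8, `accessesNeeded(8, 4)` is 2, which corresponds to the "R/W twice" case for fp32.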

liqiangxl commented 1 month ago

General Plan
(1) Extend getVectorizationFactor to also return unordered_map<IO Tv, Tv's max vectorization factor>
(2) For each scheduler do the following