This PR plumbs through an option to skip the call to `canScheduleCompileTime` in `Schedule::canSchedule`, allowing us to avoid this check when getting heuristics for new dynamic shapes. As mentioned in https://github.com/NVIDIA/Fuser/issues/3419#issuecomment-2479956772, this gives us a sizeable speedup in most cases.
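As a rough illustration of the control flow, here is a minimal, self-contained sketch. `SchedulerChecks` and the `skip_compile_time_checks` parameter name are hypothetical stand-ins chosen for this example and are not taken from the nvFuser sources. The real `Schedule::canSchedule` takes different arguments.

```cpp
#include <functional>

// Hypothetical stand-in for one scheduler's checks; not nvFuser's real types.
struct SchedulerChecks {
  std::function<bool()> compile_time;  // does not depend on concrete input sizes
  std::function<bool()> run_time;      // depends on the concrete input sizes
};

// Illustrative counterpart of Schedule::canSchedule with the new option.
// When skip_compile_time_checks is true, the compile-time check is not called.
bool canSchedule(
    const SchedulerChecks& checks,
    bool skip_compile_time_checks = false) {
  if (!skip_compile_time_checks && !checks.compile_time()) {
    return false;
  }
  return checks.run_time();
}
```

In this sketch the default keeps the existing behavior, so only a caller that passes `true` explicitly (for example, when getting heuristics for a new dynamic shape) skips the compile-time check.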
Before this PR:
After this PR:
This is related to #3419, but I will refrain from closing that issue until we address the many-segments latency. EDIT: the steady host latency for many-segments is 340 us, so getting dynamic latency down to 1400 us makes it about 4x steady (1400/340 ≈ 4.1). The other two tests come in slightly lower: many pointwise ops (steady=43 us, dynamic=135 us) and adaptive layernorm (steady=71 us, dynamic=222 us) are both about 3.1x. So in general we now have dynamic latency of about 3x to 4x steady latency.
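For reference, the dynamic-to-steady ratios can be recomputed directly from the numbers quoted above. The snippet below is only a small arithmetic sketch. The struct and names are made up for illustration, and the latencies are the ones reported in this description, not new measurements.

```cpp
// Host latencies in microseconds, as reported in this PR description.
struct LatencySample {
  const char* name;
  double steady_us;
  double dynamic_us;
};

constexpr LatencySample kSamples[] = {
    {"many segments", 340.0, 1400.0},
    {"many pointwise ops", 43.0, 135.0},
    {"adaptive layernorm", 71.0, 222.0},
};

// Ratio of dynamic-shape latency to steady-state latency for one test.
constexpr double latencyRatio(const LatencySample& s) {
  return s.dynamic_us / s.steady_us;
}
```

With these figures, the pointwise and layernorm cases come out at roughly 3.1x, and many-segments at roughly 4.1x.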
Fixes #3419