NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Avoid canScheduleCompileTime for dynamic shape check #3436

Closed jacobhinkle closed 1 day ago

jacobhinkle commented 3 days ago

This PR plumbs through an option to skip the call to canScheduleCompileTime in Schedule::canSchedule, allowing us to avoid this check when getting heuristics for new dynamic shapes. As mentioned in https://github.com/NVIDIA/Fuser/issues/3419#issuecomment-2479956772, this gives us a sizeable speedup in most cases.
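For readers unfamiliar with the pattern, here is a minimal sketch of what "plumbing through an option to skip the compile-time check" can look like. It is not the nvFuser implementation: the types Fusion and RuntimeInfo, the flag name skip_compile_time_checks, the counter, and the bodies of both checks are hypothetical stand-ins. The sketch also assumes that the compile-time check depends only on the fusion itself and not on input sizes, so a fusion that already passed it does not need to be re-checked for a new set of shapes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; not the nvFuser types.
struct Fusion {
  int num_unsupported_ops = 0;
};

struct RuntimeInfo {
  std::vector<long> input_sizes;
};

// Counts how often the (assumed expensive) compile-time check runs.
int g_compile_time_checks = 0;

// Assumed to depend only on the structure of the fusion, not on shapes.
bool canScheduleCompileTime(const Fusion& fusion) {
  ++g_compile_time_checks;
  return fusion.num_unsupported_ops == 0;
}

// Depends on the concrete input sizes, so it must run for every new shape.
bool canScheduleRunTime(const Fusion& /*fusion*/, const RuntimeInfo& info) {
  for (long size : info.input_sizes) {
    if (size <= 0) {
      return false;
    }
  }
  return true;
}

// The flag defaults to false, so existing callers keep the full check.
bool canSchedule(
    const Fusion& fusion,
    const RuntimeInfo& info,
    bool skip_compile_time_checks = false) {
  if (!skip_compile_time_checks && !canScheduleCompileTime(fusion)) {
    return false;
  }
  return canScheduleRunTime(fusion, info);
}
```

In this sketch, the first call for a fusion would use the default and run both checks, while later calls for new shapes of the same fusion would pass true and run only the shape-dependent check.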

Before this PR: image

After this PR: image

This is related to #3419, but until we address the many-segments latency I will refrain from closing that issue. EDIT: the steady host latency for many-segments is 340 us, so getting dynamic latency down to 1400 us makes it about 4x steady (1400 / 340 ≈ 4.1). That is somewhat higher than the other two tests: many pointwise ops (steady=43 us, dynamic=135 us, about 3.1x) and adaptive layernorm (steady=71 us, dynamic=222 us, about 3.1x). So in general we now have dynamic latency of about 3x to 4x steady latency.
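The dynamic-to-steady ratios can be checked directly from the latencies quoted above. The helper below is only an illustrative calculation; the function name is made up for this sketch and the inputs are the microsecond figures from this comment.

```cpp
#include <cassert>
#include <cmath>

// Ratio of dynamic-shape host latency to steady-state host latency.
// Both arguments are in microseconds; steady_us must be non-zero.
double dynamicToSteadyRatio(double dynamic_us, double steady_us) {
  return dynamic_us / steady_us;
}
```

With the figures above, many-segments gives 1400 / 340 ≈ 4.1, many pointwise ops gives 135 / 43 ≈ 3.1, and adaptive layernorm gives 222 / 71 ≈ 3.1.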

Fixes #3419

jacobhinkle commented 3 days ago

!test

jacobhinkle commented 2 days ago

The H100 CI runners appear to have failed. This change is not arch-specific, though, and all other tests are passing.

jacobhinkle commented 2 days ago

!build

jacobhinkle commented 1 day ago

!build