Don't use nvFuser for 1-op fusions of metadata manipulation ops #1251
🚀 Feature
Sometimes we end up with a fragment where a single lonesome op is the only thing in an nvFusion region. For example, this program:
produces this trace:
Unfortunately it takes nvFuser 28 microseconds to do nothing:
and we probably would have been better off just running things through torch.
g51.py.txt - full sample program/benchmark.
We can (and will) do better in nvFuser on this, but there's not much of a reason to call a kernel generator to do work that it cannot possibly do better than the status quo.
Motivation
This is coming up through the ThunderFX path applied to the NeVA pretraining case.
Pitch
Ideally we'd check whether a single 'view' or 'reshape' op ends up being the entirety of a fusion group and reject the group if so, causing the op to fall back to the torch executor.
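A minimal sketch of that check, written in plain Python rather than against Thunder's actual fusion pass. The `FusionGroup` type, the `METADATA_OPS` set, and the `assign_executors` helper are hypothetical names introduced for illustration only; Thunder's real data structures and executor-claiming logic are not shown here and will differ.

```python
from dataclasses import dataclass

# Hypothetical set of metadata-only ops; the issue names 'view' and 'reshape'.
METADATA_OPS = frozenset({"view", "reshape"})


@dataclass(frozen=True)
class FusionGroup:
    """Hypothetical stand-in for a candidate fusion region: op names in order."""
    ops: tuple


def is_lone_metadata_op(group: FusionGroup) -> bool:
    """True when the whole group is a single metadata-manipulation op."""
    return len(group.ops) == 1 and group.ops[0] in METADATA_OPS


def assign_executors(groups):
    """Pair each group with 'nvfuser' or 'torch', rejecting 1-op metadata fusions."""
    return [
        ("torch" if is_lone_metadata_op(g) else "nvfuser", g)
        for g in groups
    ]


groups = [
    FusionGroup(("reshape",)),             # lone metadata op: falls back to torch
    FusionGroup(("add", "mul", "view")),   # real fusion: stays with nvFuser
    FusionGroup(("exp",)),                 # single op, but not metadata-only
]
for executor, group in assign_executors(groups):
    print(executor, group.ops)
```

The check is deliberately narrow: only a group that consists of exactly one metadata op is rejected, so a 'view' that sits inside a larger fusion is left alone.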
Alternatives
Of course, it might be better if we found a home for that single op in another fusion group, rather than just kicking it out of nvFuser entirely. It's not clear whether this case comes up in practice.
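As a rough illustration of this alternative, the sketch below folds a lone metadata op into the preceding group instead of ejecting it. The `absorb_lone_metadata_groups` helper and the `METADATA_OPS` set are hypothetical, and the sketch assumes that neighbouring groups can legally be merged, which a real pass would have to verify against data dependencies and executor support.

```python
# Hypothetical set of metadata-only ops; the issue names 'view' and 'reshape'.
METADATA_OPS = frozenset({"view", "reshape"})


def absorb_lone_metadata_groups(groups):
    """Merge each 1-op metadata group into the preceding group, if there is one.

    `groups` is a list of lists of op names in trace order. Merging is assumed
    to be dependency-safe here; that assumption is not checked.
    """
    merged = []
    for ops in groups:
        lone_metadata = len(ops) == 1 and ops[0] in METADATA_OPS
        if lone_metadata and merged:
            merged[-1] = merged[-1] + list(ops)  # fold into the previous group
        else:
            merged.append(list(ops))
    return merged


print(absorb_lone_metadata_groups([["add", "mul"], ["reshape"], ["exp", "sum"]]))
```

A lone metadata op with no preceding group is left as its own group in this sketch, so it would still need the fallback described in the pitch.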
With the larger change of memory planning, we probably wouldn't need a solution as advocated here, as these single ops would either disappear (if they are aliases) or need a real copy anyway (in which case calling a kernel generator is probably about the same as not).
Additional context
cc @tfogal