Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Don't use nvFuser for 1-op fusions of metadata manipulation ops #1251

Open tfogal opened 2 months ago

tfogal commented 2 months ago

🚀 Feature

Sometimes we can end up with a trace fragment where a single, lonesome op lands in an nvFusion region. For example, this program:

  class DynamoModule(torch.nn.Module):
    def forward(self, L_tensor_ : torch.Tensor):
        l_tensor_ = L_tensor_
        reduce = einops.einops.reduce(l_tensor_, 'b np sq sk -> (b np) sq sk', reduction = 'rearrange');  l_tensor_ = None
        return (reduce,)

produces this trace:

def computation(L_tensor_):
  # L_tensor_: "cuda:0 bf16[2, 40, 384, 384]"
  [tensor] = nvFusion0(L_tensor_)
    # tensor = prims.reshape(L_tensor_, (80, 384, 384))  # tensor: "cuda:0 bf16[80, 384, 384]"
  return {'output': (tensor,), 'flat_args': [L_tensor_], 'flat_output': (tensor,)}, ((), ())

Unfortunately it takes nvFuser 28 microseconds to do nothing: [profiler screenshot]

and we probably would have been better off just running things through torch.
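To see why torch is the better baseline here: in eager PyTorch, a reshape of a contiguous tensor is metadata-only. It returns a view that shares the input's storage, so no kernel launch and no copy happen at all. A minimal sketch (CPU tensor for simplicity; the same holds on CUDA):

```python
import torch

# Reshaping a contiguous tensor only rewrites shape/stride metadata;
# the result is a view over the same storage.
x = torch.randn(2, 40, 384, 384)
y = x.reshape(80, 384, 384)

# Same underlying data pointer -> no data movement occurred.
assert y.data_ptr() == x.data_ptr()
assert y.shape == (80, 384, 384)
```

Against a true no-op like this, any generated kernel (or even just the launch overhead of calling into nvFuser) can only lose.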

g51.py.txt - full sample program/benchmark.

We can (and will) do better in nvFuser on this, but there's not much of a reason to call a kernel generator to do work that it cannot possibly do better than the status quo.

Motivation

This is coming up through the ThunderFX path applied to the NeVA pretraining case.

Pitch

Ideally we'd check whether a single 'view' or 'reshape' op ends up being the entirety of a fusion group and, if so, reject the group, causing the op to fall back to the torch executor.
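The check itself is simple in principle. A hypothetical sketch follows; the function name, the op-name strings, and the set of "metadata-only" ops are assumptions for illustration, not Thunder's actual fusion-pass API:

```python
# Hypothetical predicate the fusion pass could consult before emitting an
# nvFusion region: reject groups that consist of exactly one metadata-
# manipulation op, so that op falls back to the torch executor instead.

# Assumed set of ops that only shuffle shape/stride metadata (illustrative).
METADATA_ONLY_OPS = {"reshape", "view", "transpose", "permute", "squeeze", "unsqueeze"}

def should_reject_fusion(op_names):
    """Return True if the candidate fusion group is a lone metadata-only op."""
    return len(op_names) == 1 and op_names[0] in METADATA_ONLY_OPS

# A lone reshape is rejected; a reshape fused with real compute is kept.
assert should_reject_fusion(["reshape"])
assert not should_reject_fusion(["reshape", "add"])
```

Groups containing any real compute would still go to nvFuser, so this only trims the degenerate single-op case described above.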

Alternatives

Of course, it might be better if we found a home for that single op in another fusion group, rather than just kicking it out of nvFuser entirely. It's not clear whether this case comes up in practice.

With the larger change of memory planning, we probably wouldn't need the solution advocated here: these single ops would either disappear entirely (if they are aliases) or require a real copy anyway (in which case calling a kernel generator performs about the same as not calling one).

Additional context

cc @tfogal

tfogal commented 1 day ago

cc @kshitij12345 since I mentioned it.

Don't go out of your way to check, but if you find this is relevant for Q4 models, then please send me a ping.