Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Don't use nvFuser for 1-op fusions of metadata manipulation ops #1251

Open tfogal opened 2 months ago

tfogal commented 2 months ago

🚀 Feature

Sometimes we can end up with a trace fragment where a single, lonesome op lands in an nvFusion region. For example, this program:

  class DynamoModule(torch.nn.Module):
    def forward(self, L_tensor_ : torch.Tensor):
        l_tensor_ = L_tensor_
        reduce = einops.einops.reduce(l_tensor_, 'b np sq sk -> (b np) sq sk', reduction = 'rearrange');  l_tensor_ = None
        return (reduce,)

produces this trace:

def computation(L_tensor_):
  # L_tensor_: "cuda:0 bf16[2, 40, 384, 384]"
  [tensor] = nvFusion0(L_tensor_)
    # tensor = prims.reshape(L_tensor_, (80, 384, 384))  # tensor: "cuda:0 bf16[80, 384, 384]"
  return {'output': (tensor,), 'flat_args': [L_tensor_], 'flat_output': (tensor,)}, ((), ())

Unfortunately it takes nvFuser 28 microseconds to do nothing: [profiler screenshot]

and we probably would have been better off just running things through torch.
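To see why torch is the better baseline here: in eager PyTorch, a reshape of a contiguous tensor is metadata-only. It returns a view that shares the input's storage, so no kernel launch and no copy happen at all. A minimal sketch (CPU tensor for simplicity; the same holds on CUDA):

```python
import torch

# Reshaping a contiguous tensor only rewrites shape/stride metadata;
# the result is a view over the same storage.
x = torch.randn(2, 40, 384, 384)
y = x.reshape(80, 384, 384)

# Same underlying data pointer -> no data movement occurred.
assert y.data_ptr() == x.data_ptr()
assert y.shape == (80, 384, 384)
```

Against a true no-op like this, any generated kernel (or even just the launch overhead of calling into nvFuser) can only lose.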

g51.py.txt - full sample program/benchmark.

We can (and will) do better in nvFuser on this, but there's not much of a reason to call a kernel generator to do work that it cannot possibly do better than the status quo.

Motivation

This is coming up through the ThunderFX path applied to the NeVA pretraining case.

Pitch

Ideally we'd check whether a single 'view' or 'reshape' op ends up being the entirety of a fusion group and, if so, reject the group, causing the op to fall back to the torch executor.
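The check itself is simple in principle. A hypothetical sketch follows; the function name, the op-name strings, and the set of "metadata-only" ops are assumptions for illustration, not Thunder's actual fusion-pass API:

```python
# Hypothetical predicate the fusion pass could consult before emitting an
# nvFusion region: reject groups that consist of exactly one metadata-
# manipulation op, so that op falls back to the torch executor instead.

# Assumed set of ops that only shuffle shape/stride metadata (illustrative).
METADATA_ONLY_OPS = {"reshape", "view", "transpose", "permute", "squeeze", "unsqueeze"}

def should_reject_fusion(op_names):
    """Return True if the candidate fusion group is a lone metadata-only op."""
    return len(op_names) == 1 and op_names[0] in METADATA_ONLY_OPS

# A lone reshape is rejected; a reshape fused with real compute is kept.
assert should_reject_fusion(["reshape"])
assert not should_reject_fusion(["reshape", "add"])
```

Groups containing any real compute would still go to nvFuser, so this only trims the degenerate single-op case described above.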

Alternatives

Of course, it might be better if we found a home for that single op in another fusion group, rather than just kicking it out of nvFuser entirely. It's not clear whether this case comes up in practice.

With the larger change of memory planning, we probably wouldn't need the solution advocated here: these single ops would either disappear entirely (if they are aliases) or require a real copy anyway (in which case calling a kernel generator performs about the same as not calling one).

Additional context

cc @tfogal

tfogal commented 1 day ago

cc @kshitij12345 since I mentioned it.

Don't go out of your way to check, but if you find this is relevant for Q4 models, then please send me a ping.