Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Use nvFuser executor decisions to pass on op execution to a different backend and retire hybrid `torch_compile_cat_ex` executor. #446


parthmannan commented 3 months ago

πŸš€ Feature

The feature request is to add decision-making capabilities inside the nvFuser executor that allow it to reject/pass on certain op executions where other backends/executors are known to have better performance.
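A minimal sketch of the idea, with made-up names (there is no such hook in Thunder or nvFuser today); the point is only that the executor itself declines ops it is known to handle poorly so another executor can claim them:

```python
# Hypothetical sketch only: `candidate_ops` items with a `.name` attribute and
# the split below are invented for illustration, not existing Thunder/nvFuser APIs.

OPS_NVFUSER_HANDLES_POORLY = {"cat"}  # e.g. the concats inside RoPE

def pass_on_ops(candidate_ops):
    """Split a fusion candidate into ops nvFuser keeps and ops it passes on."""
    kept, passed_on = [], []
    for op in candidate_ops:
        (passed_on if op.name in OPS_NVFUSER_HANDLES_POORLY else kept).append(op)
    return kept, passed_on
```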

Motivation

The motivation for this is primarily to fix the longstanding performance issue #256.

In brief, the hybrid `torch_compile_cat_ex` executor was introduced to execute the RoPE module, because it contains several concat operations for which nvFuser is known not to have optimized performance. However, we don't yet have any direct pattern-matching capability that can map just RoPE to a different backend.

Instead, the hybrid executor looks for concat operations, tries to fuse the operations around them into a single region, and passes that region to torch.compile. This helps improve performance, but in some cases (e.g. Dolly v2, Phi, Pythia) the model trace is such that the hybrid executor breaks a potential single nvFuser region into multiple regions by consuming an op in the middle of the graph, which makes performance worse.

Summary: disabling the hybrid executor gives us poor performance on RoPE, while enabling it gives us poor performance due to smaller nvFuser regions.
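For reference, the trade-off above corresponds to including or omitting the hybrid executor when jitting a model; the import path, the `executors=` keyword, and `get_default_executors()` are from memory and may differ across Thunder versions:

```python
import thunder
from thunder.executors.torch_compile import torch_compile_cat_ex  # path may vary by version

model = ...  # e.g. a litGPT model that uses RoPE

# Hybrid executor enabled: RoPE concats go to TorchInductor, but it can split
# what would otherwise be a single nvFuser region (slow on Dolly v2, Phi, Pythia).
jitted_hybrid = thunder.jit(model, executors=[torch_compile_cat_ex, *thunder.get_default_executors()])

# Hybrid executor disabled: one large nvFuser region, but slow concat handling in RoPE.
jitted_plain = thunder.jit(model)
```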

Pitch

We now have a natively integrated TorchCompile executor that can take any Thunder subgraph and generate TorchInductor-optimized kernels for execution.

The pitch is for the Thunder trace to hand entire potential subgraphs, including the RoPE operations, to nvFuser, and for nvFuser to pass the operations it chooses not to execute on to this native TorchCompile executor. This way we are not relying on the hybrid executor to make good choices based on assumptions that may break from one model architecture to another. It could also enable future choices such as passing certain ops to another backend based on tensor shapes, etc.
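A rough sketch of what that handoff could look like, again with invented names (`Op`, `choose_executor`, `partition`) rather than real Thunder APIs; the shape-based rule is only an example of the kind of decision this would enable:

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """Stand-in for a traced operation; not a real Thunder class."""
    name: str
    output_shape: tuple = field(default_factory=tuple)

def choose_executor(op: Op) -> str:
    """Per-op decision: does nvFuser keep this op or pass it on?"""
    if op.name == "cat":
        return "torchcompile"   # known-slow in nvFuser (the RoPE case)
    if op.output_shape and op.output_shape[-1] < 8:
        return "torchcompile"   # example of a shape-based choice
    return "nvfuser"

def partition(subgraph: list[Op]) -> list[tuple[str, list[Op]]]:
    """Group a subgraph into contiguous (executor, ops) segments."""
    segments: list[tuple[str, list[Op]]] = []
    for op in subgraph:
        ex = choose_executor(op)
        if segments and segments[-1][0] == ex:
            segments[-1][1].append(op)
        else:
            segments.append((ex, [op]))
    return segments

# partition([Op("mul"), Op("cat"), Op("add")])
# -> [("nvfuser", [mul]), ("torchcompile", [cat]), ("nvfuser", [add])]
```

The key difference from the hybrid executor is that the region boundaries are decided by nvFuser itself, after it has seen the whole candidate subgraph, rather than by a separate pass guessing around concat ops.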

Alternatives

  1. Fix the hybrid executor. This would involve changing the way it consumes operations from the Thunder trace. For example, we might assume that the output of RoPE feeds into SDPA and force the hybrid executor not to consume any ops that feed into nvFuser regions, or only ops that feed into matmuls/SDPA. Whatever we do here will be a hack that might break on a different architecture.
  2. Add a pattern-matching solution for graphs (not limited to single ops) so that we can define a match-and-replace scheme that sends the RoPE module to another backend such as TorchCompile, APEX RoPE, etc. (a toy sketch follows this list). This is a valuable feature on its own merit and serves a much broader scope; naturally, it is also much more challenging to implement in the short term.
  3. As suggested by Tom in a Slack thread, a robust autotuning algorithm (with or without ML techniques) that selects the best executors for the full graph. Like option 2, this is a much broader feature and could be more challenging to implement.
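As a toy illustration of alternative 2, a matcher could look for a RoPE-shaped cluster of ops in the trace and tag the whole match for a single backend. The flat-list representation and the pattern below are invented for the example and are not Thunder's trace format:

```python
# Toy pattern matcher over a flat list of op names. A real trace is a graph with
# producer/consumer edges, so a real matcher would walk the graph, but the idea
# is the same: find the RoPE-shaped cluster and route the whole match elsewhere.
ROPE_PATTERN = ["mul", "neg", "cat", "mul", "add"]  # rough, invented stand-in for a RoPE subgraph

def find_pattern(op_names: list[str], pattern: list[str]) -> list[range]:
    """Return the index ranges where `pattern` occurs contiguously in `op_names`."""
    matches = []
    for start in range(len(op_names) - len(pattern) + 1):
        if op_names[start:start + len(pattern)] == pattern:
            matches.append(range(start, start + len(pattern)))
    return matches

# Each matched range would then be replaced as a unit and handed to one backend
# (e.g. the TorchCompile executor or an APEX RoPE kernel).
```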

cc @apaz-cli @lantiga @tfogal @mruberry @IvanYashchuk @jjsjann123 @kevinstephano @t-vi

[Long list of CC as it touches many pieces of Thunder]

parthmannan commented 3 months ago

Another alternative, one that doesn't require nvFuser to magically know what RoPE is or which group of ops is best sent to another executor:

A user-defined backend, specified as a decorator.

In litGPT, for example, if we have a PyTorch function that computes RoPE, we change the function to something like:

```python
@thunder.jit(force_backend=torch_compile_executor)
def apply_rope(inputs):
    ...
```

so that when Thunder constructs the trace, any operations traced inside this function are pushed to the forced executor backend.
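One way such a decorator could plausibly work under the hood (entirely hypothetical, not how Thunder is implemented): the decorator sets a context that the tracer reads, and every op recorded while the context is active gets tagged with the forced executor.

```python
import contextvars
from functools import wraps

# Hypothetical machinery for illustration only; Thunder's real tracing does not work like this.
_forced_executor = contextvars.ContextVar("forced_executor", default=None)

def force_backend(executor):
    """Tag every op traced inside the decorated function with `executor`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            token = _forced_executor.set(executor)
            try:
                return fn(*args, **kwargs)
            finally:
                _forced_executor.reset(token)
        return wrapper
    return decorator

def record_op(op):
    """Called by the (hypothetical) tracer for each traced operation."""
    op.forced_executor = _forced_executor.get()
```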

Pros:

Cons:

mruberry commented 2 months ago

triage review β€”