Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Use nvFuser executor decisions to pass on op execution to a different backend and retire hybrid `torch_compile_cat_ex` executor. #446


parthmannan commented 3 months ago

πŸš€ Feature

The feature request is to add decision-making capabilities inside the nvFuser executor that allow it to reject/pass on certain op executions where other backends/executors are known to have better performance.
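A minimal sketch of the idea, with made-up names (there is no such hook in Thunder or nvFuser today); the point is only that the executor itself declines ops it is known to handle poorly so another executor can claim them:

```python
# Hypothetical sketch only: `candidate_ops` items with a `.name` attribute and
# the split below are invented for illustration, not existing Thunder/nvFuser APIs.

OPS_NVFUSER_HANDLES_POORLY = {"cat"}  # e.g. the concats inside RoPE

def pass_on_ops(candidate_ops):
    """Split a fusion candidate into ops nvFuser keeps and ops it passes on."""
    kept, passed_on = [], []
    for op in candidate_ops:
        (passed_on if op.name in OPS_NVFUSER_HANDLES_POORLY else kept).append(op)
    return kept, passed_on
```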

Motivation

The motivation for this is primarily to fix the longstanding performance issue #256.

In brief, the hybrid `torch_compile_cat_ex` executor was introduced to execute the RoPE module, because it contains several concat operations for which nvFuser is known not to have optimized performance. However, we don't yet have any direct pattern-matching capability that can map just RoPE to a different backend.

Instead, the hybrid executor looks for concat operations, tries to fuse the operations around them into a single region, and passes that region to torch.compile. This helps improve performance, but in some cases (e.g. Dolly v2, Phi, Pythia) the model trace is such that the hybrid executor breaks a potential single nvFuser region into multiple regions by consuming an op in the middle of the graph, which makes performance worse.

Summary: disabling the hybrid executor gives us poor performance on RoPE, while enabling it gives us poor performance due to smaller nvFuser regions.
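For reference, the trade-off above corresponds to including or omitting the hybrid executor when jitting a model; the import path, the `executors=` keyword, and `get_default_executors()` are from memory and may differ across Thunder versions:

```python
import thunder
from thunder.executors.torch_compile import torch_compile_cat_ex  # path may vary by version

model = ...  # e.g. a litGPT model that uses RoPE

# Hybrid executor enabled: RoPE concats go to TorchInductor, but it can split
# what would otherwise be a single nvFuser region (slow on Dolly v2, Phi, Pythia).
jitted_hybrid = thunder.jit(model, executors=[torch_compile_cat_ex, *thunder.get_default_executors()])

# Hybrid executor disabled: one large nvFuser region, but slow concat handling in RoPE.
jitted_plain = thunder.jit(model)
```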

Pitch

We now have a natively integrated TorchCompile executor that can take any Thunder subgraph and generate TorchInductor-optimized kernels for execution.

The pitch is for the Thunder trace to hand entire potential subgraphs, including the RoPE operations, to nvFuser, and for nvFuser to pass the operations it chooses not to execute on to this native TorchCompile executor. This way we are not relying on the hybrid executor to make good choices based on assumptions that may break from one model architecture to another. It could also enable future choices such as passing certain ops to another backend based on tensor shapes, etc.
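A rough sketch of what that handoff could look like, again with invented names (`Op`, `choose_executor`, `partition`) rather than real Thunder APIs; the shape-based rule is only an example of the kind of decision this would enable:

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """Stand-in for a traced operation; not a real Thunder class."""
    name: str
    output_shape: tuple = field(default_factory=tuple)

def choose_executor(op: Op) -> str:
    """Per-op decision: does nvFuser keep this op or pass it on?"""
    if op.name == "cat":
        return "torchcompile"   # known-slow in nvFuser (the RoPE case)
    if op.output_shape and op.output_shape[-1] < 8:
        return "torchcompile"   # example of a shape-based choice
    return "nvfuser"

def partition(subgraph: list[Op]) -> list[tuple[str, list[Op]]]:
    """Group a subgraph into contiguous (executor, ops) segments."""
    segments: list[tuple[str, list[Op]]] = []
    for op in subgraph:
        ex = choose_executor(op)
        if segments and segments[-1][0] == ex:
            segments[-1][1].append(op)
        else:
            segments.append((ex, [op]))
    return segments

# partition([Op("mul"), Op("cat"), Op("add")])
# -> [("nvfuser", [mul]), ("torchcompile", [cat]), ("nvfuser", [add])]
```

The key difference from the hybrid executor is that the region boundaries are decided by nvFuser itself, after it has seen the whole candidate subgraph, rather than by a separate pass guessing around concat ops.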

Alternatives

  1. Fix the hybrid executor. This would involve changing the way it consumes operations from the Thunder trace. For example, we might assume that the output of RoPE feeds into SDPA and force the hybrid executor not to consume any ops that feed into nvFuser regions, or only ops that feed into matmuls/SDPA. Whatever we do here will be a hack that might break on a different architecture.
  2. Add a pattern-matching solution for graphs (not limited to single ops) so that we can define a match-and-replace scheme that sends the RoPE module to another backend such as TorchCompile, APEX RoPE, etc. (a toy sketch follows this list). This is a valuable feature on its own merit and serves a much broader scope; naturally, it is also much more challenging to implement in the short term.
  3. As suggested by Tom in a Slack thread, a robust autotuning algorithm (with or without ML techniques) that selects the best executors for the full graph. Like option 2, this is a much broader feature and could be more challenging to implement.
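As a toy illustration of alternative 2, a matcher could look for a RoPE-shaped cluster of ops in the trace and tag the whole match for a single backend. The flat-list representation and the pattern below are invented for the example and are not Thunder's trace format:

```python
# Toy pattern matcher over a flat list of op names. A real trace is a graph with
# producer/consumer edges, so a real matcher would walk the graph, but the idea
# is the same: find the RoPE-shaped cluster and route the whole match elsewhere.
ROPE_PATTERN = ["mul", "neg", "cat", "mul", "add"]  # rough, invented stand-in for a RoPE subgraph

def find_pattern(op_names: list[str], pattern: list[str]) -> list[range]:
    """Return the index ranges where `pattern` occurs contiguously in `op_names`."""
    matches = []
    for start in range(len(op_names) - len(pattern) + 1):
        if op_names[start:start + len(pattern)] == pattern:
            matches.append(range(start, start + len(pattern)))
    return matches

# Each matched range would then be replaced as a unit and handed to one backend
# (e.g. the TorchCompile executor or an APEX RoPE kernel).
```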

cc @apaz-cli @lantiga @tfogal @mruberry @IvanYashchuk @jjsjann123 @kevinstephano @t-vi

[Long list of CC as it touches many pieces of Thunder]

parthmannan commented 3 months ago

Another alternative, one that doesn't require nvFuser to magically know what RoPE is or which group of ops is best sent to another executor:

A user-defined backend, specified as a decorator.

In litGPT, for example, if we have a PyTorch function that computes RoPE, we change the function to something like:

```python
@thunder.jit(force_backend=torch_compile_executor)
def apply_rope(inputs):
    ...
```

so that when Thunder constructs the trace, any operations traced inside this function are pushed to the forced executor backend.
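One way such a decorator could plausibly work under the hood (entirely hypothetical, not how Thunder is implemented): the decorator sets a context that the tracer reads, and every op recorded while the context is active gets tagged with the forced executor.

```python
import contextvars
from functools import wraps

# Hypothetical machinery for illustration only; Thunder's real tracing does not work like this.
_forced_executor = contextvars.ContextVar("forced_executor", default=None)

def force_backend(executor):
    """Tag every op traced inside the decorated function with `executor`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            token = _forced_executor.set(executor)
            try:
                return fn(*args, **kwargs)
            finally:
                _forced_executor.reset(token)
        return wrapper
    return decorator

def record_op(op):
    """Called by the (hypothetical) tracer for each traced operation."""
    op.forced_executor = _forced_executor.get()
```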

Pros:

Cons:

mruberry commented 2 months ago

triage review β€”