Open parthmannan opened 3 months ago
Another alternative that doesn't require nvFuser to magically know what RoPE is or which group of ops is best sent to another executor:
In litGPT, for example, if we have a PyTorch function that implements RoPE, we change the function to something like

```python
@thunder.jit(force_backend=torch_compile_executor)
def apply_rope(inputs):
    ...
```

so that when Thunder constructs the trace, any operations traced inside this function are pushed to the forced executor backend.
Pros:
Cons:
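To make the proposal above concrete, here is a minimal pure-Python sketch of how such a routing tag could work. Note that `force_backend` is not a real `thunder.jit` argument today; the decorator, the `_forced_backend` attribute, and `backend_for` are all hypothetical names used only to illustrate the idea of tagging a function so a trace-time partitioner can route its ops.

```python
# Hypothetical sketch of the proposed decorator: tag a function so that a
# trace-time dispatcher can route every op traced inside it to one executor.

def force_backend(backend):
    """Attach a (hypothetical) forced-backend tag to a function."""
    def decorate(fn):
        fn._forced_backend = backend  # the tracer would read this attribute
        return fn
    return decorate

@force_backend("torch_compile")
def apply_rope(x):
    ...  # RoPE math elided; only the routing tag matters for this sketch

def backend_for(fn, default="nvfuser"):
    """What a partitioner could consult when assigning ops to executors."""
    return getattr(fn, "_forced_backend", default)
```

With this shape, `backend_for(apply_rope)` resolves to `"torch_compile"` while untagged functions fall back to the default executor.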
🚀 Feature
The feature request is to add decision-making capabilities inside the nvFuser executor that allow it to reject or pass on certain op executions where other backends/executors are known to have better performance.
Motivation
The motivation for this is primarily to fix the longstanding performance issue #256
In brief, the hybrid `torch_compile_cat_ex` executor was introduced to execute the RoPE module because it contains a number of concat operations for which nvFuser is known to have unoptimized performance. However, we don't yet have any direct pattern-matching capability that can map just RoPE to a different backend. To work around that, the hybrid executor finds concat operations, tries to fuse the surrounding operations into a single region, and passes that region to torch.compile. This improves performance, but in some cases (like Dolly v2, Phi, Pythia, etc.) the model trace is such that the hybrid executor ends up breaking a potential single nvFuser region into multiple regions by consuming an op in the middle of the graph, which makes performance worse.
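The region-splitting problem described above can be illustrated with a toy partitioner. This is not Thunder's actual fusion logic; the ops are a simplified linear chain and `partition` is an illustrative helper, but it shows how one claimed op in the middle of a fusible chain turns one region into three.

```python
# Toy partitioner: ops form a linear chain; contiguous ops assigned to the
# same executor form one fusion region. An executor that claims an op in the
# middle of the chain splits the surrounding region in two.

def partition(ops, claimed_by_other):
    """Group a linear op sequence into contiguous per-executor regions."""
    regions = []
    for op in ops:
        executor = "torch_compile" if op in claimed_by_other else "nvfuser"
        if regions and regions[-1][0] == executor:
            regions[-1][1].append(op)  # extend the current region
        else:
            regions.append((executor, [op]))  # start a new region
    return regions

ops = ["mul", "add", "cat", "add", "mul"]

# Hybrid executor disabled: a single nvFuser region (but the cat is slow).
print(partition(ops, claimed_by_other=set()))

# Hybrid executor enabled: claiming "cat" breaks one region into three.
print(partition(ops, claimed_by_other={"cat"}))
```

The second call yields three regions (nvFuser, torch.compile, nvFuser), which is exactly the fragmentation that hurts models like Dolly v2 and Pythia.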
Summary: disabling the hybrid executor gives us poor RoPE performance; enabling it gives us poor performance due to smaller nvFuser regions.
Pitch
We now have a natively integrated TorchCompile executor that is capable of taking any Thunder subgraph and generating TorchInductor optimized kernels for execution.
The pitch is for the Thunder trace to pass entire potential subgraphs, including RoPE operations, to nvFuser, and for nvFuser to pass the operations it chooses not to execute on to this native TorchCompile executor. This way, we are not relying on the hybrid executor to make good choices based on assumptions that may break from one model architecture to another. It could also enable future choices such as routing certain ops to another backend based on tensor shapes.
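A minimal sketch of this first-refusal flow, assuming a hypothetical acceptance hook (this is not Thunder's or nvFuser's real executor interface): nvFuser sees each op first and may decline it, in which case the op falls through to the TorchCompile executor. The concat rule and the size threshold below are assumed heuristics, included only to show how a shape-based decision could plug in.

```python
# Hypothetical first-refusal assignment: nvFuser gets right of first refusal
# on every op; declined ops fall through to the TorchCompile executor.

SMALL_TENSOR_THRESHOLD = 1024  # assumed cutoff, purely illustrative

def nvfuser_accepts(op, shape):
    """Assumed policy: decline concat ops, and decline tiny tensors where
    another backend might win (the shape-based routing from the pitch)."""
    if op == "cat":
        return False
    numel = 1
    for dim in shape:
        numel *= dim
    return numel >= SMALL_TENSOR_THRESHOLD

def assign_executors(ops):
    """Map each (op, shape) pair to the executor that will run it."""
    return [(op, "nvfuser" if nvfuser_accepts(op, shape) else "torch_compile")
            for op, shape in ops]

ops = [("mul", (2048, 64)), ("cat", (2048, 128)), ("add", (4, 4))]
print(assign_executors(ops))
```

Here the large `mul` stays on nvFuser, while the `cat` and the tiny `add` fall through to torch.compile. The key difference from the current hybrid executor is that the decision is made by nvFuser itself, with the whole subgraph in view, rather than by an upstream pass guessing at fusion boundaries.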
Alternatives
cc @apaz-cli @lantiga @tfogal @mruberry @IvanYashchuk @jjsjann123 @kevinstephano @t-vi
[Long list of CC as it touches many pieces of Thunder]