Okay, so I don't fully understand this yet, but I'm going to put my thoughts out here.
The performance penalty is largely coming from the backward pass, where the parallel computation of GeLU is creating some interesting behavior. Without diving too deep into the derivatives, here's the dGeLU computation as I understand it, separated into portions below.
dGeLU(x) = (erf(x/1.4) + 1)/2 * dy + x*PDF(X=x)*dy
= A + (((2 * dy * x * f339)/1.77) * exp(-(x/1.4)^2)) / f337
= A + (B * C) / f337
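For concreteness, here is a small PyTorch sketch of that decomposition (the explicit sqrt(2)/sqrt(pi) constants and the final 1/(2*sqrt(2)) rescale are mine; in the trace they show up as the rounded 1.4/1.77 literals and the f337/f339 scalars):

```python
import math
import torch

def dgelu_exact(x: torch.Tensor, dy: torch.Tensor) -> torch.Tensor:
    """Backward of exact (erf-based) GeLU, split into the A and B*C pieces above."""
    A = 0.5 * (torch.erf(x / math.sqrt(2.0)) + 1.0) * dy  # Phi(x) * dy
    B = (2.0 * dy * x) / math.sqrt(math.pi)               # the "(2 * dy * x * f339) / 1.77" shape
    C = torch.exp(-(x / math.sqrt(2.0)) ** 2)             # exp(-(x/1.4)^2) = exp(-x^2/2)
    return A + (B * C) / (2.0 * math.sqrt(2.0))           # the 1/f337-style rescale

# Quick check against PyTorch's exact GeLU backward.
x = torch.randn(1024, dtype=torch.float64, requires_grad=True)
dy = torch.randn(1024, dtype=torch.float64)
torch.nn.functional.gelu(x).backward(dy)
assert torch.allclose(x.grad, dgelu_exact(x.detach(), dy))
```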
Now the Torch.compile hybridized executor trace looks very interesting here (trace attached for a single layer of the network) -
All of this is a single fusion block when we are not using the Torch.compile executor. Perhaps this happens because TorchCompile0 takes away the B*C computation before nvFuser can form blocks around the computation?
Both nvFusion2 and nvFusion3 are doing some other computation as well, so the performance penalty may not be as high (I don't know yet), but there are definitely some extra memory transfers happening here due to passing all these separate computations around, and that creates worse performance.
@IvanYashchuk @mruberry @tfogal - Maybe it is too early in the analysis to get your thoughts, but this felt crucial to the Torch.compile executor's perf on other models.
> Perhaps this happens because TorchCompile0 takes away the B*C computation before nvFuser can form blocks around the computation?
Yes, that's precisely what's happening. The partitioner is too aggressive: for the TorchCompile region the rule is "find a cat operation, then expand the fusion group to all supported operations (reshapes, slices, add, mul) that dataflow can reach", and this destroys the fusion opportunity for nvFuser because there's no communication between the two.
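In other words, the region grows by a dataflow expansion from the cat seed. Roughly (an illustrative sketch only, not the actual Thunder partitioner; the node/graph helpers are hypothetical):

```python
from collections import deque

# Hypothetical: ops the TorchCompile region is allowed to absorb.
SUPPORTED = {"cat", "pad", "slice", "reshape", "add", "mul"}

def expand_region(seed, producers, consumers):
    """BFS from the seed (a cat) over dataflow edges, claiming every reachable
    supported op. Everything claimed here is taken away from nvFuser, which is
    how the dGeLU computation ends up split across executors."""
    region, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor in list(producers(node)) + list(consumers(node)):
            if neighbor.op in SUPPORTED and neighbor not in region:
                region.add(neighbor)
                queue.append(neighbor)
    return region
```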
If we changed the executor order to place the nvFuser executor before the torch.compile executor, then I think we would see a single fusion block for dGeLU with nvFuser, and just cat would be sent to the TorchCompile region.
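A hedged sketch of that ordering experiment (the exact import path for the nvFuser executor and the `model` placeholder are assumptions; `torch_compile_cat_ex` is the executor defined in thunder/executors/torch_compile.py):

```python
import thunder
from thunder.executors.torch_compile import torch_compile_cat_ex
from thunder.executors.nvfuserex import nvfuserex  # assumed import path

# List nvFuser first so it gets the first chance to claim the dGeLU region,
# leaving only the cat-centric part for the torch.compile executor.
jmodel = thunder.jit(model, executors=[nvfuserex, torch_compile_cat_ex])
```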
There are bugs in the TorchCompile partitioner logic because, judging from the trace txt file, t1008 should not have been put into TorchCompile0; it's not used there. If we fix this, then there should be a single nvFuser region created.
This is probably not a super high priority item, but I am assuming people will try the executor on different models and could potentially see worse performance. Do you already have an idea of how to fix the TorchCompile partitioner so that it doesn't pick up nodes that aren't directly part of its neighborhood computation graph?
@IvanYashchuk Is this issue okay to move to the new open source repo?
Yes, I think it's okay to move it to the new repo, and it's important not to forget about this problem. I don't have concrete ideas on how to fix the partitioner.
Checking the performance today I see the following:
Thunder+TorchCompileCat: 247.54 ms
Thunder: 251.69 ms
torchrun --nproc_per_node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor_cat --distributed_mode=fsdp --nsys_enabled=False --micro_batch_size=1 --global_batch_size=8 --model_name=pythia-6.9b
torchrun --nproc_per_node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder --distributed_mode=fsdp --nsys_enabled=False --micro_batch_size=1 --global_batch_size=8 --model_name=pythia-6.9b
The numbers are worse than in Feb, but TorchCompileCat still brings some value here; that could also be due to regressions in FSDP. The major difference from Feb, I think, is that nvFuser now fuses cat since https://github.com/Lightning-AI/lightning-thunder/pull/35.
Here are the numbers for single-GPU execution; TorchCompileCat still improves performance a bit:
Thunder+TorchCompileCat: 208.25 ms
Thunder: 211.72 ms
python thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor_cat --micro_batch_size=1 --model_name=pythia-6.9b
python thunder/benchmarks/benchmark_litgpt.py --compile=thunder --micro_batch_size=1 --model_name=pythia-6.9b
TorchCompileCat is a hack for executing RoPE fusions; it's by accident that this executor also claims the backward of torch.split (which is a cat op) and consequently breaks nvFuser fusions. TorchCompileCat should be further constrained to apply just to RoPE, and we should reevaluate its performance on a wider range of models and microbatch sizes.
Let's try constraining it a bit:
diff --git a/thunder/executors/torch_compile.py b/thunder/executors/torch_compile.py
index 1e18c42e..3f20136c 100644
--- a/thunder/executors/torch_compile.py
+++ b/thunder/executors/torch_compile.py
@@ -198,10 +198,10 @@ from thunder.executors.torchex import ex as pytorch_ex
# since they would be competing over fusion opportunities. The advantage over simply doing `torch.compile` is that you
# still get all of Thunder's advantages, like enabling custom executors (e.g. with custom triton kernels) before it.
required_ops = {
- "torch.cat",
+ #"torch.cat",
prims.cat.id,
- prims.pad.id,
- prims.slice_prim.id,
+ #prims.pad.id,
+ #prims.slice_prim.id,
}
torch_compile_cat_ex = TorchCompileExecutor(name="torchcompile_cat", required_ops=required_ops)
register_executor(torch_compile_cat_ex)
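(To see the effect of a change like this, one way is to print Thunder's recorded traces and compare the fusion regions; a minimal sketch, assuming the usual thunder.jit / last_traces API and with `model`/`inputs` as placeholders:)

```python
import thunder

jmodel = thunder.jit(model)
jmodel(*inputs).sum().backward()

print(thunder.last_traces(jmodel)[-1])           # final forward execution trace
print(thunder.last_backward_traces(jmodel)[-1])  # final backward execution trace
```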
Here's how the forward trace is affected:
phi_a is not saved for backward anymore (https://github.com/Lightning-AI/lightning-thunder/blob/3034ef9efad3db43f9e94e3e9013ce853c0b3680/thunder/torch/__init__.py#L1426). Here's how the backward trace is affected:
Let's check how this change impacts our current microbenchmark (which should be revised and updated!):
pytest thunder/benchmarks/targets.py -k "test_llama2_qkv_split_rope_7b_train[thunder+nvfuser+torch.compile]"
The results are worse with this change on Llama 2 7B arch: 8.6 ms vs 19.5 ms, and for Pythia 6.9B it's 4.4 ms vs 9.8 ms. Running Pythia 6.9B on a single GPU with or without this change doesn't have an impact on perf though.
That's quite interesting.
> The major difference from Feb, I think, is that nvFuser now fuses cat since https://github.com/Lightning-AI/lightning-thunder/pull/35.
Do we know if nvFuser enabling cat is just functionally good or performant as well?
I believe the TorchCompileCat executor was added specifically for concats, so using it without cat is probably expected to be worse if nvFuser's cat support isn't as performant.
I can look into that. But even if thunder_inductor_cat improves perf a little bit, we may be leaving perf on the table: it improves RoPE perf but degrades other aspects of the trace, so there should still be similar room for improvement.
I also just dove into the trace again to see whether the behavior is the same, and it has changed. This is because the Thunder trace itself has changed a little bit. Rolling back to a previous comment (see above for the calculation):
dGeLU(x) = A + (B * C) / f337
Earlier, nvFusion computed B and C separately, TorchCompile computed B*C, and then nvFuser computed the final output, leading to 4 different fusion regions for this computation.
Now nvFusion computes B, C, and B*C in a single region, and TorchCompile only does A + B*C and produces the final output. Somehow, with the change in the Thunder trace, we don't see the same breaks in the nvFuser regions; we see only 2 fusion regions now. This is why the performance on Pythia looks decent now. This is quite interesting, and I think I have to re-analyze other networks like Phi and Dolly to confirm whether the TorchCompile cat executor is still a perf issue there.
> Do we know if nvFuser enabling cat is just functionally good or performant as well?
I don't know, and I've created https://github.com/Lightning-AI/lightning-thunder/pull/479 to make answering this question easier.
@kiya00, could you please help identify which executor options (nvFuser/Inductor/Apex) are best, separately for the forward and backward of this region, across all model configurations from LitGPT? The relevant benchmark is this one: https://github.com/Lightning-AI/lightning-thunder/blob/d1b016a58a48e5c6282622de488be8c9135dd821/thunder/benchmarks/targets.py#L535
pytest thunder/benchmarks/targets.py -k "test_litgpt_qkv_split_rope" --benchmark-group-by='param:config,param:bs,param:compute_type'
There's also an environment variable to launch more benchmarks: https://github.com/Lightning-AI/lightning-thunder/blob/d1b016a58a48e5c6282622de488be8c9135dd821/thunder/benchmarks/targets.py#L54
Hi @IvanYashchuk, here are some microbenchmark results for all LitGPT configurations, separately for forward and backward, with different executor options (1404 items). In almost all cases torch.compile seems better. I'm not very clear about the background of this issue; should this benchmark have better perf using thunder+nvfuser+torch.compile, at least on Pythia?
I printed out the forward/backward traces of test_litgpt_qkv_split_rope[pythia-1.4b-backward-bs1-thunder+nvfuser+torch.compile]; they both have only one TorchCompile0 region.
(The traces are on thunder 69e80f0a094376576a39306f62b9c510138e41fa; the perf log is a few days old, on thunder d1d581c401fb201d2f181c66bdc4281cf616c935.)
> I'm not very clear about the background of this issue; should this benchmark have better perf using thunder+nvfuser+torch.compile, at least on Pythia?
Thunder (and thunder+nvfuser+torch.compile) should have better performance for all cases. The purpose of these benchmarks is to evaluate the current situation and identify what needs to be done to improve performance for the worst-performing cases. Besides the logs, it would be useful to have a script that analyzes the JSON results from pytest-benchmark and creates a summary:
Include a summary of any other important information that is in the JSON files. A table could be useful, something like:
| Metric | Batch Size 1 | Batch Size 2 |
|---|---|---|
| Top Executor | Executor A | Executor B |
| Percentage of Configs Best for Executor | | |
| - Executor A | 60% | 50% |
| - Executor B | 30% | 40% |
| - Executor C | 10% | 10% |
| Gap Between Top Executor and Thunder | | |
| - Max Gap | 15 ms | 20 ms |
| - Min Gap | 1 ms | 2 ms |
| - Mean Gap | 8 ms | 10 ms |
| - Median Gap | 7 ms | 9 ms |
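A minimal sketch of such a summary script, assuming the standard pytest-benchmark --benchmark-json layout (a top-level "benchmarks" list with per-entry "params" and "stats"); the param names "config", "bs", "compute_type" come from the parametrization above, while the executor param name and the "thunder" label are assumptions:

```python
import json
from collections import defaultdict

def summarize(path: str, executor_param: str = "executor") -> None:
    with open(path) as f:
        benchmarks = json.load(f)["benchmarks"]

    # Group mean times by (config, bs, compute_type), one entry per executor.
    groups = defaultdict(dict)
    for b in benchmarks:
        p = b["params"]
        groups[(p["config"], p["bs"], p["compute_type"])][p[executor_param]] = b["stats"]["mean"]

    wins = defaultdict(int)
    gaps = []  # thunder time minus best time, per configuration
    for by_exec in groups.values():
        best = min(by_exec, key=by_exec.get)
        wins[best] += 1
        if "thunder" in by_exec:
            gaps.append(by_exec["thunder"] - by_exec[best])

    for ex, n in sorted(wins.items(), key=lambda kv: -kv[1]):
        print(f"{ex}: best in {100 * n / len(groups):.1f}% of configs")
    if gaps:
        gaps.sort()
        print(f"gap vs thunder (s): max={gaps[-1]:.4f} min={gaps[0]:.4f} "
              f"mean={sum(gaps) / len(gaps):.4f} median={gaps[len(gaps) // 2]:.4f}")
```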
If the batch size is increased, how do the results change? Maybe the overheads are too large for Thunder? How does the patch from https://github.com/Lightning-AI/lightning-thunder/issues/256#issuecomment-2136096575 affect the numbers for the thunder+nvfuser+torch.compile executor?
Maybe it's possible to get pure CUDA kernel times with a timer from nvFuser https://github.com/NVIDIA/Fuser/blob/18750278f9f20a817808dc1c63c0fb6962d37c9c/benchmarks/python/core.py#L209-L229.
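As a generic alternative to the nvFuser timer linked above, one could approximate pure kernel time with the PyTorch profiler; a rough sketch (the warmup and averaging choices are arbitrary):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def cuda_kernel_time_ms(fn, *args, warmup: int = 3, iters: int = 10) -> float:
    """Sum the GPU kernel durations recorded by the profiler (microseconds)
    and return the average per-iteration kernel time in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
    total_us = sum(evt.self_cuda_time_total for evt in prof.key_averages())
    return total_us / iters / 1000.0
```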
issue256_analysis.xlsx
Here is some initial information for reference (based on container 0614). When looking at one test case, test_litgpt_qkv_split_rope[pythia-1.4b-backward-bs1-thunder+nvfuser+torch.compile], the trace has one single TorchCompile0 region, but the mean time of thunder+nvfuser+torch.compile (148.1826) is much worse than torch.compile (108.1542).
> This is quite interesting, and I think I have to re-analyze other networks like Phi and Dolly to confirm whether the TorchCompile cat executor is still a perf issue there.
Hi @parthmannan, checking the performance today for the original problem with these commands (on H100, container 0626):
torchrun --nproc_per_node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor_cat --distributed_mode=fsdp --nsys_enabled=False --micro_batch_size=1 --global_batch_size=8 --model_name=pythia-6.9b
torchrun --nproc_per_node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder --distributed_mode=fsdp --nsys_enabled=False --micro_batch_size=1 --global_batch_size=8 --model_name=pythia-6.9b
python thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor_cat --micro_batch_size=1 --model_name=pythia-6.9b
python thunder/benchmarks/benchmark_litgpt.py --compile=thunder --micro_batch_size=1 --model_name=pythia-6.9b
| pythia-6.9b | zero2 | single gpu |
|---|---|---|
| thunder_inductor_cat | 234.84 ms | 208.45 ms |
| thunder | 233.02 ms | 210.97 ms |

| phi-2 | zero2 | single gpu |
|---|---|---|
| thunder_inductor_cat | 114.54 ms | 104.35 ms |
| thunder | 113.44 ms | 105.14 ms |

| dolly-v2-7b | zero2 | single gpu |
|---|---|---|
| thunder_inductor_cat | 233.81 ms | 208.45 ms |
| thunder | 233.26 ms | 211.03 ms |
The performance of the thunder_inductor_cat executor seems to be decent now. I'll dig into why the RoPE microbenchmark is not what we expected, but for the original problem in this issue, do we want more analysis on that?
cc: @IvanYashchuk
It's great that "thunder_inductor_cat" is now better for Phi-2 and Dolly, thank you for rerunning the benchmarks! I'm inclined to close this particular issue and start a new one specifically for RoPE microbenchmark performance. We need to better understand how impactful improvements to that microbenchmark are for full network runs.
Given this data, we can definitely close this issue. Thanks so much @kiya00 for re-running this and for the analysis. Just curious, do we know what changed in the partitioning logic such that the performance issues are gone now?
No, I don't know what was changed there.
🐛 Bug
The performance of using the hybridized torch.compile executor with Thunder is worse than plain Thunder on Pythia models. This set of models differs from the LLaMA architecture in a few main ways -
Example performance on H100, single node, FP16, for Pythia-6.9B, MBS=1, GBS=8, FSDP ZeRO2 w/o bucketing:
Thunder iteration time: 232.74 ms
Thunder + torch.compile iteration time: 239.23 ms
cc @crcrpar @apaz-cli