Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Thunder's torch.compile executor may perform differently from torch.compile #711
Note: this issue is opened to document the difference rather than to request a fix.
When analyzing the microbenchmark performance of RoPE, Thunder's torch.compile executor (a trace with a single TorchCompile0 fusion) sometimes performs worse than torch.compile.
Take tiny-llama-1.1b as an example:
pytest thunder/benchmarks/targets.py -k "tiny-llama-1.1b-forward-bs2-thunder+nvfuser+torch.compile] or tiny-llama-1.1b-forward-bs2-torch.compile]"
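For reference, the rotary position embedding (RoPE) computation exercised by this benchmark can be sketched as follows. This is a minimal NumPy version of the common "rotate-half" convention used by LLaMA-style models; the actual benchmark runs litgpt's implementation on GPU tensors, so names and shapes here are illustrative only.

```python
import numpy as np

def apply_rope(x, cos, sin):
    # x: (..., head_dim). Rotate each pair formed by the two halves of the
    # head dimension (the "rotate_half" convention).
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = np.concatenate([-x2, x1], axis=-1)
    return x * cos + rotated * sin

# Toy example: one head of dimension 4 at position pos=1.
head_dim, pos = 4, 1
inv_freq = 1.0 / (10000 ** (np.arange(0, head_dim, 2) / head_dim))
angles = pos * inv_freq                      # shape (head_dim/2,)
cos = np.concatenate([np.cos(angles)] * 2)   # shape (head_dim,)
sin = np.concatenate([np.sin(angles)] * 2)
q = np.ones(head_dim)
print(apply_rope(q, cos, sin))
```

Note that the function body is a chain of small elementwise and data-movement ops (slice, negate, concatenate, multiply, add), which is exactly the kind of region a fusion compiler is expected to collapse into one or two kernels.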
| Name (time in µs) | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS (Kops/s) | Rounds | Iterations |
|---|---|---|---|---|---|---|---|---|---|
| test_litgpt_qkv_split_rope[tiny-llama-1.1b-forward-bs2-torch.compile] | 43.4808 (1.0) | 70.7150 (1.0) | 45.8740 (1.0) | 3.9031 (1.0) | 44.7072 (1.0) | 1.1832 (1.0) | 144;203 | 21.7988 (1.0) | 2300 | 10 |
| test_litgpt_qkv_split_rope[tiny-llama-1.1b-forward-bs2-thunder+nvfuser+torch.compile] | 222.3384 (5.11) | 345.1165 (4.88) | 229.5844 (5.00) | 18.0420 (4.62) | 224.9852 (5.03) | 1.9018 (1.61) | 126;210 | 4.3557 (0.20) | 2253 | 2 |
The trace of Thunder:
Thunder's torch.compile executor generates 4 Triton kernels:
torch.compile generates 2 kernels:
A likely cause is that Thunder passes the decomposed operators to torch.compile, which changes its fusion decisions and therefore the resulting performance.
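The effect can be illustrated with a toy sketch (the op names and decompositions below are hypothetical, not Thunder's actual primitives): after decomposition, the downstream compiler no longer sees the high-level ops, only a longer list of primitives, and its fusion heuristics may partition that list differently.

```python
# Hypothetical illustration: a trace of high-level ops is lowered to
# primitives before being handed to a downstream fusion compiler.
HIGH_LEVEL = ["qkv_split", "apply_rope"]

# Made-up decomposition rules for illustration only.
DECOMPOSITIONS = {
    "qkv_split": ["reshape", "permute", "split"],
    "apply_rope": ["slice", "slice", "neg", "cat", "mul", "mul", "add"],
}

def decompose(ops):
    """Replace each op by its primitive decomposition, if one exists."""
    out = []
    for op in ops:
        out.extend(DECOMPOSITIONS.get(op, [op]))
    return out

print(decompose(HIGH_LEVEL))
```

The compiler that receives `decompose(HIGH_LEVEL)` has no way to recognize the original `apply_rope` boundary, so it may group the primitives into a different (and possibly larger) set of kernels than it would when compiling the high-level source directly.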
cc @crcrpar @apaz-cli