Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Thunder and ThunderFX are slower than torch.compile for FP8 and falcon-7b and other models #1365

Open mpatel31415 opened 2 days ago

mpatel31415 commented 2 days ago

🐛 Bug

As can be seen below, Thunder is slower than torch.compile for single-GPU training of falcon-7b:

[image: benchmark table, Thunder vs. torch.compile, falcon-7b single-GPU training]

Below are the results for ThunderFX for multi-GPU training:

[image: benchmark table, ThunderFX multi-GPU training]

Batch sizes and sharding modes don't match across configurations, but these are the fastest options for ThunderFX.

To Reproduce

Steps to reproduce the behavior:

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name falcon-7b \
    --compile thunder \
    --low_precision_mode fp8-delayed-te  \
    --micro_batch_size 1

Expected behavior

Thunder should be as fast as torch.compile.
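For context, a minimal, framework-agnostic sketch of the kind of per-step timing comparison the benchmark tables report. This is a hypothetical helper, not code from `benchmark_litgpt.py`; the key detail it illustrates is that warmup iterations must be discarded, since both `thunder.jit` and `torch.compile` pay their compilation cost in the first few calls:

```python
import time

def median_step_time(step_fn, warmup=3, iters=10):
    """Return the median wall-clock time of one call to step_fn.

    Warmup calls are run first and discarded, so one-time
    compilation cost (e.g. from thunder.jit or torch.compile)
    does not pollute the measurement.
    """
    for _ in range(warmup):
        step_fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]
```

In a real comparison, `step_fn` would be one training step (forward, backward, optimizer) of the compiled model, and the same harness would be run once for the Thunder-compiled model and once for the torch.compile one.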

Environment

system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.2.004
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.8
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.20+git85c22a2
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+git96b30dc
libraries.pip.torchmetrics: 1.5.1
libraries.pip.torchvision: 0.19.0a0+d23a6e1

mpatel31415 commented 2 days ago

Actually, we see the same behavior for other models. Is one issue enough to track all of them? Below are the results:

[image: benchmark table, results for additional models]