Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Thunder and ThunderFX are slower than torch.compile for FP8 and falcon-7b and other models #1365

Open mpatel31415 opened 2 days ago

mpatel31415 commented 2 days ago

🐛 Bug

As can be seen below, Thunder is slower than torch.compile for single-GPU training of falcon-7b:

[image: benchmark table, Thunder vs. torch.compile, falcon-7b single-GPU training]

Below are the results for ThunderFX for multi-GPU training:

[image: benchmark table, ThunderFX multi-GPU training]

Batch sizes and sharding modes don't match across configurations, but these are the fastest options for ThunderFX.

To Reproduce

Steps to reproduce the behavior:

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name falcon-7b \
    --compile thunder \
    --low_precision_mode fp8-delayed-te  \
    --micro_batch_size 1

Expected behavior

Thunder should be as fast as torch.compile.
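For context, a minimal, framework-agnostic sketch of the kind of per-step timing comparison the benchmark tables report. This is a hypothetical helper, not code from `benchmark_litgpt.py`; the key detail it illustrates is that warmup iterations must be discarded, since both `thunder.jit` and `torch.compile` pay their compilation cost in the first few calls:

```python
import time

def median_step_time(step_fn, warmup=3, iters=10):
    """Return the median wall-clock time of one call to step_fn.

    Warmup calls are run first and discarded, so one-time
    compilation cost (e.g. from thunder.jit or torch.compile)
    does not pollute the measurement.
    """
    for _ in range(warmup):
        step_fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]
```

In a real comparison, `step_fn` would be one training step (forward, backward, optimizer) of the compiled model, and the same harness would be run once for the Thunder-compiled model and once for the torch.compile one.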

Environment

system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.2.004
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.8
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.20+git85c22a2
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+git96b30dc
libraries.pip.torchmetrics: 1.5.1
libraries.pip.torchvision: 0.19.0a0+d23a6e1

mpatel31415 commented 2 days ago

Actually, we see the same behavior for other models. Is one issue enough to track all of them? Below are the results:

[image: benchmark table, results for additional models]