Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

TransformerEngine's FP8 LayerNorm support #658

Open tfogal opened 2 days ago

tfogal commented 2 days ago

🚀 Feature

120 Mixology runs are failing due to:

raise ValueError("LayerNorm is currently not supported by Thunder!")

Additional context

set -e
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=600
python -m mixology_logs.execution.main \
--nsys.enable True \
--nsys.output_path /jet/assets/recipe/-gemma-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te_-none_-s_-lit-gpt/nsys_report \
--nsys.new_kwargs '{"--nsys_enabled": "True", "--output_dir": "/tmp"}' \
'{"--micro_batch_size": "exp_range(0, 10)"}' \
"python thunder/benchmarks/benchmark_litgpt.py \
    --max_iters 20 \
    --warmup_iters 5 \
    --output_dir /jet/logs/recipe/-gemma-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te_-none_-s_-lit-gpt \
    --model_name Gemma-7b \
    --distributed_mode ddp \
    --shard_mode None \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode fp8-delayed-te"
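
For context, the `fp8-delayed-te` mode name points at TransformerEngine's delayed-scaling FP8 recipe. Below is a minimal sketch of that setup; the recipe parameters and layer sizes are illustrative and not necessarily what benchmark_litgpt.py uses.

```python
# Minimal sketch of TransformerEngine's delayed-scaling FP8 execution.
# Recipe parameters and shapes are illustrative only.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # E4M3 for forward, E5M2 for gradients
    amax_history_len=16,        # history window used to update FP8 scales
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# FP8 kernels are only used inside the autocast context.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```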
t-vi commented 2 days ago

But what's the traceback? I don't see that error being raised anywhere in the thunder repo...

tfogal commented 2 days ago

I don't see that error being raised anywhere in the thunder repo...

Yes, ditto. We think this is coming from the Mixology scripts.

There's an internal thread with @wprazuch; stay tuned, we'll report back here. I can't seem to assign this to @wprazuch (?), so I'm assigning it to myself temporarily instead.

wprazuch commented 2 days ago

Thanks @tfogal for the heads-up! Yes, that is functionality we introduced internally in our fork: there was a request to additionally benchmark FP8 TransformerEngine for lit-gpt, so we added it there. I think this issue is not relevant to the main repository right now.

But since we are discussing it now: I could create a PR adding this functionality to the main repo, if you are interested in tracking and benchmarking FP8 as well. I wanted to do that some time ago but de-prioritized it due to other tasks. Let me know what you think.
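
For readers wondering what such a code path typically involves, here is a purely hypothetical sketch of swapping nn.LayerNorm for TransformerEngine's te.LayerNorm; this is not the fork's actual code (the helper name is made up), and the error above presumably comes from a guard around this kind of substitution when compiling with Thunder.

```python
# Hypothetical helper (not the fork's code): replace nn.LayerNorm modules with
# TransformerEngine's te.LayerNorm so they can participate in FP8 benchmarking.
import torch.nn as nn
import transformer_engine.pytorch as te

def swap_layernorm_for_te(module: nn.Module) -> nn.Module:
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            te_ln = te.LayerNorm(child.normalized_shape[-1], eps=child.eps)
            te_ln.weight.data.copy_(child.weight.data)   # preserve learned scale
            te_ln.bias.data.copy_(child.bias.data)       # preserve learned shift
            setattr(module, name, te_ln)
        else:
            swap_layernorm_for_te(child)
    return module
```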

tfogal commented 1 day ago

I could create a PR adding this functionality to the main repo, if you are interested in tracking and benchmarking FP8 as well. I wanted to do that some time ago but de-prioritized it due to other tasks. Let me know what you think.

That would be great! Yes, I do think we should be tracking FP8 perf over time.

wprazuch commented 23 hours ago

I will prepare the required changes then.