Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

[ci]: We should add a CI flow with TransformerEngine installed so that we can run the relevant tests. #196

Open kshitij12345 opened 2 months ago

kshitij12345 commented 2 months ago

We don't have a CI flow for testing the TransformerEngine (TE) executor.

It would be great to have a CI flow with TE installed so that we can run the relevant tests and catch breakages early. It could also be enabled only before merge, or triggered with a GitHub comment.

NOTE: TE needs to be built from source at the moment (@xwang233 knows how to set this up in Docker).
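For context, here is a hypothetical sketch of how TE-dependent tests are commonly gated in pytest (the names below are illustrative, not Thunder's actual test code). Without a CI image that has TE installed, tests gated this way are silently skipped, which is why breakages can go unnoticed:

```python
import pytest

# Hypothetical sketch: gate TE-dependent tests on the package being importable.
try:
    import transformer_engine  # noqa: F401

    TE_AVAILABLE = True
except ImportError:
    TE_AVAILABLE = False

requires_te = pytest.mark.skipif(
    not TE_AVAILABLE, reason="TransformerEngine is not installed"
)


@requires_te
def test_te_executor_smoke():
    # On a CI runner without TE, this test is skipped rather than run,
    # so regressions in the TE executor are never caught.
    ...
```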

cc @borda

Borda commented 2 months ago

@kshitij12345 could you please share a reference to TE and, ideally, how it needs to be installed?

IvanYashchuk commented 2 months ago

Here are the instructions: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source

Borda commented 2 months ago

ok, will check it later this week

kshitij12345 commented 2 months ago

As highlighted by @carmocca (thanks!), our CI runs on 3090s, which don't have FP8 support. We should pursue this once the CI has GPUs with compute capability 8.9 or higher.

GPU compute capability reference: https://developer.nvidia.com/cuda-gpus
TE check for FP8 support: https://github.com/NVIDIA/TransformerEngine/blob/f85553ea369da15fd726ab279818e415be48a228/transformer_engine/pytorch/fp8.py#L23-L34
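For illustration, a minimal sketch of the kind of check the linked TE code performs (an assumption-based paraphrase, not TE's exact implementation):

```python
import torch


def has_fp8_support() -> bool:
    # Sketch of the gating logic: FP8 execution requires compute
    # capability 8.9 (Ada) or newer, e.g. L4/L40/H100. An RTX 3090
    # is compute capability 8.6, so it cannot run the FP8 tests.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```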

Borda commented 2 months ago

cc @t-vi @lantiga As far as I know, there is a plan to change the GPU used for CI.

lantiga commented 4 weeks ago

#209 for reference