aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

Llama training with FP8 #331

Open pbelevich opened 4 months ago

pbelevich commented 4 months ago

@KeitaW thanks for the review! I was thinking about adding FP8 support to FSDP example, but there are two aspects why I decided to create a separate example for this:

  1. Transformer Engine requires NVIDIA's container to run (the alternative is a relatively complicated build from source with CUDA headers, cuDNN, etc.), and I don't want to complicate the FSDP example with that.
  2. This example is bound to the Llama model (taken from the TE examples), whereas the FSDP example supports multiple models that I don't want to rewrite for FP8.

So, in terms of importance, this example is about Llama with FP8; the FSDP training here is just scaffolding.
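
For readers landing here, a minimal sketch of what Transformer Engine's FP8 path looks like in PyTorch (the layer sizes and recipe settings below are illustrative, not taken from the example in this repo):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Delayed-scaling FP8 recipe: HYBRID uses E4M3 for activations/weights
# in the forward pass and E5M2 for gradients in the backward pass.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

# TE ships drop-in FP8-capable replacements for common layers.
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# GEMMs inside this context run in FP8 on supported GPUs (e.g. Hopper).
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

# Backward is called outside the autocast context, per TE's usage pattern.
y.sum().backward()
```

In the actual example the model is built from TE's Llama-style layers and then wrapped in FSDP around this kind of FP8 forward pass.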

sbhavani commented 4 months ago

@pbelevich FYI, the AWS Deep Learning Container (DLC) for PyTorch also includes TE.
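
Whichever container you pick, a quick smoke test can confirm TE is present and FP8 is usable on the current GPU. This sketch assumes TE's `check_fp8_support` helper, which recent releases expose:

```python
# Smoke test: verify Transformer Engine imports and FP8 is supported
# on this GPU. check_fp8_support() returns (supported, reason).
import transformer_engine.pytorch as te  # noqa: F401 (import check)
from transformer_engine.pytorch.fp8 import check_fp8_support

supported, reason = check_fp8_support()
print(f"FP8 supported: {supported} ({reason})")
```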