@KeitaW thanks for the review! I was thinking about adding FP8 support to the FSDP example, but there are two reasons why I decided to create a separate example instead:
1. Transformer Engine requires Nvidia's container to run (the alternative is a relatively complicated build from source with CUDA headers, cuDNN, etc.), and I don't want to complicate the FSDP example with that.
2. This example is bound to the Llama model (taken from the TE examples), whereas the FSDP example supports multiple models that I don't want to rewrite with FP8.
So, in terms of importance, this example is about Llama with FP8; the FSDP training here is just scaffolding.