PyTorch has support for float8 matmul kernels, and they appear to be faster than bf16 on Ada and newer architectures. TorchAO supports training in fp8. This has been explored in several recent optimization examples for Flux and other large models to achieve real-time image generation. I think we could explore this for training CogVideoX and see how it pans out.
Since this might take some time to profile properly, it is low priority, but it is definitely worth exploring since some other training libraries/UIs are looking into this too.
Relevant links:
@sayakpaul @zRzRzRzRzRzRzR