facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

Does xformers support FP8? #1058

Open Godlovecui opened 3 months ago

Godlovecui commented 3 months ago

🚀 Feature

FP8 is very useful for training and inference of LLMs. Does xformers support FP8? Thank you~

danthe3rd commented 3 months ago

Hi, we have some very minimal support at the moment. Attention does not support it for now, but fused sequence parallel, for instance, does support it.
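
For reference, a minimal probe of the current behavior might look like the sketch below. It assumes a CUDA build of xformers and a PyTorch version with float8 dtypes; the exact exception type raised for unsupported inputs is an assumption, so it is caught broadly.

```python
# Minimal probe: fp16 inputs dispatch to a fused attention kernel, while float8 inputs
# are expected to be rejected, matching "attention does not support it for now".
import torch
import xformers.ops as xops

# Shape is [batch, seq_len, num_heads, head_dim] as expected by memory_efficient_attention.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k, v = q.clone(), q.clone()

out = xops.memory_efficient_attention(q, k, v)  # fp16 is supported
print(out.dtype)  # torch.float16

try:
    q8 = q.to(torch.float8_e4m3fn)
    k8 = k.to(torch.float8_e4m3fn)
    v8 = v.to(torch.float8_e4m3fn)
    xops.memory_efficient_attention(q8, k8, v8)
except (ValueError, NotImplementedError, RuntimeError) as e:
    # No fused attention operator currently accepts float8 inputs.
    print("FP8 inputs are not accepted by the fused attention kernels:", e)
```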

Godlovecui commented 3 months ago

Hi, @danthe3rd: What do you think is the key difficulty in fully supporting FP8? And do you have any schedule for full support in the future? Thank you~

danthe3rd commented 3 months ago

What support do you need for fp8? xFormers contains a lot of operators; I'm wondering which one you are thinking about in particular.

Godlovecui commented 3 months ago

Hi, @danthe3rd, I think self-attention is the key module in LLMs. If we could feed FP8 inputs into the self-attention module, it could use the FP8 tensor cores for acceleration. Is it possible to implement the fused self-attention kernel in FP8? Thank you~

danthe3rd commented 3 months ago

This is something that will eventually come into xFormers, but not in the very short term. Also, I'm curious whether you have any data to share regarding the numerics of fp8 attention: does it preserve model accuracy?
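
In the meantime, one hedged sketch of how FP8 could be combined with the existing fp16 attention kernel is to keep Q/K/V in float8 only for storage or communication and dequantize right before the call. The `quantize_fp8`/`dequantize_fp8` helpers and the per-tensor scaling scheme below are illustrative, not an xformers API, and the attention math itself still runs in fp16, so there is no FP8 tensor-core speedup inside the kernel.

```python
# Sketch: store Q/K/V in float8 with per-tensor scales, dequantize to fp16 before the
# fused attention call, since the attention kernels do not accept float8 inputs yet.
import torch
import xformers.ops as xops

def quantize_fp8(t: torch.Tensor, dtype=torch.float8_e4m3fn):
    # Per-tensor scale so the largest magnitude maps to the dtype's max (448 for e4m3fn).
    scale = t.abs().amax().clamp(min=1e-12) / torch.finfo(dtype).max
    return (t / scale).to(dtype), scale

def dequantize_fp8(t_fp8: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    return t_fp8.to(dtype) * scale

B, M, H, K = 2, 1024, 16, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# FP8 is only used for storage here; the compute still happens in fp16.
(q8, sq), (k8, sk), (v8, sv) = quantize_fp8(q), quantize_fp8(k), quantize_fp8(v)
out = xops.memory_efficient_attention(
    dequantize_fp8(q8, sq), dequantize_fp8(k8, sk), dequantize_fp8(v8, sv)
)
```

Whether this round-trip through float8 preserves model accuracy is exactly the open question raised above; per-head or per-block scaling would likely be needed in practice.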

Godlovecui commented 3 months ago

This article has some research on FP8 in FlashAttention-2: https://research.colfax-intl.com/adding-fp8-to-flashattention/, and the CUDA kernels are here: https://github.com/ColfaxResearch/cutlass-kernels. Maybe a small sacrifice in accuracy is acceptable with FP8. Thank you~