Godlovecui opened 3 months ago
Hi, we have only very minimal support at the moment. The attention operators do not support it for now, but fused sequence parallel does, for instance.
Hi @danthe3rd: what do you think is the key difficulty in fully supporting FP8? Or do you have a timeline for full support in the future? Thank you~
What support do you need for FP8? xFormers contains a lot of operators, so I'm wondering which ones you are thinking about in particular.
Hi @danthe3rd, I think self-attention is the key module in LLMs. If we could feed FP8 inputs into the self-attention module, it could use the FP8 tensor cores for acceleration. Would it be possible to implement a fused self-attention kernel in FP8? Thank you~
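To illustrate what feeding FP8 inputs usually involves (this is a hypothetical sketch, not xFormers code): formats like E4M3 have a small dynamic range (max finite value 448), so tensors are typically scaled per tensor before casting, and the scale is carried along to dequantize results. A minimal pure-Python sketch of that scaling step:

```python
# Hypothetical per-tensor scaling as used before casting to FP8 (E4M3).
# Not xFormers code; 448.0 is E4M3's maximum finite value.
E4M3_MAX = 448.0

def quantize_per_tensor(values, fp8_max=E4M3_MAX):
    """Scale values into the FP8 range and clamp; return (scaled, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / fp8_max if amax > 0 else 1.0
    scaled = [max(-fp8_max, min(fp8_max, v / scale)) for v in values]
    return scaled, scale

def dequantize(scaled, scale):
    """Recover the original magnitudes from the scaled values."""
    return [v * scale for v in scaled]

# A query row with outliers well outside the FP8 range:
q = [0.5, -1200.0, 3.25, 900.0]
q_fp8, s = quantize_per_tensor(q)
q_back = dequantize(q_fp8, s)
```

In practice the cast to the actual 8-bit format happens after this scaling; the attention kernel then has to consume the scales for Q, K, and V and rescale its accumulator, which is part of why fused FP8 attention is more involved than just changing the input dtype.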
This is something that will eventually come to xFormers, but not in the very short term. Also, I'm curious whether you have any data to share regarding the numerics of FP8 attention: does it preserve model accuracy?
This article has some research on FP8 in FlashAttention-2: https://research.colfax-intl.com/adding-fp8-to-flashattention/ with the CUDA kernels here: https://github.com/ColfaxResearch/cutlass-kernels. Maybe a small sacrifice in accuracy is acceptable with FP8. Thank you~
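One quick way to gauge the accuracy question raised above is to run the same attention computation twice, once in full precision and once with the Q/K/V inputs rounded to FP8-like precision, and compare the outputs. The sketch below is a toy illustration (not xFormers or the Colfax kernels): it emulates E4M3's 3 mantissa bits by rounding the mantissa, ignoring exponent-range overflow for simplicity.

```python
import math

def fp8_round(x, mantissa_bits=3):
    """Round x to a float with the given mantissa width (E4M3 has 3 bits).
    Simplified: ignores E4M3's exponent range and overflow behavior."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    grid = 2 ** (mantissa_bits + 1)
    return math.ldexp(round(m * grid) / grid, e)

def softmax(row):
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    s = sum(exps)
    return [v / s for v in exps]

def attention(Q, K, V):
    """Reference scaled dot-product attention on plain nested lists."""
    d = len(Q[0])
    out = []
    for qr in Q:
        scores = [sum(qv * kv for qv, kv in zip(qr, kr)) / math.sqrt(d)
                  for kr in K]
        w = softmax(scores)
        out.append([sum(wi * vr[j] for wi, vr in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# Tiny made-up Q/K/V (2 tokens, head dim 2) just for the comparison:
Q = [[0.1, 0.7], [0.33, -0.2]]
K = [[0.5, 0.9], [-0.4, 0.25]]
V = [[1.1, 0.3], [0.6, -0.8]]

quant = lambda M: [[fp8_round(v) for v in row] for row in M]
ref = attention(Q, K, V)
low = attention(quant(Q), quant(K), quant(V))

# Worst-case elementwise deviation introduced by FP8-style inputs:
err = max(abs(a - b) for ra, rb in zip(ref, low) for a, b in zip(ra, rb))
```

On toy inputs like this the deviation is small but nonzero; whether that error is acceptable end-to-end is exactly the model-accuracy question, and it depends on scaling strategy and which tensors (inputs only vs. the softmax accumulator too) are kept in FP8.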
🚀 Feature
FP8 is very useful for LLM training and inference. Does xFormers support FP8? Thank you~