ROCm / triton

Development repository for the Triton language and compiler
MIT License

Added a reference implementation with fake quantization. #528

Open wenchenvincent opened 4 months ago

wenchenvincent commented 4 months ago

The advantage of this reference over the f16 reference is that it already includes the error introduced by fp8 quantization. The remaining difference between the fake-quantization reference and the flash attention kernel therefore comes mainly from the reordering of compute operations in flash attention. A large error is thus likely caused by an implementation bug rather than by quantization.
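To illustrate the idea, here is a minimal pure-Python sketch of a fake-quantization reference. It is not the code from this PR. The names `fake_quant_fp8`, `attention_ref`, and `max_abs_err` are hypothetical, and the fp8 format (3 mantissa bits, maximum 448, no subnormal handling) and the choice to quantize only Q, K, and V are assumptions made for illustration.

```python
import math

FP8_MAX = 448.0  # assumption: largest finite value of an e4m3-style format
MANT_BITS = 3    # assumption: number of explicit mantissa bits


def fake_quant_fp8(x, scale=1.0):
    """Round x onto an fp8-like grid but return it as an ordinary float.

    Simplified on purpose: the minimum exponent and subnormals are ignored.
    """
    y = max(-FP8_MAX, min(FP8_MAX, x / scale))  # saturate to the fp8 range
    if y == 0.0:
        return 0.0
    _, e = math.frexp(y)                # |y| lies in [2**(e-1), 2**e)
    step = 2.0 ** (e - 1 - MANT_BITS)   # spacing of representable values there
    return round(y / step) * step * scale


def attention_ref(q, k, v, quantize=False):
    """Naive softmax(Q K^T / sqrt(d)) V on nested lists.

    With quantize=True, Q, K and V are fake-quantized first (assumption:
    only the inputs are quantized), so the result already carries the
    fp8 rounding error while all arithmetic stays in full precision.
    """
    fq = fake_quant_fp8 if quantize else (lambda x: x)
    q = [[fq(x) for x in row] for row in q]
    k = [[fq(x) for x in row] for row in k]
    v = [[fq(x) for x in row] for row in v]
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)                          # subtract max for stability
        p = [math.exp(s - m) for s in scores]
        z = sum(p)
        out.append([sum(pj / z * vj[c] for pj, vj in zip(p, v))
                    for c in range(len(v[0]))])
    return out


def max_abs_err(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    q = [[0.3, -1.7], [2.2, 0.9]]
    k = [[1.1, 0.4], [-0.6, 1.3]]
    v = [[0.7, -0.2], [1.9, 0.5]]
    ref_full = attention_ref(q, k, v, quantize=False)
    ref_fake = attention_ref(q, k, v, quantize=True)
    # This gap is the quantization error alone; a kernel under test would be
    # compared against ref_fake so that this part is already accounted for.
    print(max_abs_err(ref_full, ref_fake))
```

In a test, the fp8 kernel output would be compared with `attention_ref(..., quantize=True)` rather than the full-precision result, which allows a tighter tolerance because the quantization error is present on both sides.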