The benefit of this reference over the f16 reference is that it already includes the quantization error introduced by fp8 quantization. The difference between this fake-quantized reference and the flash attention kernel is therefore caused mainly by the reordering of compute operations in flash attention, so a large error likely indicates an implementation bug rather than quantization error.
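A minimal sketch of such a fake-quantized reference, in NumPy. The rounding helper simulates fp8 e4m3 (finite variant, max 448) assuming a per-tensor scale of 1.0, and the reference assumes Q, K, V, and the softmax probabilities P are the quantized tensors; both assumptions should be adjusted to match what the actual kernel quantizes and how it scales.

```python
import numpy as np

def fake_fp8_e4m3(x):
    """Round values to the nearest fp8 e4m3 value (finite variant, max 448).

    Simplified simulation: assumes a per-tensor scale of 1.0. A real kernel
    would pick a scale so the tensor fits the e4m3 range before rounding.
    """
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    a = np.minimum(np.abs(x), 448.0)              # clamp to e4m3 max finite value
    e = np.floor(np.log2(np.maximum(a, 2.0 ** -9)))
    e = np.clip(e, -6, 8)                          # normal exponent range; below -6 is subnormal
    step = 2.0 ** (e - 3)                          # 3 mantissa bits -> grid spacing 2^(e-3)
    return sign * np.round(a / step) * step

def softmax(s):
    m = s.max(axis=-1, keepdims=True)              # subtract row max for stability
    p = np.exp(s - m)
    return p / p.sum(axis=-1, keepdims=True)

def attention_fakequant_ref(q, k, v):
    """Reference attention with fake fp8 quantization of Q, K, V, and P.

    The matmuls and softmax run in float64, so any remaining gap to the real
    fp8 flash attention kernel comes from the kernel's reordered computation,
    not from quantization itself.
    """
    d = q.shape[-1]
    qq, kq, vq = fake_fp8_e4m3(q), fake_fp8_e4m3(k), fake_fp8_e4m3(v)
    s = qq @ kq.T / np.sqrt(d)
    p = fake_fp8_e4m3(softmax(s))
    return p @ vq
```

In use, one would compare `np.abs(kernel_out - attention_fakequant_ref(q, k, v)).max()` against a much tighter tolerance than would be possible against an unquantized f16 reference.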