Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

block scaling support not found #1134

Open complexfilter opened 1 month ago

complexfilter commented 1 month ago

The FA3 paper says:

Accuracy: block quantization and incoherent processing. With FP8 (e4m3) format, one only uses 3 bits to store the mantissa and 4 bits for the exponent. This results in higher numerical error than FP16/BF16. Moreover, large models typically have outlier values [20, 54] that are much larger in magnitude than most other values, making quantization difficult. One typically uses per-tensor scaling [37] by keeping one scalar per tensor (e.g., one for Q, for K, and for V). To reduce the numerical error of attention in FP8, we employ two techniques:

1. Block quantization: we keep one scalar per block, so that for each of Q, K, V we split the tensor into blocks of size B_r × d or B_c × d and quantize them separately. This quantization can be fused with an operation right before attention (e.g., rotary embedding) with no additional slow down (since rotary embedding is memory-bandwidth bound). As the FlashAttention-3 algorithm naturally operates on blocks, we can scale each block of S to account for this block quantization at no computation cost.

But I don't see any support for block scaling in the actual repo.
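
For reference, my reading of the block quantization step is roughly the following (a minimal PyTorch sketch, not code from this repo; the block size of 128 and the e4m3 max value of 448 are my own assumptions):

```python
# Minimal sketch of per-block FP8 (e4m3) quantization as described in the quoted passage.
# Not the FlashAttention-3 implementation; block size and dtype handling are assumptions.
import torch

E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def quantize_per_block(x: torch.Tensor, block_size: int = 128):
    """Quantize x of shape (seqlen, d) to FP8 with one scale per block of rows.

    Returns the FP8 tensor and a (num_blocks,) tensor of per-block scales, so that
    x ~= x_fp8.float().view(-1, block_size, d) * scales[:, None, None].
    """
    seqlen, d = x.shape
    assert seqlen % block_size == 0, "sketch assumes seqlen divisible by block_size"
    blocks = x.float().view(seqlen // block_size, block_size, d)
    # One scalar per (block_size x d) block: amax of the block / e4m3 max.
    amax = blocks.abs().amax(dim=(1, 2)).clamp(min=1e-12)
    scales = amax / E4M3_MAX
    x_fp8 = (blocks / scales[:, None, None]).to(torch.float8_e4m3fn)
    return x_fp8.view(seqlen, d), scales

# Q, K, V would each be quantized separately like this; the kernel then rescales
# each block of S = Q K^T by scale_q[i] * scale_k[j], which the paper says is free
# because FlashAttention-3 already operates block by block.
q = torch.randn(1024, 128, dtype=torch.bfloat16)
q_fp8, q_scales = quantize_per_block(q, block_size=128)
```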

cly12188 commented 1 month ago

Block quantization seems to have its code publicly available, but it appears that the code for incoherent processing has not been released. Have you found where incoherent processing is implemented?
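
As far as I understand the paper, incoherent processing means multiplying Q and K by the same random orthogonal matrix M before quantizing: since M Mᵀ = I, (QM)(KM)ᵀ = QKᵀ, so the attention scores are unchanged, but outliers get spread out and quantize with less error. Roughly something like this (just my own sketch of the idea, not the released code; the paper builds M from random ±1 diagonals and Hadamard matrices so the multiply costs O(d log d)):

```python
# Sketch of the incoherent-processing idea from the FA3 paper (not the repo's code).
# Q and K are rotated by the same random orthogonal matrix M before quantization;
# because M @ M.T = I, the attention scores Q @ K.T are mathematically unchanged.
import torch

def random_orthogonal(d: int) -> torch.Tensor:
    # Dense QR factorization used here only for clarity; the paper builds M from
    # random +/-1 diagonal matrices and Hadamard matrices for an O(d log d) multiply.
    m, _ = torch.linalg.qr(torch.randn(d, d))
    return m

d = 128
q = torch.randn(1024, d)
k = torch.randn(1024, d)
m = random_orthogonal(d)

scores_ref = q @ k.T
scores_rot = (q @ m) @ (k @ m).T
# Identical up to floating-point rounding; quantization would happen on q @ m and k @ m.
print((scores_ref - scores_rot).abs().max())
```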

goldhuang commented 1 month ago

@cly12188 Could you share a link to the block quantization implementation?