Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

[Question] Does training and inference use the same quantization method in FA3? #1196

Open moses3017 opened 1 month ago

moses3017 commented 1 month ago

As titled. cc @tridao @jayhshah

tridao commented 1 month ago

For FP8 we only support fwd pass for now.

moses3017 commented 1 month ago

Thank you for the reply.

  1. I saw that in the fwd code, the FP8 quantization parameters (such as descale_q, descale_k, ...) only support shape (1,). Will block-wise quantization be supported in the future, i.e. quantization parameters with shape (batch_size, nheads, seqlen // 128, headdim // 128), where 128 is the block size? (See the sketch after this list.)
  2. Are there similar milestones for the bwd pass and inference (e.g. a qkvpacked func)?
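To illustrate the distinction in question 1, here is a minimal sketch of per-tensor descale factors (shape (1,), matching what the current FP8 fwd interface expects) versus a hypothetical per-block layout. The helper names, the block size of 128, and the FP8 max of 448 (float8_e4m3fn) are assumptions for illustration, not part of the FA3 API:

```python
import torch

# Per-tensor FP8 quantization: one scale for the whole tensor,
# so the descale handed to the fwd kernel has shape (1,).
def per_tensor_descale(x: torch.Tensor, fp8_max: float = 448.0) -> torch.Tensor:
    amax = x.abs().amax()
    return (amax / fp8_max).reshape(1)  # shape (1,)

# Hypothetical per-block quantization (not in the current FA3 API):
# one scale per 128x128 tile of each (seqlen, headdim) slice, giving
# descale of shape (batch_size, nheads, seqlen // 128, headdim // 128).
def per_block_descale(x: torch.Tensor, block: int = 128, fp8_max: float = 448.0) -> torch.Tensor:
    b, h, s, d = x.shape
    tiles = x.reshape(b, h, s // block, block, d // block, block)
    amax = tiles.abs().amax(dim=(3, 5))
    return amax / fp8_max  # shape (b, h, s // block, d // block)

q = torch.randn(2, 16, 1024, 128)        # (batch, nheads, seqlen, headdim)
print(per_tensor_descale(q).shape)       # torch.Size([1])
print(per_block_descale(q).shape)        # torch.Size([2, 16, 8, 1])
```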