Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

[Question] Does training and inference use the same quantization method in FA3? #1196

Open moses3017 opened 2 months ago

moses3017 commented 2 months ago

As titled. cc @tridao @jayhshah

tridao commented 2 months ago

For FP8 we only support the fwd pass for now.

moses3017 commented 2 months ago

Thank you for the reply.

  1. I saw in the fwd code that the FP8 quantization parameters (such as descale_q, descale_k, ...) only support shape (1,). Will block-wise quantization be supported in the future, i.e. quantization parameters of shape (batch_size, nheads, seqlen//128, headdim//128), where 128 is the block size? (See the sketch after this list for what the current per-tensor shape implies.)
  2. Are there similar milestones for the bwd pass and inference (like the qkvpacked funcs)?
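
For reference, here is a minimal sketch of what the current shape-(1,) descale corresponds to (per-tensor quantization), assuming a PyTorch build with float8 dtypes. The commented `flash_attn_func` call is illustrative only; check `hopper/flash_attn_interface.py` in this repo for the actual FA3 argument names and signature.

```python
import torch

def fp8_quantize_per_tensor(x: torch.Tensor):
    """Quantize x to float8_e4m3fn with a single, per-tensor descale factor.

    A shape-(1,) descale (what the current fwd kernel accepts) means one
    scale for the whole tensor. A block-wise scheme with descales of shape
    (batch_size, nheads, seqlen//128, headdim//128) would instead keep one
    descale per 128x128 block of each head.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max    # 448.0 for e4m3
    amax = x.abs().amax().float().clamp(min=1e-12)    # global absolute max
    scale = fp8_max / amax                            # multiply to fill the FP8 range
    descale = (1.0 / scale).reshape(1)                # shape (1,), passed to the kernel
    x_fp8 = (x.float() * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, descale

# Hypothetical usage (names illustrative, not the exact FA3 API):
#   q_fp8, descale_q = fp8_quantize_per_tensor(q)
#   k_fp8, descale_k = fp8_quantize_per_tensor(k)
#   v_fp8, descale_v = fp8_quantize_per_tensor(v)
#   out = flash_attn_func(q_fp8, k_fp8, v_fp8,
#                         descale_q=descale_q, descale_k=descale_k, descale_v=descale_v)
```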