In the forward code, I saw that the FP8 quantization parameters (such as descale_q, descale_k, ...) only support shape (1,), i.e. per-tensor scaling. Will block quantization be supported in the future, with quantization parameters of shape (batch_size, nheads, seqlen // 128, headdim // 128), where 128 is the block size?
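To make the request concrete, here is a minimal sketch of the shape I have in mind. The helper name `per_block_descale_shape` is hypothetical, not part of the FlashAttention API; it just contrasts the current per-tensor (1,) descale with a per-block layout that stores one scale per 128x128 tile:

```python
BLOCK = 128  # assumed block size along both seqlen and headdim

def per_block_descale_shape(batch_size, nheads, seqlen, headdim, block=BLOCK):
    """Hypothetical shape of a per-block descale tensor:
    one scalar per (block x block) tile of each head."""
    assert seqlen % block == 0 and headdim % block == 0
    return (batch_size, nheads, seqlen // block, headdim // block)

# Current FA3 behavior: descale_q / descale_k / ... are per-tensor, shape (1,)
per_tensor_shape = (1,)

# Example: batch=2, heads=16, seqlen=4096, headdim=128
print(per_block_descale_shape(2, 16, 4096, 128))  # -> (2, 16, 32, 1)
```

The kernel would then index one descale scalar per tile instead of applying a single global scale to the whole tensor.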
Are there similar milestones planned for the backward pass and for inference (e.g. the qkvpacked functions)?
As titled. cc @tridao @jayhshah