flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

fp8: add calibration scale for decode attention operators #273

Closed: yzh119 closed this pull request 4 months ago

yzh119 commented 4 months ago

@comaniac
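
For context on what the "calibration scale" in the title buys: below is a minimal sketch, not the PR's actual code, of computing a per-tensor scale from calibration data and storing the KV cache in fp8. The shapes, the e4m3 format, the helper names, and the use of `torch.float8_e4m3fn` (available in recent PyTorch builds) are assumptions for illustration only.

```python
import torch

# Assumed shapes for single-token decode: q is [num_qo_heads, head_dim],
# the KV cache is [num_kv_heads, seq_len, head_dim].
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448 for e4m3

def compute_scale(t: torch.Tensor) -> torch.Tensor:
    # Per-tensor calibration scale: map the observed dynamic range of the
    # calibration data onto the representable fp8 range.
    return t.abs().amax().float() / FP8_MAX

def quantize_fp8(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Store the tensor in fp8; the scale is kept separately (in fp32) and
    # re-applied at attention time.
    return (t.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

k = torch.randn(8, 1024, 128, dtype=torch.float16)
v = torch.randn(8, 1024, 128, dtype=torch.float16)
k_scale, v_scale = compute_scale(k), compute_scale(v)
k_fp8, v_fp8 = quantize_fp8(k, k_scale), quantize_fp8(v, v_scale)
```

One reason a single per-tensor scale is attractive: since q·(k_scale·k) = k_scale·(q·k), k_scale can be folded into the softmax scale and v_scale applied to the attention output, so no separate dequantization pass over the cache is needed.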

comaniac commented 4 months ago

Looks super clean lol. It'd be better to have a test to verify the correctness.
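
A correctness check along those lines could compare the fp8-plus-scale path against an fp16/fp32 reference. The sketch below reuses `compute_scale` and `quantize_fp8` from the snippet above and simulates the fp8 path by dequantizing with the calibration scales; the tolerance and the pure-PyTorch reference are placeholders for whatever the actual FlashInfer test harness would use (the real test would call the fp8 decode kernel instead).

```python
import torch

def ref_decode_attention(q, k, v):
    # Plain float32 single-token decode attention as the ground truth.
    # q: [num_heads, head_dim]; k, v: [num_heads, seq_len, head_dim].
    sm_scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hd,hnd->hn", q.float(), k.float()) * sm_scale
    probs = torch.softmax(logits, dim=-1)
    return torch.einsum("hn,hnd->hd", probs, v.float())

def test_fp8_decode_matches_fp16():
    torch.manual_seed(0)
    q = torch.randn(8, 128, dtype=torch.float16)
    k = torch.randn(8, 1024, 128, dtype=torch.float16)
    v = torch.randn(8, 1024, 128, dtype=torch.float16)

    k_scale, v_scale = compute_scale(k), compute_scale(v)
    k_fp8, v_fp8 = quantize_fp8(k, k_scale), quantize_fp8(v, v_scale)

    # Dequantize with the calibration scales; a real test would run the
    # fp8 decode kernel here instead of the reference.
    out_fp8 = ref_decode_attention(q, k_fp8.float() * k_scale, v_fp8.float() * v_scale)
    out_ref = ref_decode_attention(q, k, v)

    # Loose tolerance to absorb fp8 e4m3 rounding error.
    torch.testing.assert_close(out_fp8, out_ref, atol=5e-2, rtol=5e-2)
```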

yzh119 commented 4 months ago

I feel we can directly use fp16 for the query (but fp8 for the KV cache) to avoid possible accuracy loss, but let's merge this first.
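
To illustrate the tradeoff being described, a quick back-of-the-envelope check on top of the sketches above: quantize the query as well and compare the extra error against the fp16-query path. This only simulates quantization through the reference implementation; it does not call the kernels.

```python
# Continuing the sketch above: quantize the query too and compare errors.
q = torch.randn(8, 128, dtype=torch.float16)
q_scale = compute_scale(q)
q_fp8 = quantize_fp8(q, q_scale)

out_ref = ref_decode_attention(q, k, v)
out_q_fp16 = ref_decode_attention(q, k_fp8.float() * k_scale, v_fp8.float() * v_scale)
out_q_fp8 = ref_decode_attention(q_fp8.float() * q_scale,
                                 k_fp8.float() * k_scale, v_fp8.float() * v_scale)

print("max error, fp16 query + fp8 KV:", (out_q_fp16 - out_ref).abs().max().item())
print("max error, fp8 query + fp8 KV: ", (out_q_fp8 - out_ref).abs().max().item())
```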