flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

fp8: add calibration scale for decode attention operators #273

Closed: yzh119 closed this pull request 4 months ago

yzh119 commented 4 months ago

@comaniac
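
For context on what the "calibration scale" in the title buys: below is a minimal sketch, not the PR's actual code, of computing a per-tensor scale from calibration data and storing the KV cache in fp8. The shapes, the e4m3 format, the helper names, and the use of `torch.float8_e4m3fn` (available in recent PyTorch builds) are assumptions for illustration only.

```python
import torch

# Assumed shapes for single-token decode: q is [num_qo_heads, head_dim],
# the KV cache is [num_kv_heads, seq_len, head_dim].
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448 for e4m3

def compute_scale(t: torch.Tensor) -> torch.Tensor:
    # Per-tensor calibration scale: map the observed dynamic range of the
    # calibration data onto the representable fp8 range.
    return t.abs().amax().float() / FP8_MAX

def quantize_fp8(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Store the tensor in fp8; the scale is kept separately (in fp32) and
    # re-applied at attention time.
    return (t.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

k = torch.randn(8, 1024, 128, dtype=torch.float16)
v = torch.randn(8, 1024, 128, dtype=torch.float16)
k_scale, v_scale = compute_scale(k), compute_scale(v)
k_fp8, v_fp8 = quantize_fp8(k, k_scale), quantize_fp8(v, v_scale)
```

One reason a single per-tensor scale is attractive: since q·(k_scale·k) = k_scale·(q·k), k_scale can be folded into the softmax scale and v_scale applied to the attention output, so no separate dequantization pass over the cache is needed.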

comaniac commented 4 months ago

Looks super clean lol. It'd be better to have a test to verify the correctness.
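
A correctness check along those lines could compare the fp8-plus-scale path against an fp16/fp32 reference. The sketch below reuses `compute_scale` and `quantize_fp8` from the snippet above and simulates the fp8 path by dequantizing with the calibration scales; the tolerance and the pure-PyTorch reference are placeholders for whatever the actual FlashInfer test harness would use (the real test would call the fp8 decode kernel instead).

```python
import torch

def ref_decode_attention(q, k, v):
    # Plain float32 single-token decode attention as the ground truth.
    # q: [num_heads, head_dim]; k, v: [num_heads, seq_len, head_dim].
    sm_scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hd,hnd->hn", q.float(), k.float()) * sm_scale
    probs = torch.softmax(logits, dim=-1)
    return torch.einsum("hn,hnd->hd", probs, v.float())

def test_fp8_decode_matches_fp16():
    torch.manual_seed(0)
    q = torch.randn(8, 128, dtype=torch.float16)
    k = torch.randn(8, 1024, 128, dtype=torch.float16)
    v = torch.randn(8, 1024, 128, dtype=torch.float16)

    k_scale, v_scale = compute_scale(k), compute_scale(v)
    k_fp8, v_fp8 = quantize_fp8(k, k_scale), quantize_fp8(v, v_scale)

    # Dequantize with the calibration scales; a real test would run the
    # fp8 decode kernel here instead of the reference.
    out_fp8 = ref_decode_attention(q, k_fp8.float() * k_scale, v_fp8.float() * v_scale)
    out_ref = ref_decode_attention(q, k, v)

    # Loose tolerance to absorb fp8 e4m3 rounding error.
    torch.testing.assert_close(out_fp8, out_ref, atol=5e-2, rtol=5e-2)
```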

yzh119 commented 4 months ago

I feel we can directly use fp16 for the query (but fp8 for the KV cache) to avoid possible accuracy loss, but let's merge this first.
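
To illustrate the tradeoff being described, a quick back-of-the-envelope check on top of the sketches above: quantize the query as well and compare the extra error against the fp16-query path. This only simulates quantization through the reference implementation; it does not call the kernels.

```python
# Continuing the sketch above: quantize the query too and compare errors.
q = torch.randn(8, 128, dtype=torch.float16)
q_scale = compute_scale(q)
q_fp8 = quantize_fp8(q, q_scale)

out_ref = ref_decode_attention(q, k, v)
out_q_fp16 = ref_decode_attention(q, k_fp8.float() * k_scale, v_fp8.float() * v_scale)
out_q_fp8 = ref_decode_attention(q_fp8.float() * q_scale,
                                 k_fp8.float() * k_scale, v_fp8.float() * v_scale)

print("max error, fp16 query + fp8 KV:", (out_q_fp16 - out_ref).abs().max().item())
print("max error, fp8 query + fp8 KV: ", (out_q_fp8 - out_ref).abs().max().item())
```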