int8 kv cache and Flash Attention cannot be used together

InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

https://lmdeploy.readthedocs.io/en/latest/

Apache License 2.0

Closed SeibertronSS closed 3 months ago

SeibertronSS commented 3 months ago

I had been working from the LMDeploy 0.14.0 code and found that int8 kv cache and Flash Attention could not be used together. This was fixed in later versions, though. What caused the problem?

lzhangzz commented 3 months ago

In the current version, when prefill is needed, the kv cache is dequantized first.
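
To illustrate the idea, here is a minimal NumPy sketch of "dequantize the int8 kv cache, then run attention in floating point". It is not LMDeploy's actual kernel: the per-token asymmetric quantization scheme, the function names (`quantize_int8`, `dequantize_int8`, `attention`), and the plain non-causal softmax attention standing in for Flash Attention are all assumptions made for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric per-row int8 quantization (assumed scheme, for illustration only)."""
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    scale = np.maximum(x_max - x_min, 1e-8) / 255.0
    zero = x_min + 128.0 * scale  # float value that is stored as int8 0
    q = np.clip(np.round((x - zero) / scale), -128, 127).astype(np.int8)
    return q, scale, zero

def dequantize_int8(q, scale, zero):
    """Recover an approximate float tensor from int8 values plus scale and zero."""
    return q.astype(np.float32) * scale + zero

def attention(q, k, v):
    """Plain softmax attention; a stand-in for a float-only fused attention kernel."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((16, 64)).astype(np.float32)  # cached keys
v_cache = rng.standard_normal((16, 64)).astype(np.float32)  # cached values
query = rng.standard_normal((4, 64)).astype(np.float32)     # prefill queries

# The cache is stored as int8 ...
k_q, k_scale, k_zero = quantize_int8(k_cache)
v_q, v_scale, v_zero = quantize_int8(v_cache)

# ... and dequantized before the float attention kernel reads it.
out = attention(query,
                dequantize_int8(k_q, k_scale, k_zero),
                dequantize_int8(v_q, v_scale, v_zero))
```

The point of the sketch is that the attention routine only ever sees float inputs, so it does not need to understand the int8 storage format. The cost is an extra dequantization pass over the cache before attention runs.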

SeibertronSS commented 3 months ago

OK, got it.