Closed xyfZzz closed 1 year ago
@xyfZzz Flash attention by itself does not support a kv_cache, so our naive implementation recomputes the cache again and again. A member of the vLLM team is working on better support.
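To illustrate why recomputing the cache is slow, here is a minimal numpy sketch (not the actual LongChat/vLLM code) contrasting the naive path, which re-projects and re-attends over the entire prefix at every decode step, with a kv-cache path that projects each token once and appends it to a cache. The function and variable names are illustrative assumptions, not names from the repository.

```python
import numpy as np

def attention(q, k, v):
    # scaled dot-product attention of query q over all keys/values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decode_no_cache(tokens, proj_q, proj_k, proj_v):
    # naive path: every step re-projects the *entire* prefix, so the
    # total projection work grows quadratically with sequence length
    outs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        q = prefix[-1:] @ proj_q       # only the newest query is needed
        k = prefix @ proj_k            # recomputed from scratch each step
        v = prefix @ proj_v
        outs.append(attention(q, k, v))
    return np.vstack(outs)

def decode_with_cache(tokens, proj_q, proj_k, proj_v):
    # kv-cache path: project each new token once, append it to the
    # cache, and attend against the cached keys/values
    k_cache, v_cache, outs = [], [], []
    for t in range(len(tokens)):
        x = tokens[t:t + 1]
        k_cache.append(x @ proj_k)
        v_cache.append(x @ proj_v)
        q = x @ proj_q
        outs.append(attention(q, np.vstack(k_cache), np.vstack(v_cache)))
    return np.vstack(outs)
```

Both paths produce identical outputs; the cache only removes redundant work. A kernel that cannot consume an external kv_cache forces the caller onto the first path, which is why the naive flash-attention integration is slower despite the kernel itself being fast.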
I understand. Thank you for your explanation!
Hi, I've seen the eval script mention that using flash attention will be slower. I'm wondering why using flash attention at the inference stage leads to slower performance, since my impression is that flash attention is supposed to speed things up.
https://github.com/DachengLi1/LongChat/blob/a824bda25c0082e60973c35c79b0f35d69c6be2d/longeval/eval.py#L62