DachengLi1 / LongChat

Official repository for LongChat and LongEval
Apache License 2.0

Why does using flash attention in the inference stage lead to slower inference? #27

Closed xyfZzz closed 1 year ago

xyfZzz commented 1 year ago

Hi, I've seen the eval script mention that using flash attention will be slower. I am wondering why using flash attention in the inference stage leads to slower performance, since my impression is that flash attention speeds things up.

https://github.com/DachengLi1/LongChat/blob/a824bda25c0082e60973c35c79b0f35d69c6be2d/longeval/eval.py#L62

DachengLi1 commented 1 year ago

@xyfZzz Flash attention by itself does not support the kv_cache, so it is recomputing the cache again and again in our naive implementation. We have a member on the vLLM team working on better support.
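
For illustration, here is a minimal sketch (assumed, not the repo's actual code; names like `step_without_cache` and the single-head projections are hypothetical) of incremental decoding with and without a kv_cache, showing where the repeated work comes from:

```python
import torch

def attention(q, k, v):
    # Standard scaled dot-product attention over whatever keys/values are given.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Hypothetical projection weights for a single attention head.
d = 64
w_q = torch.randn(d, d)
w_k = torch.randn(d, d)
w_v = torch.randn(d, d)

def step_without_cache(hidden_states):
    # hidden_states: (batch, prefix_len, d) for the *entire* prefix so far.
    # K and V for every past token are recomputed at every decoding step,
    # so per-step cost grows with the prefix length.
    q = hidden_states[:, -1:, :] @ w_q   # query for the newest token only
    k = hidden_states @ w_k              # full-prefix projection, repeated each step
    v = hidden_states @ w_v
    return attention(q, k, v)

def step_with_cache(new_token, k_cache, v_cache):
    # new_token: (batch, 1, d). Only the new token is projected; past K/V
    # are read back from the cache, so per-step cost stays roughly constant.
    q = new_token @ w_q
    k_cache = torch.cat([k_cache, new_token @ w_k], dim=1)
    v_cache = torch.cat([v_cache, new_token @ w_v], dim=1)
    return attention(q, k_cache, v_cache), k_cache, v_cache
```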

xyfZzz commented 1 year ago

> @xyfZzz Flash attention by itself does not support the kv_cache, so it is recomputing the cache again and again in our naive implementation. We have a member on the vLLM team working on better support.

I understand. Thank you for your explanation!