@fahadh4ilyas Sorry for the confusion. The code is intended to be used with `use_cache=False` and `output_attentions=False`. We will clean up the `use_cache` logic in the next refactoring. Thanks for letting us know!
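For concreteness, a minimal sketch of the intended training-time usage. The import path and patch function name below are assumptions (check the repo's monkey-patch module for the actual names); the point is calling the model with `use_cache=False` and `output_attentions=False`:

```python
import torch
from transformers import LlamaForCausalLM

# Hypothetical import/name -- see the repo's monkey-patch module for the real one.
from longchat.train.monkey_patch.llama_xformer_monkey_patch import (
    replace_llama_attn_with_xformer,
)

replace_llama_attn_with_xformer()  # apply the patch before building the model
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

input_ids = torch.randint(0, 32000, (1, 128))  # dummy batch for illustration
# Training-style forward: no KV cache, no attention weights requested.
outputs = model(
    input_ids, labels=input_ids, use_cache=False, output_attentions=False
)
loss = outputs.loss
```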
So, `past_key_value` is not supported in xformers?
Right, it is intended for training, where `past_key_value` is never provided.
Okay, that makes sense. Which one is faster: flash attention or xformers?
In my benchmarks during experiments, they perform about the same. xformers supports V100, but flash attention does not.
What about for inference? Doesn't inference need `past_key_value`? Does this mean xformers can't be used for inference?
Yes, xformers and flash attention are not intended for inference in their most naive usage. We have a more advanced system that supports this, and some models are already supported by it in FastChat. LongChat is the next todo.
Okay right... Thank you for the answer...
@DachengLi1 Hi, so if one needs to support V100 training, is it better to use xformers?
BTW, does full-parameter fine-tuning work with xformers or flash attention?
I'm confused by the script `longchat/train/monkey_patch/llama_xformer_monkey_patch.py`. I thought that not allowing the `use_cache` parameter to be `True` meant `past_key_value` would not be used or returned. But somehow `past_key_value` is still used, while `attn_weights` is always `None`. Did you mean to set ... instead of ...?
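For context, the behavior being asked about follows from how such patches use xformers: the memory-efficient kernel never materializes the full attention-probability matrix, so there are no weights to return. A rough sketch of the pattern (not the repo's exact code; shapes and names are illustrative, and running it needs a GPU with xformers installed):

```python
import torch
import xformers.ops as xops

def patched_attention_core(q, k, v):
    # q, k, v: (batch, seq_len, n_heads, head_dim)
    attn_output = xops.memory_efficient_attention(
        q, k, v, attn_bias=xops.LowerTriangularMask()  # causal masking
    )
    # HF attention forwards return (attn_output, attn_weights, past_key_value);
    # the xformers kernel never builds the attention matrix, so attn_weights
    # can only be None, and no KV cache is assembled for training.
    return attn_output, None, None

q = k = v = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.float16)
out, weights, cache = patched_attention_core(q, k, v)
assert weights is None
```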