@fahadh4ilyas Sorry for the confusion. The code is intended to be used with `use_cache=False` and `output_attentions=False`. We will clean up the `use_cache` logic in the next refactoring. Thanks for letting us know!
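For concreteness, a minimal sketch of the intended training-time usage. The import path and patch function name below are assumptions (check the repo's monkey-patch module for the actual names); the point is calling the model with `use_cache=False` and `output_attentions=False`:

```python
import torch
from transformers import LlamaForCausalLM

# Hypothetical import/name -- see the repo's monkey-patch module for the real one.
from longchat.train.monkey_patch.llama_xformer_monkey_patch import (
    replace_llama_attn_with_xformer,
)

replace_llama_attn_with_xformer()  # apply the patch before building the model
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

input_ids = torch.randint(0, 32000, (1, 128))  # dummy batch for illustration
# Training-style forward: no KV cache, no attention weights requested.
outputs = model(
    input_ids, labels=input_ids, use_cache=False, output_attentions=False
)
loss = outputs.loss
```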
So, `past_key_value` is not supported in xformers?
Right, it is intended for training, where `past_key_value` is never provided.
Okay, that makes sense. Which one is faster: flash attention or xformers?
In my benchmarks during experiments, they perform about the same. xformers supports V100, but flash attention does not.
What about for inference? Doesn't inference need `past_key_value`? Does this mean xformers can't be used for inference?
Yes, xformers and flash attention are not intended for inference in their most naive usage. We have a more advanced system that supports this, and some models are already supported by it in FastChat. LongChat is the next todo.
Okay right... Thank you for the answer...
@DachengLi1 Hi, so if one needs to support V100 training, is it better to use xformers?
BTW, does full-parameter fine-tuning work with xformers or flash attention?
I'm confused by the script `longchat/train/monkey_patch/llama_xformer_monkey_patch.py`. I thought that not allowing the `use_cache` parameter to be `True` meant `past_key_value` would not be used or returned. But somehow `past_key_value` is still used, while `attn_weights` is always `None`. Did you mean to set ... instead of ...?
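For context, the behavior being asked about follows from how such patches use xformers: the memory-efficient kernel never materializes the full attention-probability matrix, so there are no weights to return. A rough sketch of the pattern (not the repo's exact code; shapes and names are illustrative, and running it needs a GPU with xformers installed):

```python
import torch
import xformers.ops as xops

def patched_attention_core(q, k, v):
    # q, k, v: (batch, seq_len, n_heads, head_dim)
    attn_output = xops.memory_efficient_attention(
        q, k, v, attn_bias=xops.LowerTriangularMask()  # causal masking
    )
    # HF attention forwards return (attn_output, attn_weights, past_key_value);
    # the xformers kernel never builds the attention matrix, so attn_weights
    # can only be None, and no KV cache is assembled for training.
    return attn_output, None, None

q = k = v = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.float16)
out, weights, cache = patched_attention_core(q, k, v)
assert weights is None
```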