zhangxin81 opened 6 months ago:
They share some common optimization ideas, such as fusing the multi-head attention kernel and quantizing the model to int8 or fp8.
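For readers less familiar with the terminology, here is a minimal plain-PyTorch sketch (not TensorRT-LLM plugin code) of what "fusing the multi-head attention kernel" means: the naive formulation launches several separate kernels and materializes the full attention-score matrix, while a single fused call such as `torch.nn.functional.scaled_dot_product_attention` runs the whole computation in one kernel (a FlashAttention-style kernel when the hardware and dtype allow). The tensor shapes below are arbitrary example values.

```python
import torch
import torch.nn.functional as F

# Arbitrary example shapes: (batch, heads, sequence length, head dimension).
batch, heads, seq_len, head_dim = 2, 8, 128, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

q = torch.randn(batch, heads, seq_len, head_dim, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Naive attention: separate matmul, softmax, and matmul kernels, and the full
# (seq_len x seq_len) score matrix is materialized in memory.
scores = q @ k.transpose(-2, -1) / head_dim**0.5
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: one call that dispatches to an optimized kernel
# (e.g. a FlashAttention-style kernel on supported GPUs and dtypes).
fused_out = F.scaled_dot_product_attention(q, k, v)

# Both paths compute the same result up to numerical tolerance.
print(torch.allclose(naive_out, fused_out, atol=1e-4))
```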
Hi @byshiue, a related question: does BertAttentionPlugin also use the FlashAttention2 kernel that GptAttention uses?
Yes.
Is there any feature related to GPT-like models that can be applied to BERT-like models?