As the docstring says:
return_attn_probs: bool. Whether to return the attention probabilities. This option is for testing only. The returned probabilities are not guaranteed to be correct (they might not have the right scaling).
Tbh I don't see an easy way to get CoPE to go fast. I originally thought that just speeding up the qk operation would be enough. 😂 The trouble is that CoPE's positions are cumulative sums of sigmoid gates computed from the attention logits themselves, so each query effectively needs its whole row of qk^T before the position embeddings can be added, and that doesn't fit neatly into FlashAttention's blockwise computation.
Dear Sir,
I hope this message finds you well. I am currently working on integrating the Contextual Position Encoding (CoPE) module with the FlashAttention mechanism, and I am reaching out to inquire if there is an existing or recommended method to facilitate this integration seamlessly.
Below is a brief overview of the CoPE module I am attempting to integrate:
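(The module follows the reference pseudocode from the CoPE paper, "Contextual Position Encoding: Learning to Count What's Important"; the sketch below is essentially that paper's version, with names like npos_max taken from the paper.)

```python
import torch
import torch.nn as nn

class CoPE(nn.Module):
    """Contextual Position Encoding: positions are soft counts of
    sigmoid-gated keys, so they depend on the attention logits."""

    def __init__(self, npos_max: int, head_dim: int):
        super().__init__()
        self.npos_max = npos_max
        # one learned embedding per integer position, per head dim
        self.pos_emb = nn.Parameter(torch.zeros(1, head_dim, npos_max))

    def forward(self, query: torch.Tensor, attn_logits: torch.Tensor) -> torch.Tensor:
        # query:       (batch, seq, head_dim)
        # attn_logits: (batch, seq, seq), already causally masked with -inf
        gates = torch.sigmoid(attn_logits)
        # position of key j relative to query i = sum of gates over keys j..i
        # (flip/cumsum/flip is a reverse cumulative sum over the key axis)
        pos = gates.flip(-1).cumsum(dim=-1).flip(-1)
        pos = pos.clamp(max=self.npos_max - 1)
        # interpolate between the two nearest integer position embeddings
        pos_ceil = pos.ceil().long()
        pos_floor = pos.floor().long()
        logits_int = torch.matmul(query, self.pos_emb)  # (batch, seq, npos_max)
        logits_ceil = logits_int.gather(-1, pos_ceil)
        logits_floor = logits_int.gather(-1, pos_floor)
        w = pos - pos_floor
        # the result is added to attn_logits before the softmax
        return logits_ceil * w + logits_floor * (1 - w)
```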
I have found that there is a parameter called return_attn_probs=True, but the values do not match. My reference implementation prints

Attention probabilities: tensor([[[7.7782]]], device='cuda:0')

whereas FlashAttention returns

tensor([[[[7.7773]]]], device='cuda:0', dtype=torch.float16)
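For reference, this is roughly how I am comparing the two (a minimal sketch, assuming flash_attn_func from the flash_attn package, which returns (out, softmax_lse, S_dmask) when return_attn_probs=True; shapes and values are illustrative):

```python
import math
import torch
from flash_attn import flash_attn_func

torch.manual_seed(0)
b, s, h, d = 1, 1, 1, 16  # tiny illustrative shapes
q = torch.randn(b, s, h, d, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# FlashAttention path. Note: in some versions S_dmask is only populated
# when dropout_p > 0, and it also encodes the dropout pattern by sign.
out, softmax_lse, S_dmask = flash_attn_func(
    q, k, v, dropout_p=0.1, return_attn_probs=True
)

# Reference path in float32: plain softmax(q @ k^T / sqrt(d))
qf = q.float().transpose(1, 2)  # (b, h, s, d)
kf = k.float().transpose(1, 2)
probs = torch.softmax(qf @ kf.transpose(-2, -1) / math.sqrt(d), dim=-1)

print("Reference probabilities:", probs)
print("FlashAttention S_dmask: ", S_dmask)
# Per the docstring, S_dmask is for testing only and may use a different
# scaling, so exact agreement with the reference should not be expected.
```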
I am particularly interested in whether there are any existing utilities or guidelines within the FlashAttention framework that could simplify incorporating CoPE. Additionally, any insights or suggestions on potential challenges or optimizations would be greatly appreciated.
Thank you for your time and assistance. I look forward to your response.