bulaikexiansheng opened this issue 3 months ago
Hi, I would like to ask why the attention mask is not used in the prefill stage. I want to output the attention score matrix in the prefill stage. Is the code below right?
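The snippet the question refers to is not included in the thread. Purely as an illustration (the tensor names, shapes, and the helper function here are hypothetical, not taken from the original code), a minimal PyTorch sketch of computing the prefill attention score matrix with an explicit causal mask might look like this:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, num_heads, seq_len, head_dim).
def prefill_attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    seq_len = q.shape[-2]
    # Scaled dot-product scores: (B, H, L, L).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    # Explicit causal mask: position i may only attend to positions <= i.
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1)
```

Note that this materializes a full (L, L) matrix per head, which is exactly the memory cost the reply below warns about for long sequences.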
Hello,

We use the flash attention function, which already applies a causal mask during the prefill phase. Note that it is easy to run out of memory (OOM) if you try to compute the attention matrix directly for long sequences.
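To illustrate the fused path this reply describes, here is a minimal sketch using PyTorch's scaled_dot_product_attention as a stand-in (the exact flash attention call in the repo is not shown in this thread, so the shapes and names below are assumptions): with is_causal=True, the causal mask is applied inside the kernel and the (L, L) score matrix is never materialized.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

B, H, L, D = 1, 8, 4096, 64  # hypothetical prefill shapes
q, k, v = (torch.randn(B, H, L, D, device=device, dtype=dtype) for _ in range(3))

# is_causal=True applies the causal mask inside the fused kernel; no explicit
# (L, L) mask or score matrix is ever built, which avoids the OOM issue.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

If the score matrix itself is needed for inspection, computing it one row-chunk of queries at a time keeps peak memory bounded instead of allocating the full (L, L) matrix at once.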