Closed · thibautlavril closed this issue 5 months ago
Hi,
`BlockDiagonalCausalWithOffsetPaddedKeysMask` is made for inference (the keys are padded because they live in the KV-cache).
For what you are asking, we don't have a nice way to support it at the moment. I see 2 ways: either create the sub-biases yourself, or modify the `seqstart` offsets to do what you want (e.g. the second bias would have something like `[0, 0, 4, 4]`, i.e. with 2 empty sequences) - but this wasn't tested, so I'm not sure how the kernels react to that...

Thank you very much for your answer!
Yes, I think I am in case 1. I am probably missing something, but I don't know how to generate the sub-biases.
For instance, if the seqlens are [5, 4, 3] and I want to split the attention into 3x3 tiles (for ring attention, for instance), I am not sure how to create a `BlockDiagonalMask` for the tile (i, j).
Do you have a pointer to an example somewhere?
Thank you again!
So you need to create 9 attention biases for the 3x3 split? If the use-case is RingAttention, it might not be efficient (you will have imbalance between your ranks depending on the bias shape: some GPUs will be faster and will wait for the slowest GPU).
So you want to do something like that:

[image: the 3x3 grid of attention tiles, labeled A through I]

For `A`, `E` and `I` it's just a regular `BlockDiagonalCausalMask`.
The `B`, `C` and `F` regions are empty (assuming causality + the same size for the keys/queries sequences).
For `D`, `G` and `H` it's a bit more tricky, because some keys don't attend to anything (which could cause NaNs in the BW pass), or some queries are not attended by anyone (which could cause NaNs in the attention output). It might be easier to slice the keys/queries, as I don't know if we have a bias that would work in that case.
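To make the tiling concrete, here is a plain-Python sketch (the helper names are hypothetical, not an xformers API) that, for packed sequences with seqlens `[5, 4, 3]` split into 3 contiguous chunks of the 12 packed tokens, computes the per-sequence query/key lengths inside a tile (i, j). Zero-length entries are exactly the empty-sequence cases discussed above:

```python
def chunk_bounds(total, n_chunks):
    """Split [0, total) into n_chunks contiguous, near-equal ranges."""
    base, rem = divmod(total, n_chunks)
    bounds, start = [], 0
    for c in range(n_chunks):
        size = base + (1 if c < rem else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def tile_seqlens(seqlens, i, j, n_chunks):
    """Per-sequence (q_len, kv_len) inside tile (i, j) of the packed tokens.

    Queries come from chunk i, keys from chunk j. A pair like (3, 0) means
    this sequence has queries in the tile but no keys to attend to -- one of
    the tricky cases mentioned above.
    """
    total = sum(seqlens)
    bounds = chunk_bounds(total, n_chunks)
    qs, qe = bounds[i]
    ks, ke = bounds[j]
    out, start = [], 0
    for length in seqlens:
        end = start + length
        q_len = max(0, min(end, qe) - max(start, qs))   # overlap with query chunk
        kv_len = max(0, min(end, ke) - max(start, ks))  # overlap with key chunk
        out.append((q_len, kv_len))
        start = end
    return out

# seqlens [5, 4, 3] -> 12 tokens, chunks [0,4), [4,8), [8,12).
# Tile (1, 0): queries from tokens 4..8, keys from tokens 0..4.
print(tile_seqlens([5, 4, 3], 1, 0, 3))  # [(1, 4), (3, 0), (0, 0)]
```

Taking the cumulative sum of one side's per-sequence lengths gives `seqstart`-style offsets, with repeated values for the empty sequences (like the `[0, 0, 4, 4]` example earlier in the thread) - but again, untested against the actual kernels.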
Ok, got it - I will do custom handling for the moment!
Out of curiosity, what approach will be taken for masking in ring attention in xformers?
(Feel free to close the issue.)
Thanks again for your answers!
> Out of curiosity, what approach will be taken for masking in ring attention in xformers?
We don't plan to support document masking in the first version (not released yet), just regular (causal) attention.
❓ Questions and Help
Hello!
I wanted to know if it is possible to "split" a `BlockDiagonalCausalMask` (or to create the 3 equivalent `AttentionBias` objects) with this pattern:

[image: the desired 3-way masking pattern]
(the goal is to split the queries into q1, q2, q3 and keep the same kv.) I had a look at `BlockDiagonalCausalWithOffsetPaddedKeysMask`, but I am not sure it is made for this purpose. Thank you very much!
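The "split the queries, keep the same kv" pattern asked about here can be sketched with dense boolean masks (plain NumPy, not the xformers bias classes - a sketch, not how the library represents these biases): splitting the queries into q1, q2, q3 is just a row-wise slice of the full block-diagonal causal mask.

```python
import numpy as np

def block_diag_causal(seqlens):
    """Dense boolean version of a block-diagonal causal mask (True = attend)."""
    total = sum(seqlens)
    m = np.zeros((total, total), dtype=bool)
    start = 0
    for length in seqlens:
        # Lower-triangular (causal) block for this sequence.
        m[start:start + length, start:start + length] = np.tril(
            np.ones((length, length), dtype=bool)
        )
        start += length
    return m

# Splitting the queries into 3 chunks while keeping all keys is a
# row-wise slice of the full mask:
full = block_diag_causal([5, 4, 3])
chunks = np.array_split(full, 3, axis=0)  # q1, q2, q3 vs. the same kv
assert np.array_equal(np.concatenate(chunks, axis=0), full)
```

Each chunk is a rectangular (queries x all keys) bias; the hard part, as discussed above, is expressing such rectangular slices with the library's structured bias classes rather than dense masks.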