Open void-main opened 2 years ago
Thanks for your interest! Does shifted window attention refer to the attention used in SwinTransformer? I haven't profiled it so I'd first figure out where the bottlenecks are.
You're right, we'd need the bias term in the fwd, and dBias for backward. The dSoftmax is already computed but never stored to HBM as we want to reduce HBM reads/writes for performance.

Hey @tridao, thanks for your reply!
I have a few follow-up questions (sorry for asking so many):
> You're right, we'd need the bias term in the fwd, and dBias for backward. The dSoftmax is already computed but never stored to HBM as we want to reduce HBM reads/writes for performance.
Since the code is a little bit hard to follow (in a good way), could you please give me some hints on how to add the Bias term to the forward pass?
To be exact, I believe that I should add the bias after this line; however, how can I calculate the correct offset into the bias? What are the semantics of Mma_tile_p and Mma_tile_o? Should I use Mma_tile_o here?
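For reference, this is my naive picture of the offset math, assuming the bias is stored as a contiguous (num_heads, seqlen, seqlen) row-major tensor and that BLOCK_M / BLOCK_N are the tile sizes of the QK^T block being processed; the names and layout are my own assumptions, not the kernel's actual interface:

```python
import torch

num_heads, seqlen = 4, 128
BLOCK_M, BLOCK_N = 16, 16   # assumed tile sizes of the QK^T block

def bias_tile_offset(head, m_block, n_block):
    """Flat element offset of the top-left corner of the bias tile covering
    query rows [m_block*BLOCK_M, ...) and key columns [n_block*BLOCK_N, ...)."""
    return (head * seqlen * seqlen        # skip whole heads
            + m_block * BLOCK_M * seqlen  # skip full rows above this tile
            + n_block * BLOCK_N)          # skip columns to the left of this tile

# Quick sanity check against direct 3-D indexing.
bias = torch.arange(num_heads * seqlen * seqlen).reshape(num_heads, seqlen, seqlen)
flat = bias.reshape(-1)
h, mb, nb = 2, 3, 5
assert flat[bias_tile_offset(h, mb, nb)] == bias[h, mb * BLOCK_M, nb * BLOCK_N]
```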
As for the dSoftmax, do you think this would be a good place to copy dSoftmax into dBias?
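To double-check the math behind that (assuming "dSoftmax" means the gradient with respect to the pre-softmax scores), here is a small PyTorch sketch of my own showing that the gradient w.r.t. Bias is exactly that quantity, since Bias only enters as an element-wise addend before the softmax:

```python
import torch

N, d = 8, 16                      # toy sizes: one window, one head
q, k, v = (torch.randn(N, d) for _ in range(3))
bias = torch.randn(N, N, requires_grad=True)

s = q @ k.t() / d**0.5 + bias     # pre-softmax scores S = QK^T/sqrt(dim) + Bias
s.retain_grad()                   # keep dS ("dSoftmax") around for comparison
out = torch.softmax(s, dim=-1) @ v

out.sum().backward()

# dBias equals dS element-wise (it would only need summing over any
# dimensions the bias is broadcast across, e.g. batch / windows).
print(torch.allclose(bias.grad, s.grad))  # True
```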
Also, I vaguely feel that there are recurring patterns in the code (gmem_tile, smem_tile, mma_tile), but it's hard for me to connect the dots. Could you please give me an example of how data flows through these pieces during the QK^T part of the forward pass, so that I can try to understand the code myself?
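To show where my understanding currently stands, here is a much-simplified PyTorch sketch of how I imagine the tiled forward pass with a bias; the mapping in the comments onto gmem_tile / smem_tile / mma_tile is my own guess, not how the repo's classes actually work:

```python
import torch

def tiled_attention_fwd(q, k, v, bias, block=4):
    """Toy single-head Softmax(QK^T/sqrt(dim) + Bias)V, computed one K/V tile
    at a time with online-softmax rescaling (my sketch, not the repo's code)."""
    N, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                    # output accumulator
    row_max = torch.full((N, 1), float("-inf"))  # running row max of the scores
    row_sum = torch.zeros(N, 1)                  # running softmax denominator
    for start in range(0, N, block):             # loop over K/V tiles ("gmem" loop)
        kb = k[start:start + block]              # load K tile (think gmem -> smem)
        vb = v[start:start + block]              # load V tile
        s = q @ kb.t() * scale                   # "mma": partial QK^T for this tile
        s = s + bias[:, start:start + block]     # add the matching bias tile here
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - new_max)               # unnormalized probabilities
        rescale = torch.exp(row_max - new_max)   # fix up previous tiles' scale
        row_sum = row_sum * rescale + p.sum(dim=-1, keepdim=True)
        out = out * rescale + p @ vb             # "mma": accumulate P @ V
        row_max = new_max
    return out / row_sum

# Sanity check against the naive formula.
N, d = 8, 16
q, k, v = (torch.randn(N, d) for _ in range(3))
bias = torch.randn(N, N)
ref = torch.softmax(q @ k.t() / d**0.5 + bias, dim=-1) @ v
print(torch.allclose(tiled_attention_fwd(q, k, v, bias), ref, atol=1e-5))  # True
```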
Big Thanks!
Hey authors, great repo for speeding up training of attention-based models. I wonder how the code could be ported to support (shifted) WindowAttention?
To my knowledge, the (S)WindowAttention differs from traditional attention in its formula: an additive bias goes inside the softmax, i.e. Softmax(QK^T/sqrt(dim) + Bias)V.
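For concreteness, here is a minimal PyTorch reference of what I mean, with Swin-style shapes (the exact shapes are just my example):

```python
import torch

def window_attention(q, k, v, bias):
    """Softmax(QK^T/sqrt(dim) + Bias)V for one batch of windows.
    q, k, v: (num_windows, num_heads, N, head_dim) with N = window_size**2;
    bias:    (num_heads, N, N) relative position bias, shared across windows."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale + bias, dim=-1)
    return attn @ v

q = k = v = torch.randn(2, 4, 49, 32)   # 2 windows, 4 heads, 7x7 window, head dim 32
bias = torch.randn(4, 49, 49)
out = window_attention(q, k, v, bias)   # -> (2, 4, 49, 32)
```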
According to this difference, here are the pieces of code that I found should change:
- reuse dSoftmax as dBias for the bwd pass; besides, we need the corresponding iterator / gmem, smem loaders;
- the supported dim sizes are currently 16, 32, 64, 128; could we extend this to other dim sizes?

Could you please offer some guidance on how to port to WindowAttention? Thanks.