SHI-Labs / NATTEN

Neighborhood Attention Extension. Bringing attention to a neighborhood near you!
https://shi-labs.com/natten/

Positional encodings used with fused NA in NAT models #173

Closed: guillembraso closed this issue 3 hours ago

guillembraso commented 3 hours ago

Hi!

First of all, thanks a lot for this work. I have been using it extensively for a while and I believe it to be an incredible contribution. I am very grateful for it.

I have a question regarding the new fused NA kernels. As far as I understand, they do not support relative position encodings, which are used by default in NAT.

I wanted to ask whether you have any suggestions on what to use as positional encodings when fused NA is enabled. For instance, what did you use for the fused columns in Tables 5 and 6 of your NeurIPS paper? Those tables seem to suggest that you obtain equal performance with all three implementations (naive, GEMM, and fused). Or am I missing something?

Thanks again,

Guillem

alihassanijr commented 3 hours ago

Thank you for your interest; we're very happy you've found it useful.

You're correct: Fused Neighborhood Attention does not support relative positional bias in the backward pass. That means you can't train with relative positional biases, but you can run inference with them, which is what we presented in the paper.

However, the good news is that rotary position embeddings (RoPE) are a great replacement for relative positional biases in these models. We have yet to release a preprint of our study on this, but in general, RoPE can be faster in both training and inference, and scales much better than RPB in terms of performance.
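(Editor's note, not part of the original reply: as a rough illustrative sketch, one common way to pair RoPE with a fused attention kernel is to rotate the queries and keys before the attention call, so the kernel itself carries no bias term. The axial 2D split and all names below, such as `apply_2d_rope`, are assumptions made for illustration only and are not NATTEN's API or the authors' unreleased method.)

```python
# Minimal sketch (assumption, not NATTEN's API): axial 2D RoPE applied to
# queries and keys before a fused attention call. The head dimension is split
# in half; one half is rotated by the row index, the other by the column index.
import torch


def rope_freqs(dim: int, length: int, base: float = 10000.0) -> torch.Tensor:
    """Per-position rotation angles of shape (length, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    pos = torch.arange(length).float()
    return torch.outer(pos, inv_freq)


def rotate(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x by the given angles (broadcast over batch/heads)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def apply_2d_rope(q: torch.Tensor, k: torch.Tensor):
    """q, k: (batch, heads, H, W, head_dim). Rotate half the channels by row, half by column."""
    _, _, H, W, D = q.shape
    half = D // 2
    ang_h = rope_freqs(half, H).view(1, 1, H, 1, half // 2)  # row angles
    ang_w = rope_freqs(half, W).view(1, 1, 1, W, half // 2)  # column angles

    def _apply(x):
        xh, xw = x[..., :half], x[..., half:]
        return torch.cat([rotate(xh, ang_h), rotate(xw, ang_w)], dim=-1)

    return _apply(q), _apply(k)


if __name__ == "__main__":
    q = torch.randn(2, 4, 14, 14, 32)
    k = torch.randn(2, 4, 14, 14, 32)
    q, k = apply_2d_rope(q, k)
    # The rotated q and k (and the untouched v) would then be passed to the
    # fused neighborhood attention op, with the relative positional bias disabled.
    print(q.shape, k.shape)
```

Because the rotation happens entirely outside the kernel, this works with any fused attention implementation and is trainable end to end, which is the property the reply above is pointing to.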

guillembraso commented 3 hours ago

Oh that's great to know, I'll read more about RoPE. Thanks so much for your reply!