SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022
MIT License

CUDA out of memory #29

Closed zhangzheng2242 closed 2 years ago

zhangzheng2242 commented 2 years ago

Your work is very good, and we have improved our transformer model based on your ideas. But why do we get CUDA out of memory at the same batch_size? In theory, the computation should be reduced and the batch_size should be able to be set larger.

alihassanijr commented 2 years ago

Hello, and thank you for your interest. Could you provide more details? As in: what the task is, what the batch size was, and what changed. Technically, NA's memory usage (with the latest CUDA extension) should even be slightly lower than that of a Window Self-Attention block. But if you replaced some other form of attention with NA, it may be different depending on the implementation.

As far as memory goes, NA itself is quite memory-efficient. It doesn't compute any intermediary tensors other than the attention weights.

Also, please keep in mind that theoretical computation won't always align with memory usage; there's actually no guarantee there. Keep in mind as well that, for instance, deeper models tend to use up a lot more memory when training (for a bunch of reasons, such as the increased context and gradient accumulation), but they will end up using less memory at inference.
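If you want to check this concretely, a minimal sketch along these lines (plain PyTorch; `attn_block` and `x` are placeholders for your own module and a sample batch) will show the peak memory of one training step at a given batch size:

```python
import torch

def peak_memory_mb(attn_block, x):
    # Peak CUDA memory (in MB) of one forward/backward pass.
    # `attn_block` and `x` are placeholders for your own module and a sample batch.
    attn_block = attn_block.cuda()
    x = x.cuda().requires_grad_(True)
    torch.cuda.reset_peak_memory_stats()
    out = attn_block(x)
    out.sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20
```

Comparing this number between your previous attention block and the NA-based one, at the same batch size, should tell you whether the OOM really comes from the attention itself.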

I hope this clarifies things a bit, but if it doesn't, feel free to continue the discussion.

zhangzheng2242 commented 2 years ago

Hello, thank you for your reply. I want to use your theory to solve a 1D neighborhood attention problem. For example, suppose the original QKV has dimensions (B=10, head_num=1, token_num=15, token_dim=128). The 15 tokens have a one-dimensional relationship, and we want to compute attention with window_size=5 (q[0] attends to v[0,1,2,3,4]; q[1] to v[0,1,2,3,4]; q[2] to v[0,1,2,3,4]; q[3] to v[1,2,3,4,5]; and so on). We generate the corresponding attn_index of shape (15, 5):

    tensor([[ 0,  1,  2,  3,  4],
            [ 0,  1,  2,  3,  4],
            [ 0,  1,  2,  3,  4],
            [ 1,  2,  3,  4,  5],
            [ 2,  3,  4,  5,  6],
            [ 3,  4,  5,  6,  7],
            [ 4,  5,  6,  7,  8],
            [ 5,  6,  7,  8,  9],
            [ 6,  7,  8,  9, 10],
            [ 7,  8,  9, 10, 11],
            [ 8,  9, 10, 11, 12],
            [ 9, 10, 11, 12, 13],
            [10, 11, 12, 13, 14],
            [10, 11, 12, 13, 14],
            [10, 11, 12, 13, 14]])

Then I modified the original Transformer attention as:

    attn = q.unsqueeze(3) @ k[:, :, attn_index].transpose(-2, -1)  # attn = q k^T
    attn = attn * self.rescale
    attn = attn.softmax(dim=-1)
    x = (attn @ v[:, :, attn_index]).squeeze(3)

I don't know if there is a problem with the code and the theory I changed, especially with respect to the indices.
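To make the indexing concrete, here is a self-contained sketch of the same gather-based approach. The clamped index construction is just one way to reproduce the table above, and `rescale` stands in for the usual 1/sqrt(dim) scale; this is a sketch of my modification, not the library's kernel:

```python
import torch

B, heads, L, D = 10, 1, 15, 128
window = 5
rescale = D ** -0.5  # assumed scale; in my module this is self.rescale

q = torch.randn(B, heads, L, D)
k = torch.randn(B, heads, L, D)
v = torch.randn(B, heads, L, D)

# Neighborhood index of shape (L, window): each token attends to a window of 5
# neighbors centered on it, shifted inward at the borders (matches the table above).
start = (torch.arange(L) - window // 2).clamp(min=0, max=L - window)
attn_index = start.unsqueeze(1) + torch.arange(window).unsqueeze(0)  # (15, 5)

# Gather each query's neighborhood of keys/values and attend within it.
k_nb = k[:, :, attn_index]                      # (B, heads, L, window, D)
v_nb = v[:, :, attn_index]                      # (B, heads, L, window, D)
attn = q.unsqueeze(3) @ k_nb.transpose(-2, -1)  # (B, heads, L, 1, window)
attn = (attn * rescale).softmax(dim=-1)
x = (attn @ v_nb).squeeze(3)                    # (B, heads, L, D)
print(x.shape)  # torch.Size([10, 1, 15, 128])
```

Note that gathering k and v like this materializes (B, heads, L, window, D) tensors, so it uses more memory than a fused kernel would.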

alihassanijr commented 2 years ago

Hi, I'm not sure I fully understand, but the indices you describe appear to be with respect to a 1D neighborhood, not 2D. For that, you can use NeighborhoodAttention1d from natten, which we just added. This class works on 1D data (Batch, Heads, Length, Dim) and would probably be more suitable for your case, if I understand it correctly.
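A rough usage sketch is below. The argument names (dim, num_heads, kernel_size) and the (batch, length, channels) input layout are assumptions that may not match the natten version you have (the class name and expected layout have changed across releases), so please check the class definition before copying this:

```python
import torch
from natten import NeighborhoodAttention1d  # spelled NeighborhoodAttention1D in later natten releases

# Assumed constructor arguments; verify against the class definition in your natten version.
na1d = NeighborhoodAttention1d(dim=128, num_heads=1, kernel_size=5)

# Assumed (batch, length, channels) token layout; the module projects Q/K/V internally,
# so you pass token embeddings rather than separate q, k, v tensors.
x = torch.randn(10, 15, 128)
out = na1d(x)  # same shape as the input
```

The border behavior matches your table: queries near the edges attend to a window shifted inward rather than a zero-padded one.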

zhangzheng2242 commented 2 years ago

Thank you very much for your help!

zhangzheng2242 commented 2 years ago

Hello, I have a problem: when I run 'python3 natten/gradcheck1d.py # 1D NA', there is no error but also no response (screenshot attached), while 'python3 natten/gradcheck.py' runs successfully (screenshot attached). I'm not sure if there is something wrong with the 1D CUDA extension.

alihassanijr commented 2 years ago

Well, you would have to wait for the 1D extension to compile as well; if you're using ninja, it will start compiling when gradcheck1d.py is called. How long have you let gradcheck1d.py run? Have you noticed any CUDA processes starting?

zhangzheng2242 commented 2 years ago

Thank you, I have found the solution. There is no problem with your code; very good work!
