Open lhl2017 opened 3 years ago
Thanks for your questions and helpful diagram. To achieve the same result with the loop and chunk implementations, we can apply an attention mask to the smaller chunks. The details are buried somewhere in the code.
`loop` and `sliding_chunks` are two different implementations of the same thing, so their accuracy is exactly the same, but `loop` is slow while `sliding_chunks` is much faster.

Your diagram is correct. After the green, red, yellow, and blue areas are computed, the areas outside the gray are removed, which makes it equivalent to `loop`. The relevant part of the code is here: https://github.com/allenai/longformer/blob/master/longformer/sliding_chunks.py#L73-L79. Green, red, yellow, and blue are stored in `diagonal_chunk_attn`, and gray is `diagonal_attn`.
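To make the equivalence concrete, here is a toy sketch (my own illustration, not the repo's optimized implementation) that builds the banded "loop" pattern and an overlapping-chunk pattern over a full score matrix, then checks that masking the chunk scores back down to the band recovers the banded scores. The sizes (`seq_len = 12`, `w = 2`, chunks of size `2 * w` with stride `w`) are assumptions chosen only for illustration.

```python
import torch

seq_len, w = 12, 2                       # toy sizes, chosen only for illustration
scores = torch.randn(seq_len, seq_len)   # stand-in for the full q @ k^T scores

# "loop" pattern: each token attends to the w tokens on each side (the gray band)
idx = torch.arange(seq_len)
band_mask = (idx[None, :] - idx[:, None]).abs() <= w

# chunk pattern: overlapping chunks of size 2w with stride w
# (the green/red/yellow/blue blocks in the diagram)
chunk_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
for start in range(0, seq_len - w, w):
    end = min(start + 2 * w, seq_len)
    chunk_mask[start:end, start:end] = True

banded_scores = scores.masked_fill(~band_mask, 0.0)
chunked_scores = scores.masked_fill(~chunk_mask, 0.0)

# The chunk blocks cover extra cells outside the band; masking those away
# recovers exactly the banded scores, so the two implementations agree.
recovered = chunked_scores.masked_fill(~band_mask, 0.0)
print(torch.equal(recovered, banded_scores))  # True
```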
`sliding_chunks_no_overlap` accuracy is close to `loop` and `sliding_chunks` but not exactly the same.
Thanks for your quick reply, @matt-peters, @ibeltagy.
I understand the implementations of `loop` and `sliding_chunks` from your comments now. I have some additional questions about `sliding_chunks_no_overlap`, which your code comments say is faster than `sliding_chunks`, though it is not mentioned in the paper.
Does `sliding_chunks_no_overlap` have lower accuracy than `loop`/`sliding_chunks` because it includes the results from the area outside the gray? Is it meant only for inference, or also for fine-tuning/pre-training?
You said it is not exactly the same as `loop`/`sliding_chunks`. As far as I understand, it computes attention over chunks without removing the area outside the gray. The approach is almost the same as BigBird, except for the rolling, since it uses zero-padding and stacking instead. In the end, it may have lower accuracy than `loop`/`sliding_chunks` because it includes the results of that extra area. Is my understanding correct? If I have misunderstood, please let me know.
Thanks a lot!
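To make the pattern I am describing concrete, here is a toy sketch (my own reading, not the repo's code), assuming non-overlapping chunks of size `c = 2 * w` where each chunk attends to itself and its immediate neighbours. It shows that such a pattern covers the gray band plus extra cells that are never masked away, so it is close to but not identical to `loop`/`sliding_chunks`.

```python
import torch

seq_len, w = 12, 2          # toy sizes, chosen only for illustration
c = 2 * w                   # assumed chunk size for the no-overlap pattern
idx = torch.arange(seq_len)

# exact sliding-window ("loop"/"sliding_chunks") pattern
band_mask = (idx[None, :] - idx[:, None]).abs() <= w

# no-overlap pattern: a token in chunk b attends to chunks b-1, b, and b+1
chunk_id = idx // c
no_overlap_mask = (chunk_id[None, :] - chunk_id[:, None]).abs() <= 1

print((band_mask & ~no_overlap_mask).any())  # tensor(False): the band is covered
print((no_overlap_mask & ~band_mask).sum())  # > 0: extra cells that are not removed
```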
Each token attends to w tokens on each side. For a chunk of size 6, shouldn't the gray occupy 7 cells instead of 3?
And the no-overlap version should attend to fewer tokens. For the second chunk, if there is no overlap, then its first tokens can only attend to the other w tokens. I think this is the difference?
Dear authors,
I have a question about the sparse attention you implemented. Does Longformer-loop have the same accuracy as Longformer-chunk? Also, does the `sliding_chunks_no_overlap` version implemented in the latest code have the same accuracy as Longformer-loop?
In the latest version of the paper, you note that Longformer-loop is a naive implementation that computes each diagonal separately in a loop. As I understand it, each token only attends to the tokens within its window. This method is memory efficient, but, as you said, it is inefficient on GPU/TPU because of the many gather and scatter operations.
So you implemented Longformer-chunk, which is efficient on GPU/TPU. However, the Longformer-chunk method computes attention between chunked tokens, which is different from Longformer-loop: because of the blocking (chunking), some tokens compute attention with more than window-size tokens.
In the following picture, you can see the differences easily.
Gray is Longformer-loop; green, red, yellow, and blue are Longformer-chunk. If I misunderstood your paper, please correct me. I'd appreciate your help.
Thanks!
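For reference, here is a hypothetical sketch of the "loop" idea described above, computing the banded q·k scores one diagonal at a time so that each token only interacts with the w tokens on each side. It is not the repo's code; the function name and sizes are my own, for illustration.

```python
import torch

def sliding_window_scores_by_diagonal(q, k, w):
    """q, k: (seq_len, head_dim). Returns (seq_len, 2*w + 1) banded scores,
    where column d + w holds the score between position i and position i + d;
    cells that fall outside the sequence are left at -inf."""
    seq_len, head_dim = q.shape
    scores = q.new_full((seq_len, 2 * w + 1), float("-inf"))
    for d in range(-w, w + 1):                        # one diagonal per iteration
        lo, hi = max(0, -d), min(seq_len, seq_len - d)
        scores[lo:hi, d + w] = (q[lo:hi] * k[lo + d:hi + d]).sum(-1) / head_dim ** 0.5
    return scores

q, k = torch.randn(12, 8), torch.randn(12, 8)
print(sliding_window_scores_by_diagonal(q, k, w=2).shape)  # torch.Size([12, 5])
```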
Hi, is there any code for the Longformer-loop?