Open lhl2017 opened 3 years ago
Thanks for your questions and helpful diagram. To achieve the same result with the loop and chunk implementations, we can apply an attention mask to the smaller chunks. The details are buried somewhere in the code.
`loop` and `sliding_chunks` are two different implementations of the same thing, so their accuracy is exactly the same, but `loop` is slow while `sliding_chunks` is much faster.

Your diagram is correct. After the green, red, yellow, and blue areas are computed, the areas outside the gray are removed, which makes it equivalent to `loop`. The relevant part of the code is here: https://github.com/allenai/longformer/blob/master/longformer/sliding_chunks.py#L73-L79. Green, red, yellow, and blue are stored in `diagonal_chunk_attn`, and gray is `diagonal_attn`.
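To make the equivalence concrete, here is a toy sketch (my own illustration, not the repo's optimized implementation) that builds the banded "loop" pattern and an overlapping-chunk pattern over a full score matrix, then checks that masking the chunk scores back down to the band recovers the banded scores. The sizes (`seq_len = 12`, `w = 2`, chunks of size `2 * w` with stride `w`) are assumptions chosen only for illustration.

```python
import torch

seq_len, w = 12, 2                       # toy sizes, chosen only for illustration
scores = torch.randn(seq_len, seq_len)   # stand-in for the full q @ k^T scores

# "loop" pattern: each token attends to the w tokens on each side (the gray band)
idx = torch.arange(seq_len)
band_mask = (idx[None, :] - idx[:, None]).abs() <= w

# chunk pattern: overlapping chunks of size 2w with stride w
# (the green/red/yellow/blue blocks in the diagram)
chunk_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
for start in range(0, seq_len - w, w):
    end = min(start + 2 * w, seq_len)
    chunk_mask[start:end, start:end] = True

banded_scores = scores.masked_fill(~band_mask, 0.0)
chunked_scores = scores.masked_fill(~chunk_mask, 0.0)

# The chunk blocks cover extra cells outside the band; masking those away
# recovers exactly the banded scores, so the two implementations agree.
recovered = chunked_scores.masked_fill(~band_mask, 0.0)
print(torch.equal(recovered, banded_scores))  # True
```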
`sliding_chunks_no_overlap` accuracy is close to `loop` and `sliding_chunks` but not exactly the same.
Thanks for your quick reply, @matt-peters, @ibeltagy.
I understand the implementations of `loop` and `sliding_chunks` from your comments now. I have some additional questions about `sliding_chunks_no_overlap`, which your code comments say is faster than `sliding_chunks`, though it is not mentioned in the paper.
Does `sliding_chunks_no_overlap` have lower accuracy than `loop`/`sliding_chunks` because it includes the results from the area outside the gray? Is it meant only for inference, or also for fine-tuning/pre-training?
You said it is not exactly the same as `loop`/`sliding_chunks`. As far as I understand, it computes attention over chunks without removing the area outside the gray. The approach is almost the same as BigBird, except for the rolling, since it uses zero-padding and stacking instead. In the end, it may have lower accuracy than `loop`/`sliding_chunks` because it includes the results of that extra area. Is my understanding correct? If I have misunderstood, please let me know.
Thanks a lot!
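To make the pattern I am describing concrete, here is a toy sketch (my own reading, not the repo's code), assuming non-overlapping chunks of size `c = 2 * w` where each chunk attends to itself and its immediate neighbours. It shows that such a pattern covers the gray band plus extra cells that are never masked away, so it is close to but not identical to `loop`/`sliding_chunks`.

```python
import torch

seq_len, w = 12, 2          # toy sizes, chosen only for illustration
c = 2 * w                   # assumed chunk size for the no-overlap pattern
idx = torch.arange(seq_len)

# exact sliding-window ("loop"/"sliding_chunks") pattern
band_mask = (idx[None, :] - idx[:, None]).abs() <= w

# no-overlap pattern: a token in chunk b attends to chunks b-1, b, and b+1
chunk_id = idx // c
no_overlap_mask = (chunk_id[None, :] - chunk_id[:, None]).abs() <= 1

print((band_mask & ~no_overlap_mask).any())  # tensor(False): the band is covered
print((no_overlap_mask & ~band_mask).sum())  # > 0: extra cells that are not removed
```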
Each token attends to w tokens on each side. For a chunk of size 6, shouldn't the gray occupy 7 cells instead of 3?
And the no-overlap version should attend to fewer tokens. For the second chunk, if there is no overlap, then its first tokens can only attend to the other w tokens. I think this is the difference?
Dear authors,
I have a question about the sparse attention you implemented. Does Longformer-loop have the same accuracy as Longformer-chunk? Also, does the `sliding_chunks_no_overlap` version implemented in the latest code have the same accuracy as Longformer-loop?
In the latest version of the paper, you note that Longformer-loop is a naive implementation that computes each diagonal separately in a loop. As I understand it, each token only attends to the tokens within its window. This method is memory efficient, but, as you said, it is inefficient on GPU/TPU because of the many gather and scatter operations.
So you implemented Longformer-chunk, which is efficient on GPU/TPU. However, the Longformer-chunk method computes attention between chunked tokens, which is different from Longformer-loop: because of the blocking (chunking), some tokens compute attention with more than window-size tokens.
In the following picture, you can see the differences easily.
Gray is Longformer-loop; green, red, yellow, and blue are Longformer-chunk. If I misunderstood your paper, please correct me. I'd appreciate your help.
Thanks!
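For reference, here is a hypothetical sketch of the "loop" idea described above, computing the banded q·k scores one diagonal at a time so that each token only interacts with the w tokens on each side. It is not the repo's code; the function name and sizes are my own, for illustration.

```python
import torch

def sliding_window_scores_by_diagonal(q, k, w):
    """q, k: (seq_len, head_dim). Returns (seq_len, 2*w + 1) banded scores,
    where column d + w holds the score between position i and position i + d;
    cells that fall outside the sequence are left at -inf."""
    seq_len, head_dim = q.shape
    scores = q.new_full((seq_len, 2 * w + 1), float("-inf"))
    for d in range(-w, w + 1):                        # one diagonal per iteration
        lo, hi = max(0, -d), min(seq_len, seq_len - d)
        scores[lo:hi, d + w] = (q[lo:hi] * k[lo + d:hi + d]).sum(-1) / head_dim ** 0.5
    return scores

q, k = torch.randn(12, 8), torch.randn(12, 8)
print(sliding_window_scores_by_diagonal(q, k, w=2).shape)  # torch.Size([12, 5])
```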
Hi, is there any code for the Longformer-loop?