achen353 / TransformerSum

BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach
GNU General Public License v3.0

Testing on Abstractive Summarizer #1

Closed achen353 closed 2 years ago

achen353 commented 2 years ago

TODO

achen353 commented 2 years ago

@andywang268 Can you dive into how abstractive summarization works? (Not how the data is pre-processed, but, given the ready-to-use dataset, how the training works.)

I also updated the TODO at the top.

andywang268 commented 2 years ago

Longformer

Idea

Longformer is a modified Transformer architecture. Traditional Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this, Longformer uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. The attention patterns include sliding window, dilated sliding window and global attention.
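As a quick sanity check of the linear-attention claim, here is a minimal sketch (assuming the Hugging Face `transformers` and `torch` packages, which are not referenced in this repo) that encodes a sequence far longer than BERT's 512-token limit in a single pass with the pretrained `allenai/longformer-base-4096` checkpoint:

```python
# Minimal sketch: encode a long document with Longformer (assumes
# `transformers` and `torch` are installed; not part of this repo).
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# A document of a few thousand tokens -- well beyond BERT's 512-token limit.
text = " ".join(["This is a long legal document."] * 400)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```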

Sliding Window

Given the importance of local context, the Longformer attention pattern employs a fixed-size window of attention surrounding each token. Stacking multiple layers of such windowed attention produces a large receptive field, where the top layers have access to all input locations and can build representations that incorporate information across the entire input, similar to CNNs. Depending on the application, it can be helpful to use a different window size w for each layer to balance efficiency against model representation capacity.

[Figure: sliding-window attention pattern]
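A toy illustration (my own sketch, not the paper's custom CUDA implementation) of what a sliding-window mask looks like: each token may only attend to positions within w/2 of itself, so the per-token cost is O(w) and the total cost is O(n · w) rather than O(n²).

```python
# Toy sliding-window attention mask: True where attention is allowed.
import torch

def sliding_window_mask(seq_len: int, w: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    # Token i may attend to token j only if |i - j| <= w / 2.
    return (idx[None, :] - idx[:, None]).abs() <= w // 2

print(sliding_window_mask(seq_len=8, w=4).int())
# Each row contains at most w + 1 ones, so the cost per token is O(w).
```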

Dilated Sliding Window

To further increase the receptive field without increasing computation, the sliding window can be “dilated”. This is analogous to dilated CNNs, where the window has gaps of size d (the dilation).

[Figure: dilated sliding-window attention pattern]
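Extending the toy mask above with a dilation d (again only an illustrative sketch): attended positions are spaced d apart, so each token still attends to roughly w positions but can reach up to (w/2) · d tokens away.

```python
# Toy dilated sliding-window mask (illustrative only).
import torch

def dilated_window_mask(seq_len: int, w: int, d: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]
    within_window = offset.abs() <= (w // 2) * d   # reach extended by the dilation
    on_dilation_grid = offset % d == 0             # only every d-th position
    return within_window & on_dilation_grid

print(dilated_window_mask(seq_len=10, w=4, d=2).int())
# With d = 1 this reduces to the plain sliding window.
```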

Global Attention

The windowed and dilated attention are not flexible enough to learn task-specific representations, so “global attention” is added on a few pre-selected input locations. Importantly, this attention operation is symmetric: a token with global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. For example, for classification, global attention is used on the [CLS] token, while in QA it is applied to all question tokens. Since the number of such tokens is small relative to and independent of n, the complexity of the combined local and global attention is still O(n).

[Figure: global + sliding-window attention pattern]
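In the Hugging Face implementation (an assumption on my part that we would use that library rather than the authors' original code), global attention is specified per token through a `global_attention_mask`; a minimal sketch marking only the first ([CLS]-like) token as global:

```python
# Sketch: mark the first token for global attention (Hugging Face API).
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("How does Longformer handle long documents?", return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # first token attends to, and is attended by, every token

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```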

Notes

The authors use small window sizes for the lower layers and increase the window size toward the higher layers. This lets the top layers learn higher-level representations of the entire sequence while the lower layers capture local information, and it balances efficiency against performance (larger windows have richer representation power and often improve results). They also use dilated sliding windows for the higher layers, which lets the model directly attend to distant tokens without sacrificing local context.
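If we go with the Hugging Face `LongformerConfig` (an assumption; it exposes a per-layer `attention_window` list but, as far as I know, not dilation), the “small windows low, large windows high” idea could be sketched like this:

```python
# Sketch: per-layer window sizes growing from lower to higher layers.
from transformers import LongformerConfig, LongformerModel

config = LongformerConfig(
    num_hidden_layers=12,
    attention_window=[32, 32, 64, 64, 128, 128, 256, 256, 512, 512, 512, 512],
)
model = LongformerModel(config)  # randomly initialized, for illustration only
print(model.config.attention_window)
```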

achen353 commented 2 years ago
  • Understand how it works: how the data are transformed throughout the process and how it differs from the Extractive Summarizer (ES)

Pushing this back for now