allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

[BERT] Implementation of the sliding window for long sequences #3921

Closed wasiahmad closed 4 years ago

wasiahmad commented 4 years ago

I was trying to find the references where the sliding window is implemented to process long sequences. How do we split a long sequence, and after getting the embeddings, how do we unpack them? Is it possible to describe the main trick? I am trying to implement it in plain PyTorch, but I am unable to implement it in batches without running any loops.

Any help would be appreciated.

bratao commented 4 years ago

I did something for ELMo, but take it with a grain of salt because I did not do a literature review of the techniques.

In the AllenNLP trainer, there is an argument called num_gradient_accumulation_steps. Its description is: "Gradients are accumulated for the given number of steps before doing an optimizer step. This can be useful to accommodate batches that are larger than the RAM size. Refer to Thomas Wolf's post for details on gradient accumulation."

I use a similar concept: I split the sequence into pieces of size X and train the network without resetting the gradients between pieces of the same sequence.

This solves the problem of doing sequence learning on a long sequence. For getting the BERT representation, you would probably have to implement the sliding window yourself.
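
A minimal sketch of that idea in plain PyTorch (this is not the AllenNLP implementation); `model` is a placeholder module that maps a `(1, chunk_len)` tensor of token ids to a scalar loss, and the chunk size is arbitrary:

```python
import torch

def train_on_long_sequence(model, optimizer, token_ids, chunk_size=512):
    # token_ids: (1, seq_len) ids for a single long sequence
    chunks = torch.split(token_ids, chunk_size, dim=1)  # pieces of the same sequence
    optimizer.zero_grad()
    for chunk in chunks:
        loss = model(chunk) / len(chunks)  # scale so the update matches an unsplit pass
        loss.backward()                    # gradients accumulate across the pieces
    optimizer.step()                       # a single update for the whole sequence
```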

wasiahmad commented 4 years ago

Actually, in my use case, I need to take the BERT embeddings and run them through several other layers. This is not the same as taking the contextual embedding of each element in an input sequence and just passing it through a linear classifier to predict something (e.g., as in SQuAD). So I really need to understand the technique used in AllenNLP so that I can do something similar in plain PyTorch.

ZhaofengWu commented 4 years ago

We have something in PretrainedTransformerIndexer that splits long sequences into non-overlapping segments: https://github.com/allenai/allennlp/blob/a88c3f86d7929e2fe0bba225232177d720246bc5/allennlp/data/token_indexers/pretrained_transformer_indexer.py#L34-L38

We do not have an implementation of sliding windows yet.
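
For reference, here is a rough sketch of the non-overlapping approach in plain PyTorch (not the AllenNLP code), assuming a Hugging Face-style BERT whose forward pass returns `last_hidden_state`; the real indexer also inserts special tokens into each segment, and an attention mask for the padding is omitted to keep the sketch short:

```python
import torch
import torch.nn.functional as F

def embed_long_sequence(bert, token_ids, max_len=512, pad_id=0):
    # token_ids: (seq_len,) word-piece ids for one long sequence
    segments = list(torch.split(token_ids, max_len))
    lengths = [len(s) for s in segments]
    # pad the last segment so all segments stack into a single batch
    segments[-1] = F.pad(segments[-1], (0, max_len - lengths[-1]), value=pad_id)
    batch = torch.stack(segments)                   # (num_segments, max_len)
    embeddings = bert(batch).last_hidden_state      # (num_segments, max_len, hidden)
    # drop the padding and stitch the segments back into one sequence
    pieces = [emb[:length] for emb, length in zip(embeddings, lengths)]
    return torch.cat(pieces, dim=0)                 # (seq_len, hidden)
```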

dirkgr commented 4 years ago

There are a lot of aspects to your question.

If you just want to implement a sliding window without using loops, there are ways to do that using torch.gather() or indexing into a tensor with another tensor. You could even push creating the sliding windows into your reader, so you don't have to worry about doing this in the model at all. Or you could just use loops. It's not the end of the world.
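
For instance, here is a short sketch of loop-free windowing by indexing with a precomputed index tensor; the window size and stride are arbitrary, and the tail that does not fill a full window is ignored:

```python
import torch

def sliding_windows(token_ids, window_size=512, stride=256):
    # token_ids: (seq_len,) ids for one long sequence, seq_len >= window_size
    seq_len = token_ids.size(0)
    starts = torch.arange(0, seq_len - window_size + 1, stride)   # (num_windows,)
    index = starts.unsqueeze(1) + torch.arange(window_size)       # (num_windows, window_size)
    return token_ids[index], index   # the windows and the positions they were taken from

# Tensor.unfold gives the same windows in a single call:
# windows = token_ids.unfold(0, window_size, stride)
```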

I am wondering, though, why you want to do it. Since you still have to store the activations for all windows in memory, you won't save any GPU memory doing this. The only thing you gain is the ability to process sequences longer than 512 word pieces, as long as you have enough memory.

If you still want to do it, you could do what @ZhaofengWu suggested and use non-overlapping windows. If you want to overlap the windows, and you're wondering what to do about word pieces that are in multiple windows, I would start by just adding them and seeing how that goes.
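
A sketch of that last suggestion, reusing the hypothetical `index` tensor from the previous sketch: scatter each window's embeddings back to the positions it covers and sum wherever windows overlap (dividing by per-position counts would give an average instead):

```python
import torch

def combine_windows(window_embeddings, index, seq_len):
    # window_embeddings: (num_windows, window_size, hidden) from the transformer
    # index:             (num_windows, window_size) original position of each word piece
    hidden = window_embeddings.size(-1)
    combined = window_embeddings.new_zeros(seq_len, hidden)
    combined.index_add_(0, index.reshape(-1), window_embeddings.reshape(-1, hidden))
    return combined
```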

dirkgr commented 4 years ago

I'm closing this for lack of activity, but feel free to open another issue if you have more questions.