Does anyone have experience with running the model when there's more than 510 tokens?
Is the best way to chunk the text and then run it twice with the same questions (perhaps with a stride)?
Also, any idea how to run it with multiple questions at once?
Here are some ideas for increasing the sequence length, which may help with tasks that require longer inputs.
Let's say we want to increase the max sequence length to 1024, twice the original pre-trained sequence length. Duplicating the position embedding values from the range 0-511 into the range 512-1023 can be a simple and effective approach. This way, the second set of 512 position embeddings shares the same weight values as the first set.
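A minimal sketch of that duplication trick in PyTorch. It operates on a plain `nn.Embedding` standing in for the pre-trained position embedding table; in a real model the attribute path will differ (for example, Hugging Face's `BertModel` keeps it at `model.embeddings.position_embeddings`, and you would also need to update `config.max_position_embeddings`), so treat those names as assumptions:

```python
import torch

def extend_position_embeddings(old_emb: torch.nn.Embedding,
                               new_max_len: int) -> torch.nn.Embedding:
    """Build a larger position embedding table by tiling the old weights.

    Positions old_len..2*old_len-1 reuse the weights of positions 0..old_len-1,
    and so on, so the new table needs no training to produce sane values.
    """
    old_len, dim = old_emb.weight.shape
    new_emb = torch.nn.Embedding(new_max_len, dim)
    with torch.no_grad():
        # Copy the original weights block by block until the new table is full.
        for start in range(0, new_max_len, old_len):
            end = min(start + old_len, new_max_len)
            new_emb.weight[start:end] = old_emb.weight[: end - start]
    return new_emb

# Toy stand-in for BERT's 512 learned position embeddings:
old = torch.nn.Embedding(512, 768)
new = extend_position_embeddings(old, 1024)  # positions 512-1023 mirror 0-511
```

After swapping the table in, a short fine-tuning run on long inputs usually helps the model adapt to the repeated positions.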
Use RoPE (I actually tried this on a token classification task, but it greatly harmed performance) or ALiBi; related issue here.
Sliding window;
Here is an example. Suppose our training sequence length is 7 and our prediction length is 8; we can slide the available window across the predicted sequence, as shown in this image.
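The sliding-window idea also answers the original question about inputs longer than 510 tokens: split the token sequence into overlapping windows and run the model on each, then keep the highest-scoring answer across windows. A small sketch of the chunking step (pure Python; the `window` and `overlap` values are illustrative, not fixed by any library):

```python
def chunk_with_stride(token_ids, window=510, overlap=128):
    """Split a long token sequence into overlapping windows.

    Each window holds at most `window` tokens, and consecutive windows
    share `overlap` tokens, so an answer span near a chunk boundary still
    appears whole in at least one window.
    """
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks

# Example: 10 tokens, windows of 4 with an overlap of 2.
parts = chunk_with_stride(list(range(10)), window=4, overlap=2)
# parts -> [[0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9]]
```

For QA you would prepend the same question tokens to every chunk (which is why the per-chunk budget is 510 minus the question length) and compare answer scores across chunks.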