NormXU / ERNIE-Layout-Pytorch

An unofficial Pytorch implementation of ERNIE-Layout which is originally released through PaddleNLP.
http://arxiv.org/abs/2210.06155
MIT License

long input - larger than 510 tokens #17

Closed DiddyC closed 1 year ago

DiddyC commented 1 year ago

Hi,

Does anyone have experience with running the model when there's more than 510 tokens? Is the best way to chunk the text and then run it twice with the same questions (perhaps with a stride)?

Also, any idea how to run it with multiple questions at once?

NormXU commented 1 year ago

Hi DiddyC,

Here are some ideas for increasing the maximum sequence length, which may help with tasks that require longer inputs:

  1. Let's say we want to increase the max sequence length to 1024, twice the original pre-trained sequence length. A simple and effective approach is to duplicate the position embedding values from the range 0-511 into the range 512-1023. This way, the second set of 512 position embeddings shares the same weights as the first set.
  2. Use RoPE (I actually tried this on a token classification task, but it greatly harmed performance) or ALiBi; see the related issue.
  3. Sliding window. Here is an example: suppose our training sequence length is 7 and our prediction length is 8. We can shift the available window across the predicted sequence, as shown in the attached image.
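
A minimal sketch of idea 1, duplicating the pre-trained position embeddings to cover a longer sequence. The function name and the standalone `nn.Embedding` are illustrative assumptions; the actual attribute path inside ERNIE-Layout differs.

```python
import torch
import torch.nn as nn

def extend_position_embedding(old_emb: nn.Embedding, new_max_len: int) -> nn.Embedding:
    """Tile pre-trained position embeddings to cover `new_max_len` positions.

    Positions 512-1023 reuse the weights of positions 0-511, and so on.
    (`extend_position_embedding` is a hypothetical helper, not from the repo.)
    """
    old_max_len, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_max_len, dim)
    with torch.no_grad():
        for start in range(0, new_max_len, old_max_len):
            end = min(start + old_max_len, new_max_len)
            new_emb.weight[start:end] = old_emb.weight[: end - start]
    return new_emb

# e.g. extend a 512-position table to 1024 positions
old = nn.Embedding(512, 768)
new = extend_position_embedding(old, 1024)
```

Fine-tuning for a few steps after this surgery usually helps the model adapt to the repeated positions.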

Hope these ideas can help
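
The sliding-window idea in point 3 (and the chunk-with-stride approach from the original question) can be sketched as follows; the window and stride sizes here are illustrative, not values from the repo:

```python
def sliding_windows(token_ids, window_size, stride):
    """Split a long token sequence into overlapping windows.

    Consecutive windows overlap by `window_size - stride` tokens, so answers
    near a chunk boundary still appear in full within at least one window.
    """
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows

# 10 tokens, window of 7, stride of 4 -> two overlapping chunks
chunks = sliding_windows(list(range(10)), window_size=7, stride=4)
```

Each window is then run through the model with the same question, and the predictions are merged (e.g. by taking the highest-scoring span across windows).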

NormXU commented 1 year ago

Hope this commit can help.