[Open] mihaidobri opened this issue 4 years ago
Not an author but maybe this will help.
Q1: I'm not convinced the model will scale to 10,000 words per instance. I have a biomedical Longformer (an elongated RoBERTa) that seemed to do OK at max tokens = 8192. Is there some clever preprocessing you can do to chunk the patent instances, e.g., prior-art section, other sections, etc.? If it's only the utility, plant, design, or international labels (i.e., 4 possible labels total), you should easily be able to find a way to learn the class a posteriori without a 10k-token Longformer. Maybe first try the intro section and the 4 labels with a base BertForSequenceClassification (a single dense layer, n_classes = 4) trained with cross-entropy loss, as sketched below. Otherwise, you'll need sufficient data and extensive computation to effectively train such a global attention mechanism. The answer, though, is yes: there should be no limit on the global attention window per se.
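A minimal sketch of that baseline; `intro_text` and `label_id` are hypothetical placeholders for one document's intro section and its 0-3 class index, not anything from your actual data:

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

# Placeholder inputs: swap in a real intro section and class index from your dataset.
intro_text = "A method and apparatus for ..."
label_id = 0

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

# BERT caps out at 512 tokens, so truncate; the intro alone may carry the 4-way label.
inputs = tokenizer(intro_text, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([label_id]))

print(outputs.loss)    # cross-entropy over the 4 classes
print(outputs.logits)  # shape (1, 4)
```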
Q2: No. If I understand correctly, you instantiate a language-modeling class and build a pipeline like pretrain_and_evaluate(...) from the demo notebook. You'll also likely run into issues on Colab for a task this size.
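Under the hood, that pipeline is essentially the standard HF masked-LM Trainer loop. Here's a rough sketch of it (not the exact code from the notebook), with a hypothetical corpus file `patents_train.txt` and made-up hyperparameters:

```python
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, LongformerForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Hypothetical corpus: one patent document per line.
raw = load_dataset("text", data_files={"train": "patents_train.txt"})

def tokenize(batch):
    # Truncate to the checkpoint's 4096-token window; longer documents are cut off here.
    return tokenizer(batch["text"], truncation=True, max_length=4096, padding="max_length")

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="longformer-patents-mlm",
    per_device_train_batch_size=1,    # 4096-token sequences are memory-hungry
    gradient_accumulation_steps=32,
    max_steps=3000,
    learning_rate=3e-5,
    fp16=torch.cuda.is_available(),
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_ds).train()
```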
Q3: Yes, it's feasible. I think this question was asked and answered in another thread.
Good luck!
@simonlevine Thank you so much for your help! I ended up using just the 4096 tokens supported by default.
@ibeltagy Many thanks for sharing Longformer with the community! (and for all the details you provide throughout the Issues section)
I also have a few questions:
Q1: If I want to pretrain Longformer-base-4096 on my custom dataset but set a limit bigger than 4096 (e.g., 10k), will I still be limited to 4096 tokens when I use the resulting model for fine-tuning (text classification)?
(Background: I have a patent dataset from the USPTO, and I want to classify each patent by its class. The average document length in my dataset is around 30k words; the minimum is 1,000 words and the maximum is 100k words.)
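For concreteness, here is roughly what I mean by the 4096 limit; `patent_text` is just a placeholder for one USPTO document, and I'm assuming the limit comes from the position embeddings baked into the checkpoint:

```python
from transformers import AutoConfig, AutoTokenizer

# The usable sequence length at fine-tuning time is fixed by the position embeddings
# stored in the checkpoint, not by anything set later.
config = AutoConfig.from_pretrained("allenai/longformer-base-4096")
print(config.max_position_embeddings)  # 4098 here: 4096 usable tokens plus RoBERTa's offset

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
patent_text = "..."  # placeholder for one ~30k-word USPTO patent
enc = tokenizer(patent_text, truncation=True, max_length=4096)
print(len(enc["input_ids"]))  # anything beyond 4096 tokens is truncated away
```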
Q2: Just for the pretraining part, do I need to use the "convert to long" notebook? https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb
Or can I just use the "How to train" notebook from HF? https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
Q3: Is it feasible to create a "long" version of XLNet?