allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

Code to generate longformer-base-4096 #84

Open arashashari opened 4 years ago

arashashari commented 4 years ago

Hello, I wonder if you have the code that generated this pre-trained model (or longformer-large-4096-finetuned-triviaqa).
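For context, my rough understanding of the notebook's core step is something like the sketch below (not the notebook verbatim: it extends roberta-base's 512 learned position embeddings to 4096 by copying them, before MLM pretraining on long documents):

```python
# sketch: extend roberta-base's position embeddings from 512 to 4096 positions
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

model = RobertaForMaskedLM.from_pretrained("roberta-base")
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", model_max_length=4096)

max_pos = 4096 + 2  # RoBERTa reserves positions 0 and 1, hence max_position_embeddings = 4098
old_pos_emb = model.roberta.embeddings.position_embeddings.weight  # shape [514, 768]
new_pos_emb = old_pos_emb.new_empty(max_pos, old_pos_emb.size(1))

# copy the 512 learned positions over and over until all 4096 slots are filled
new_pos_emb[:2] = old_pos_emb[:2]
k, step = 2, old_pos_emb.size(0) - 2
while k < max_pos:
    new_pos_emb[k:k + step] = old_pos_emb[2:]
    k += step

model.roberta.embeddings.position_embeddings.weight.data = new_pos_emb
model.config.max_position_embeddings = max_pos
# ...then each layer's self-attention is swapped for LongformerSelfAttention,
# the model is saved, and MLM pretraining continues on long documents
```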

I followed the steps in your notebook and updated the triviaqa.py file; it is attached.

However, I get a dimension mismatch in a linear operation, as follows:

```
Traceback (most recent call last):
  File "/home/arashari/longformer_original/scripts/triviaqa.py", line 760, in <module>
    main(args)
  File "/home/arashari/longformer_original/scripts/triviaqa.py", line 752, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 979, in fit
    self.single_gpu_train(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 185, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1139, in run_pretrain_routine
    False)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 293, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 485, in evaluation_forward
    output = model.validation_step(*args)
  File "/home/arashari/longformer_original/scripts/triviaqa.py", line 440, in validation_step
    output = self.forward(input_ids, input_mask, segment_ids, subword_starts, subword_ends)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py", line 197, in new_fwd
    *applier(kwargs, input_caster))
  File "/home/arashari/longformer_original/scripts/triviaqa.py", line 362, in forward
    logits = self.qa_outputs(sequence_output)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [1 x 2376], m2: [768 x 2] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:283
```

The run-time arguments are:

```
python -m triviaqa \
    --train_dataset output/squad-wikipedia-train-4096.json \
    --dev_dataset squad-wikipedia-dev-4096.json \
    --gpus 0,1,2,3,4,5,6,7 --batch_size 1 --num_workers 4 \
    --lr 0.00003 --warmup 1000 --epochs 4 --max_seq_len 4096 --doc_stride -1 \
    --save_prefix data/output/ \
    --seed 4321 \
    --attention_mode 'sliding_chunks' \
    --model_path roberta-base
```

Any comment/suggestion on this?

Further, I tried run_squad from huggingface, but it seems the converter (TriviaQA to SQuAD) does not add a title, so its output is not compatible with that example: it complains about missing fields such as "title". Are you aware of any other converter? triviaqa.txt
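One workaround I am considering is a small post-processing step like the sketch below (assuming the converter's output follows the usual SQuAD `data`/`paragraphs` layout; the file names are placeholders):

```python
import json

# placeholder paths; point these at the converter's output
IN_PATH = "squad-wikipedia-train-4096.json"
OUT_PATH = "squad-wikipedia-train-4096.titled.json"

with open(IN_PATH) as f:
    dataset = json.load(f)

# run_squad expects each entry under "data" to carry a "title" field;
# fill in a placeholder when the converter did not produce one
for i, entry in enumerate(dataset.get("data", [])):
    entry.setdefault("title", f"doc_{i}")

with open(OUT_PATH, "w") as f:
    json.dump(dataset, f)
```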

arashashari commented 4 years ago

Any comment on this?

ibeltagy commented 4 years ago

Creating this checkpoint is a long process that involves a lot of data and pretraining. The notebook you referenced is the closest we have to reproducing this process. That said, the error you are seeing is not related to that. Can you place a breakpoint before the line `logits = self.qa_outputs(sequence_output)` and check the shapes of `sequence_output`, `self.qa_outputs.weight`, and `self.qa_outputs.bias` to see which one differs from the expected shape?
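Something like this right before the failing line should show which tensor is off (just a sketch; the variable names follow triviaqa.py's forward and may differ slightly in your copy):

```python
# sketch: print shapes right before the failing matmul in forward()
print("sequence_output:", sequence_output.shape)           # last dim should equal hidden_size (768)
print("qa_outputs.weight:", self.qa_outputs.weight.shape)  # expect [2, 768] for nn.Linear(768, 2)
print("qa_outputs.bias:", self.qa_outputs.bias.shape)      # expect [2]
logits = self.qa_outputs(sequence_output)
```

From the traceback, `m2: [768 x 2]` is the transposed weight of `nn.Linear(768, 2)`, so the layer itself looks right; the error suggests it is `sequence_output` whose last dimension (2376) is off.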

arashashari commented 4 years ago

Thanks @ibeltagy, I was able to get around that issue and fix it. However, performance-wise the result is not good, since I am only using a small amount of data (per the notebook) while testing on TriviaQA.

{'exact_match': 5.079444513949706, 'f1': 13.769592296107348, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}

I wonder if you can share the long process you mentioned. Currently I am comparing against a very similar model that yields better performance, and being able to redo the process you went through could change that.

Thanks

ibeltagy commented 4 years ago

The process is described in detail in the paper, but the lack of careful pretraining shouldn't affect the result that much; see Table 11: Longformer (no MLM pretraining) --> 73.2. An F1 score of 13.7 indicates a bug, not a lack of pretraining. My guess is that the TriviaQA script is starting from a randomly initialized model rather than the pretrained one, maybe because of a wrong path or something similar.
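A quick way to check is to compare one tensor of the model the script builds against the checkpoint on disk; if the weights were really loaded, they should match exactly. This is just a sketch: the checkpoint path is a placeholder for the directory the notebook saved, `model` stands for the transformer inside the TriviaQA module, and the parameter names may need adjusting to your wrapping.

```python
import torch

# placeholder path to the directory saved by the notebook
CKPT = "path/to/roberta-base-4096/pytorch_model.bin"
saved = torch.load(CKPT, map_location="cpu")

# `model` is the transformer inside the TriviaQA LightningModule after construction
current = model.state_dict()

# compare the word-embedding matrices; identical values => the checkpoint was loaded
saved_emb = next(v for k, v in saved.items() if k.endswith("word_embeddings.weight"))
curr_emb = next(v for k, v in current.items() if k.endswith("word_embeddings.weight"))
print(torch.allclose(saved_emb.float(), curr_emb.float()))
```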

arashashari commented 4 years ago

Thanks @ibeltagy. I tend to agree that there is an issue with the model load for the QA part. Here is the config I get at the end of the notebook:

```json
{
  "architectures": ["RobertaLongForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512],
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}
```

Even when I reload the saved model and evaluate it for MLM, it gives me the correct bpc (1.635, the same value I get before saving).

I can see that in your TriviaQA code you load the Longformer model, which has a different config. But based on the paper, that model is pretrained with MLM, which I expect to be similar to the notebook. I have attached the TriviaQA script triviaqa_lf.txt that I updated to load the MLM model. I wonder if you can have a quick look and comment on what may be causing the issue in loading the model (the only difference is loading the new model; only a few lines changed).
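Roughly, the change I mean looks like the sketch below (an outline only, not the exact code in the attachment; `LongformerSelfAttention` is assumed to come from wherever the notebook imports it, and `MODEL_DIR` is a placeholder for the directory the notebook saved):

```python
from transformers import RobertaModel, RobertaTokenizerFast
# depending on the version, LongformerSelfAttention comes from
# transformers.modeling_longformer or from this repo's longformer package
from transformers.modeling_longformer import LongformerSelfAttention

MODEL_DIR = "path/to/roberta-base-4096"  # placeholder: directory written by the notebook

class RobertaLongModel(RobertaModel):
    """RobertaModel with each self-attention layer replaced by Longformer's
    sliding-window attention, mirroring RobertaLongForMaskedLM from the notebook."""

    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.encoder.layer):
            layer.attention.self = LongformerSelfAttention(config, layer_id=i)

tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_DIR)
model = RobertaLongModel.from_pretrained(MODEL_DIR)  # RobertaLongForMaskedLM weights should load into the base encoder
```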

Meanwhile, the code works fine with the data you have provided. However, I am trying to reproduce the pretrained model.