allenai / PRIMER

The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Apache License 2.0

How to finetune with a new dataset? #6

Open cammy-mun opened 2 years ago

cammy-mun commented 2 years ago

Hi, I am trying to fine-tune PRIMERA from Hugging Face using Trainer with a new dataset. However, I keep getting ROUGE scores of 0. May I know which part of the code is wrong?

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import nltk
import numpy as np
import torch

TOKENIZER = AutoTokenizer.from_pretrained("allenai/PRIMERA")
MODEL = AutoModelForSeq2SeqLM.from_pretrained("allenai/PRIMERA")
MODEL.gradient_checkpointing_enable()
PAD_TOKEN_ID = TOKENIZER.pad_token_id
DOCSEP_TOKEN_ID = TOKENIZER.convert_tokens_to_ids("<doc-sep>")

from huggingface_hub import notebook_login

notebook_login()

Here I load my own reformatted version of the multi_news dataset from Hugging Face. The format is a (src, tgt) pair, where src contains the related documents and tgt is the summary. It is almost the same as the original multi_news dataset, except that I added a few more words at the front along with the "|||||" separator.

from datasets import load_dataset

train = load_dataset('cammy/multi_news_formatted_small', split='train[:100]', use_auth_token=True, cache_dir="D:")
valid = load_dataset('cammy/multi_news_formatted_small', split='valid[:10]', use_auth_token=True, cache_dir="D:")
test = load_dataset('cammy/multi_news_formatted_small', split='test[:10]', use_auth_token=True, cache_dir="D:")

Then I do the preprocessing of the data (the preprocessing code was attached as screenshots, which are not reproduced here).
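
Since the screenshots are not visible in this thread, here is a minimal sketch of the kind of preprocessing this setup implies; the "src"/"tgt" column names, the max lengths, and the use of "<doc-sep>" to join documents are assumptions, not the original poster's actual code:

# Hypothetical preprocessing sketch: split each source on "|||||",
# re-join the documents with "<doc-sep>", then tokenize inputs and targets.
def preprocess_function(examples):
    inputs = []
    for src in examples["src"]:
        docs = src.split("|||||")
        inputs.append(" <doc-sep> ".join(d.strip() for d in docs))
    model_inputs = TOKENIZER(
        inputs, max_length=4096, truncation=True, padding="max_length"
    )
    labels = TOKENIZER(
        examples["tgt"], max_length=1024, truncation=True, padding="max_length"
    )
    # Replace pad token ids in the labels with -100 so they are ignored by the loss.
    model_inputs["labels"] = [
        [(t if t != PAD_TOKEN_ID else -100) for t in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs

train_tokenized = train.map(preprocess_function, batched=True)
valid_tokenized = valid.map(preprocess_function, batched=True)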

Then lastly: trainer.train()

But these are the results (screenshot not reproduced here): all ROUGE scores come out as 0.

Wendy-Xiao commented 2 years ago

Hi,

I've never used Trainer before. Based on the results, I would say there must be something wrong with either the ground-truth summaries or the generated summaries; at least one of them is likely empty. I suggest you check the outputs and run a sanity check before fine-tuning.
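
A minimal version of such a sanity check, assuming the TOKENIZER, MODEL, and valid split defined in the post above (the "src"/"tgt" field names and the "|||||" separator are assumptions), might look like:

import torch

sample = valid[0]
docs = " <doc-sep> ".join(d.strip() for d in sample["src"].split("|||||"))
inputs = TOKENIZER(docs, max_length=4096, truncation=True, return_tensors="pt")

# Global attention on the first token and on every <doc-sep> token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
global_attention_mask[inputs["input_ids"] == DOCSEP_TOKEN_ID] = 1

with torch.no_grad():
    generated = MODEL.generate(
        **inputs, global_attention_mask=global_attention_mask, max_length=256
    )

print("REFERENCE:", sample["tgt"][:300])
print("GENERATED:", TOKENIZER.decode(generated[0], skip_special_tokens=True)[:300])

If either printout is empty, the problem is in the data or generation settings rather than in the ROUGE computation.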

theQuert commented 2 years ago

Hi friend, I'm also trying to fine-tune the model with my own dataset. Is the Trainer problem solved yet?

mdabedr commented 2 years ago

Can you please provide scripts for fine-tuning PRIMER on a new dataset? Details on that are scarce. By this I mean, could you add a bash script that would fine-tune on any dataset?

mdabedr commented 2 years ago

Follow-up question: why does your code always require a fine-tuned model as one of the arguments? From what I gather, the model path argument expects either a model fine-tuned on a specific dataset (multi_news, arxiv, etc.) or the default, longformer_summ_multinews. If we are fine-tuning, shouldn't the PRIMER pre-trained model suffice?

jaineshdoshi commented 2 years ago

Hi, I am also attempting to explore the pretrained model and see if I can fine-tune it on another dataset. I ran into the error below while trying to fine-tune PRIMERA on the sample WCEP dataset.

  File "__/software/Miniconda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "__/software/Miniconda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "__/software/code-server/lib/vscode/extensions_omni/ms-python.python-2020.11.371526539/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "__/software/code-server/lib/vscode/extensions_omni/ms-python.python-2020.11.371526539/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "__/software/code-server/lib/vscode/extensions_omni/ms-python.python-2020.11.371526539/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "__/software/Miniconda/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "__/software/Miniconda/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "__/software/Miniconda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "__/work/instance1/jupyter/PRIMER_train/script/primer_main.py", line 792, in <module>
    train(args)
  File "__/work/instance1/jupyter/PRIMER_train/script/primer_main.py", line 528, in train
    trainer.fit(model, train_dataloader, valid_dataloader)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 844, in run_train
    self.run_sanity_check(self.lightning_module)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in run_sanity_check
    self.run_evaluation()
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 967, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "__/software/Miniconda/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "__/work/instance1/jupyter/PRIMER_train/script/primer_main.py", line 261, in validation_step
    loss = self.shared_step(input_ids, output_ids)
  File "__/work/instance1/jupyter/PRIMER_train/script/primer_main.py", line 145, in shared_step
    lm_logits = self.forward(input_ids, output_ids)
  File "__/work/instance1/jupyter/PRIMER_train/script/primer_main.py", line 114, in forward
    use_cache=False,
  File "__/software/Miniconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "__/software/Miniconda/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 1295, in forward
    return_dict=return_dict,
  File "__/software/Miniconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "__/software/Miniconda/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 1157, in forward
    return_dict=return_dict,
  File "__/software/Miniconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "__/software/Miniconda/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 796, in forward
    output_attentions=output_attentions,
  File "__/software/Miniconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "__/software/Miniconda/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 309, in forward
    output_attentions=output_attentions,
  File "__/software/Miniconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'hidden_states'

I am able to run the notebook (Evaluation_Example.ipynb) given in the repo, which gives me a few extracted sentences for a few samples.

Library Versions used:

pytorch_lightning==1.3.8
torchmetrics==0.6.2
datasets==1.6.0
spacy==2.3.5
nltk==3.6.1
tqdm==4.49.0
rouge-score
torch==1.10.2
transformers==4.3.0

Did anyone else get this error? If so, how did you solve it? Or am I incorrect on library versioning here?

Thanks!

jaineshdoshi commented 2 years ago

I may have figured out a way to solve my problem and train PRIMERA on a dataset. The issue arises because the code uses the longformer library, which is slightly out of sync with the transformers version of the same classes. So instead of importing "from longformer import LongformerEncoderDecoderForConditionalGeneration, LongformerEncoderDecoderConfig", switch to "from transformers import LEDForConditionalGeneration, LEDConfig" and update the corresponding uses in the code. This let me train the model.
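
As a rough sketch, the swap described above amounts to something like the following (the exact places to edit depend on which parts of primer_main.py you use):

# Before (out-of-sync longformer package):
# from longformer import LongformerEncoderDecoderForConditionalGeneration, LongformerEncoderDecoderConfig

# After (equivalent LED classes from Hugging Face transformers):
from transformers import LEDConfig, LEDForConditionalGeneration

config = LEDConfig.from_pretrained("allenai/PRIMERA")
model = LEDForConditionalGeneration.from_pretrained("allenai/PRIMERA")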

mdabedr commented 2 years ago

Hi Jainesh, would you be okay with sharing the repository for the changed code?

theQuert commented 2 years ago

Hi Jainesh, I've also been working on the training process recently but still haven't found a working method. Would you please release your modified code?

Wendy-Xiao commented 2 years ago

Hi all,

If you want to fine-tune PRIMERA on new datasets, I would suggest using the Hugging Face version of PRIMERA and checking out the file 'script/primera_hf_main.py' (as the original Longformer package is out of sync). To use PRIMERA-hf, you can install the latest version of Hugging Face Transformers and import the model as follows:

from transformers import (
    AutoTokenizer,
    LEDConfig,
    LEDForConditionalGeneration,
)
tokenizer = AutoTokenizer.from_pretrained('allenai/PRIMERA')
config = LEDConfig.from_pretrained('allenai/PRIMERA')
model = LEDForConditionalGeneration.from_pretrained('allenai/PRIMERA')

I have not used the Hugging Face Trainer before, so I'm not sure what the problem would be. You may also consider using pytorch_lightning, which is what 'script/primera_hf_main.py' uses. If you want to use Trainer, you can still refer to the training part of that script to see how the model is trained. As for evaluation, you can check the notebook Evaluation_Example.ipynb.
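
For reference, a minimal sketch of the core of such a training step (whether inside a LightningModule or a Trainer), assuming the tokenizer and model loaded above; putting global attention on the first token and on the <doc-sep> tokens follows the PRIMERA setup, but check primera_hf_main.py for the exact details:

import torch

docsep_token_id = tokenizer.convert_tokens_to_ids("<doc-sep>")

def shared_step(input_ids, labels):
    # Global attention on the first token and on every <doc-sep> token,
    # local attention everywhere else.
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == docsep_token_id] = 1
    outputs = model(
        input_ids=input_ids,
        attention_mask=(input_ids != tokenizer.pad_token_id).long(),
        global_attention_mask=global_attention_mask,
        labels=labels,
    )
    return outputs.loss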

mdabedr commented 2 years ago

Are the max input and output lengths variable based on the target dataset, i.e., if training on a new dataset, do we assign these values ourselves from the dataset's summary statistics?

JohnGiorgi commented 2 years ago

Hi! I modified the official run_summarization.py script from HuggingFace and was able to fine-tune PRIMERA models with it. Figured I would share that script if it's useful to anyone else: https://gist.github.com/JohnGiorgi/8c7dcabd3ee8a362b9174c5d145029ab.

The main differences are:

  1. Truncate each document independently to length max_length // num_docs
  2. Add a global_attention_mask to the model_inputs, which is 1 for the bos_token and special "<doc-sep>" token, but 0 elsewhere.

You use the script the same way as the original run_summarization.py script, except you provide "allenai/PRIMERA-*" as the model_name_or_path.
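
A rough sketch of those two changes, assuming tokenizer is the PRIMERA tokenizer and the documents for one example arrive as a list of strings (the helper and variable names here are illustrative, not taken from the gist):

docsep_id = tokenizer.convert_tokens_to_ids("<doc-sep>")

def build_inputs(docs, max_length=4096):
    # 1. Truncate each document independently to max_length // num_docs tokens.
    per_doc = max_length // max(len(docs), 1)
    ids = [tokenizer.bos_token_id]
    for doc in docs:
        doc_ids = tokenizer(doc, add_special_tokens=False)["input_ids"][: per_doc - 1]
        ids.extend(doc_ids + [docsep_id])
    ids = ids[: max_length - 1] + [tokenizer.eos_token_id]

    # 2. Global attention is 1 for the bos token and every <doc-sep> token, 0 elsewhere.
    global_attention_mask = [
        1 if t in (tokenizer.bos_token_id, docsep_id) else 0 for t in ids
    ]
    return {"input_ids": ids, "global_attention_mask": global_attention_mask}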

zhangzx-uiuc commented 2 years ago

Hi @JohnGiorgi, did you encounter a problem where all the predictions become empty strings ("") after a few hundred steps of fine-tuning? I ran into this problem with both allenai/led-large-16384 and allenai/PRIMERA. (My issue is exactly the same as this one: https://github.com/huggingface/transformers/issues/18190.)

JohnGiorgi commented 2 years ago

I did have that issue quite a while ago, and it has since disappeared for me. There was a bug a while back where Seq2SeqTrainer was not taking the global_attention_mask into account, which may have been the problem. It might be worth updating transformers to the latest version (if you haven't already) and trying again.

zhangzx-uiuc commented 2 years ago

Hi @JohnGiorgi, thanks for your reply! However, I am still having this problem when running your provided script (https://gist.github.com/JohnGiorgi/8c7dcabd3ee8a362b9174c5d145029ab) with the newest version, transformers==4.21.0.dev0. I used the following command (on an 8x32GB V100 EC2 instance):

python run_summarization.py \
    --model_name_or_path allenai/PRIMERA \
    --do_train \
    --do_eval \
    --dataset_name multi_news \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir ./outputs \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate

The evaluation results are:

***** eval metrics *****
  epoch                   =        3.0
  eval_gen_len            =      128.0
  eval_loss               =     2.0331
  eval_rouge1             =        0.0
  eval_rouge2             =        0.0
  eval_rougeL             =        0.0
  eval_rougeLsum          =        0.0
  eval_runtime            = 0:11:05.88
  eval_samples            =       5621
  eval_samples_per_second =      8.441
  eval_steps_per_second   =      0.264

Not sure what causes this problem, but there must still be something wrong with the generation method in the Hugging Face implementation. Anyway, thanks very much for your script; it is really helpful.