allenai / PRIMER

The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Apache License 2.0

Using the (pretrained) model on new data #2

Closed MorenoLaQuatra closed 2 years ago

MorenoLaQuatra commented 2 years ago

Hi,

First of all, many thanks to the whole team for the amazing work. I'm trying to use the pretrained model (on Multi-News) to run inference on new data. At the moment I'm just trying with the test set of Multi-News itself.

I instantiate the model as suggested:

# Imports assumed: AutoTokenizer from transformers, and the encoder-decoder
# classes from the longformer package bundled with this repo (module path may
# differ across versions)
from transformers import AutoTokenizer
from longformer.longformer_encoder_decoder import LongformerEncoderDecoderConfig, LongformerEncoderDecoderForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained(model_path)
config = LongformerEncoderDecoderConfig.from_pretrained(model_path)
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(model_path, config=config).to(device)

Then I prepare the input similarly to any other HF model (I set max_input_length=4096):

inputs_dict = tokenizer(input_docs, padding="max_length", max_length=max_input_length, return_tensors="pt", truncation=True)
input_ids = inputs_dict.input_ids.to(device)
attention_mask = inputs_dict.attention_mask.to(device)
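As an aside, PRIMERA's released scripts also assign global attention to the first token and to the document-separator tokens when the input concatenates multiple documents. A minimal sketch of building such a mask (the separator id below is made up; in practice look it up with something like tokenizer.convert_tokens_to_ids on the actual separator token):

```python
# Hypothetical sketch: build a global-attention mask that marks the
# first token and every document-separator token with 1.
def build_global_attention_mask(token_ids, docsep_id):
    mask = [0] * len(token_ids)
    if mask:
        mask[0] = 1  # global attention on the first (<s>) token
    for i, tid in enumerate(token_ids):
        if tid == docsep_id:
            mask[i] = 1  # global attention on each document separator
    return mask

# Example with made-up ids (0 = <s>, 50265 = separator):
print(build_global_attention_mask([0, 11, 12, 50265, 13, 14], 50265))
# [1, 0, 0, 1, 0, 0]
```

The resulting list can be turned into a tensor and passed to generate alongside the regular attention mask.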

Finally, I use the following to generate the summary:

predicted_ids = model.generate(input_ids, attention_mask=attention_mask)
text = tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)

However, the summaries are very short compared with what is expected (at least for Multi-News). Here is an example of the output:

– Voters in 11 states will pick their governors tonight, and Republicans appear on track to increase their

It even seems to be truncated. Is there something I'm doing wrong?

MorenoLaQuatra commented 2 years ago

I think I got it: I updated the generate call as follows:

predicted_ids = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=max_output_length,
    min_length=0,
    num_beams=5,
    no_repeat_ngram_size=3,
)
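For context, no_repeat_ngram_size=3 forbids any token 3-gram from occurring twice in a generated sequence, which suppresses the repetitive loops beam search is prone to. A toy, model-independent check of that property:

```python
def has_repeated_ngram(tokens, n):
    """Return True if any n-gram occurs more than once in tokens."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

print(has_repeated_ngram(["a", "b", "c", "a", "b", "c"], 3))  # True
print(has_repeated_ngram(["a", "b", "c", "d"], 3))            # False
```

Any sequence generate produces under this constraint would make the first call return False.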

Is there any suggested value? (I set max_output_length=256 here)

Wendy-Xiao commented 2 years ago

Hi, thanks for your interest in our work!

You can find the settings we used for each dataset in /run_bash/*.sh. Specifically for Multi-News: if you use the non-finetuned model, I would suggest max_output_length=256; if you use the model fine-tuned on Multi-News, you can safely use max_output_length=1024, as the desired length is learned by fine-tuning on the in-domain dataset.

And if you want to apply the model to a new dataset, set max_output_length to any desired length (e.g., the average summary length in the training set, or however long you want the generated summaries to be).
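One way to pick that value is to average the reference-summary lengths in your training set and add some headroom. A minimal sketch with a hypothetical helper (whitespace tokenization here for illustration; in practice count tokens with the actual tokenizer):

```python
def suggest_max_output_length(reference_summaries, tokenize=str.split, margin=1.2):
    """Average token length of the reference summaries, padded by a margin."""
    lengths = [len(tokenize(s)) for s in reference_summaries]
    avg = sum(lengths) / len(lengths)
    return int(avg * margin)

# Toy example: average of 3 and 2 tokens is 2.5; with a 1.2x margin -> 3
print(suggest_max_output_length(["one two three", "four five"]))  # 3
```

The margin keeps generation from being cut off on summaries somewhat longer than average.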