allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Implement a ROUGE metric that faithfully reproduces the official metric written in perl. #5153

Open niansong1996 opened 3 years ago

niansong1996 commented 3 years ago

Description

I was using allennlp-models==2.3.0 and training with the script allennlp-models/training_scripts/generation/bart_cnn_dm.jsonnet, and I got the following performance output:

{
  "best_epoch": 0,
  "peak_worker_0_memory_MB": 77679.125,
  "peak_gpu_0_memory_MB": 19497.5625,
  "training_duration": "1 day, 9:30:58.908029",
  "training_start_epoch": 0,
  "training_epochs": 2,
  "epoch": 2,
  "training_loss": 2.7070548462225283,
  "training_worker_0_memory_MB": 77679.125,
  "training_gpu_0_memory_MB": 19497.5625,
  "validation_ROUGE-1_R": 0.4871779537805129,
  "validation_ROUGE-2_R": 0.26309701739882685,
  "validation_ROUGE-1_P": 0.3966578995429105,
  "validation_ROUGE-2_P": 0.21290430088784706,
  "validation_ROUGE-1_F1": 0.4283963905120849,
  "validation_ROUGE-2_F1": 0.23045514136364303,
  "validation_ROUGE-L": 0.3206116030616199,
  "validation_BLEU": 0.18484394329002954,
  "validation_loss": 0.0,
  "best_validation_ROUGE-1_R": 0.47620558012437575,
  "best_validation_ROUGE-2_R": 0.25229075181929206,
  "best_validation_ROUGE-1_P": 0.38737318484205874,
  "best_validation_ROUGE-2_P": 0.20447094269175353,
  "best_validation_ROUGE-1_F1": 0.41917399613391276,
  "best_validation_ROUGE-2_F1": 0.22154245158723443,
  "best_validation_ROUGE-L": 0.31225680111602044,
  "best_validation_BLEU": 0.17805890029860716,
  "best_validation_loss": 0.0
}

However, according to the fairseq implementation (and what's reported in the paper), the ROUGE-1/2/L scores should be 44.16/21.28/40.90, so the ROUGE-L score here is about 9.7 points below the reference value, while the ROUGE-1/2 scores show some improvement.

I am wondering whether this is expected and why the ROUGE-L score is significantly worse. Is this an issue with how BART models are implemented in allennlp-models, or with how the ROUGE-L score is computed in allennlp.training.metrics?

Environment

OS: Linux

Python version: 3.8.8

Output of pip freeze:

(I've created an environment with only `allennlp==2.3.0` and no other pkgs installed)

```
allennlp==2.3.0
attrs==20.3.0
blis==0.7.4
boto3==1.17.53
botocore==1.20.53
catalogue==2.0.3
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
configparser==5.0.2
conllu==4.4
cymem==2.0.5
docker-pycreds==0.4.0
filelock==3.0.12
ftfy==6.0.1
gitdb==4.0.7
GitPython==3.1.14
h5py==3.2.1
idna==2.10
iniconfig==1.1.1
Jinja2==2.11.3
jmespath==0.10.0
joblib==1.0.1
jsonnet==0.17.0
lmdb==1.2.0
MarkupSafe==1.1.1
more-itertools==8.7.0
murmurhash==1.0.5
nltk==3.6.1
numpy==1.20.2
overrides==3.1.0
packaging==20.9
pathtools==0.1.2
pathy==0.4.0
Pillow==8.2.0
pluggy==0.13.1
preshed==3.0.5
promise==2.3
protobuf==3.15.8
psutil==5.8.0
py==1.10.0
pydantic==1.7.3
pyparsing==2.4.7
py-rouge==1.1
pytest==6.2.3
python-dateutil==2.8.1
PyYAML==5.4.1
regex==2021.4.4
requests==2.25.1
s3transfer==0.3.7
sacremoses==0.0.45
scikit-learn==0.24.1
scipy==1.6.2
sentencepiece==0.1.95
sentry-sdk==1.0.0
shortuuid==1.0.1
six==1.15.0
smart-open==3.0.0
smmap==4.0.0
spacy==3.0.5
spacy-legacy==3.0.2
srsly==2.4.1
subprocess32==3.5.4
tensorboardX==2.2
thinc==8.0.2
threadpoolctl==2.1.0
tokenizers==0.10.2
toml==0.10.2
torch==1.8.1+cu111
torchvision==0.9.1
tqdm==4.60.0
transformers==4.5.1
typer==0.3.2
typing-extensions==3.7.4.3
urllib3==1.26.4
wandb==0.10.26
wasabi==0.8.2
wcwidth==0.2.5
word2number==1.1
```

Steps to reproduce

  1. Create an environment with python==3.8 and allennlp==2.3.0
  2. Clone the allennlp-models repo
  3. Run allennlp train training_scripts/generation/bart_cnn_dm.jsonnet -s tmp-save-dir --include-package allennlp_models

niansong1996 commented 3 years ago

So I have run some extra experiments to locate the potential issue; the caveat is that the following experiments were not conducted on CNN/DM but on another, smaller dataset.

  1. The ROUGE evaluation is less likely to be the issue, because I've evaluated the same output with AllenNLP's ROUGE and other ROUGE packages, including the one wrapping the original Perl implementation, and the differences are minimal;
  2. I've noticed a few differences in the hyperparameters: for example, grad_norm seems to be set to 0.1 in fairseq but to 1.0 in AllenNLP's training config. However, the performance didn't change much when I changed it to 0.1;
  3. There is a warm-up phase in the fairseq implementation, but I haven't found a clear definition of how it is done;

Now my main focus is on the fact that fairseq's BART allows 2048 tokens of input, while I believe the HuggingFace version of BART used by AllenNLP only allows 1024; this could make a big difference.

ArjunSubramonian commented 3 years ago

Have you discovered anything else with your extra experiments? Assuming that the validation loss converges fine and doesn't immediately spike due to overfitting, warm-up shouldn't make a huge difference. The difference in number of allowed tokens could have an effect.

niansong1996 commented 3 years ago

@ArjunSubramonian I don't have any more updates on the experiments yet. But I've found that people usually extend the positional embeddings of BART from 1024 to 2048 to handle longer inputs (which I believe they did for CNN/DM). I haven't found the corresponding code in fairseq, but I've found this code in the longformer repo. Hope we can do the same for BART in AllenNLP.

Also, just out of curiosity, when do we typically need warmup steps?

ArjunSubramonian commented 3 years ago

@niansong1996 would you be interested in implementing the sparse attention mechanism used by Longformer? I think it would be a great addition to our library.

There are a variety of reasons that people use LR warmup. You can read some of the responses here. People tend to use warmup when working with models that have attention mechanisms (to prevent your attention weights from overfitting or collapsing to just a few relations early in the training process) and when using adaptive optimizers (to improve gradient statistics).

niansong1996 commented 3 years ago

I am not particularly familiar with Longformer and the sparse attention mechanism it uses. However, I can try to implement the positional embedding extension as an option for the BART model in AllenNLP, if such a thing hasn't been done yet. Can you or someone else confirm that this is needed?

Thanks a lot for the pointer and explanation for LR warmup!

ArjunSubramonian commented 3 years ago

@dirkgr thoughts on implementing the positional embedding extension as an option for the BART model in AllenNLP?

dirkgr commented 3 years ago

The extended positional embeddings would be great. I assume it would be relatively self-contained, just resizing the embedding matrix during model initialization?

niansong1996 commented 3 years ago

@dirkgr Basically, yes. But it's a little trickier than that: some say that if we want to extend from 1024 to 2048, we need to copy the first 1024 embeddings and initialize the rest, while others have added that copying them twice works better. Also, BART has positions 0 and 1 reserved, so we need to account for that as well.

I will experiment a bit with different implementations and see what works.
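For illustration, here is a minimal sketch of the "copy the trained embeddings" strategy discussed above. It assumes the HuggingFace BartForConditionalGeneration layout from transformers 4.x (model.model.encoder.embed_positions with BART's two reserved positions); this is an assumption-laden sketch, not the implementation that was eventually merged:

```python
# Sketch: extend BART's learned positional embeddings from 1024 to 2048 positions
# by tiling the trained rows, similar in spirit to the Longformer conversion script.
# Assumes the transformers 4.x BartLearnedPositionalEmbedding layout, where the
# first two rows are reserved (offset = 2).
import torch
from transformers import BartForConditionalGeneration

def extend_position_embeddings(model: BartForConditionalGeneration, new_max_positions: int = 2048):
    offset = 2  # BART reserves positions 0 and 1
    for embed in (model.model.encoder.embed_positions, model.model.decoder.embed_positions):
        old_weight = embed.weight.data                      # shape: (old_max + offset, d_model)
        old_max = old_weight.size(0) - offset
        new_weight = old_weight.new_empty(new_max_positions + offset, old_weight.size(1))
        new_weight[:offset] = old_weight[:offset]           # keep the reserved rows
        k = offset
        while k < new_weight.size(0):                       # tile the trained embeddings
            step = min(old_max, new_weight.size(0) - k)
            new_weight[k:k + step] = old_weight[offset:offset + step]
            k += step
        embed.weight = torch.nn.Parameter(new_weight)
        embed.num_embeddings = new_weight.size(0)
    model.config.max_position_embeddings = new_max_positions
    return model

model = extend_position_embeddings(BartForConditionalGeneration.from_pretrained("facebook/bart-large"))
```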

niansong1996 commented 3 years ago

@dirkgr So I managed to extend the positional embeddings for BART by looking at how Longformer did it. Now I am trying to work with the Tokenizer, and I can't seem to find a way to configure the tokenizer in the .jsonnet file so that max_len and model_max_len are more than 1024; I think they will always be reset to whatever is configured here.

So I am simply changing those params after the tokenizer is actually initialized, but I don't know whether this is right, since I still get warnings like Token indices sequence length is longer than the specified maximum sequence length for this model (1780 > 1024). Running this sequence through the model will result in indexing errors. Or maybe none of this matters, since the Tokenizer doesn't have an internal cutoff mechanism?
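For concreteness, a hypothetical sketch of the post-initialization tweak described above, assuming AllenNLP's PretrainedTransformerTokenizer exposes the underlying HuggingFace tokenizer on its .tokenizer attribute:

```python
# Hypothetical workaround: lift the tokenizer's length cap after construction,
# since the value gets reset to the model default during initialization.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

source_tokenizer = PretrainedTransformerTokenizer("facebook/bart-large")
source_tokenizer.tokenizer.model_max_length = 2048  # match the extended positional embeddings
```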

dirkgr commented 3 years ago

Is this just about the warning? I think the tokenizers will tokenize just fine, but they want to print that warning. If you know of a way to silence the warning, I'd love to put it in.
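In case it helps, a hedged sketch of two ways the warning might be quieted; the logger name is an assumption based on where the message appears to originate in transformers 4.x:

```python
# Silence the "Token indices sequence length is longer than ..." warning.
# Option 1: lower verbosity for the whole transformers library.
import transformers
transformers.logging.set_verbosity_error()

# Option 2 (narrower): raise the level of the tokenizer base logger only.
import logging
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)
```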

github-actions[bot] commented 3 years ago

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

dirkgr commented 3 years ago

@niansong1996, any updates?

niansong1996 commented 3 years ago

@dirkgr Sorry for the delay. It seems to me that the model is now accepting 2048 tokens after the positional embedding extension, despite the warnings. However, I still cannot match AllenNLP's BART to the fairseq version (on my task there is a ~2-point difference in ROUGE-2 and ROUGE-L, which is quite large).

I am a little bit tied up with some paper deadlines, but I will make a PR about the positional embedding early next week since it should be useful for other purposes as well.

danieldeutsch commented 3 years ago

@niansong1996 Check out the changes I made in #5207 and #5208. I was able to match the BART paper's scores using these hyperparameters for min/max sequence length and the length penalties. I actually didn't need to implement the trigram blocking 🤷‍♂️

danieldeutsch commented 3 years ago

Also to reproduce the ROUGE-L result, you actually need to sentence split the summaries before running ROUGE.

niansong1996 commented 3 years ago

@danieldeutsch This is so cool! Thanks a lot!

About ROUGE-L: currently the AllenNLP version directly takes the token ids in tensor form (see here). Are you using another package for the ROUGE computation, or do you have a workaround?

danieldeutsch commented 3 years ago

The AllenNLP ROUGE is fine for training, but you should really use the original package for calculating the final scores on the test set. I think the AllenNLP version operates on the token ids, which could be problematic if your model uses subtokens for its vocabulary. ROUGE is also typically calculated using a Porter Stemmer on the tokens, which also isn't done here.

Shameless plug: I use a library that I developed to calculate ROUGE (https://github.com/danieldeutsch/sacrerouge). It has a wrapper around the original perl code. See here

dirkgr commented 3 years ago

Is this something we could adopt as a metric in AllenNLP? The metric API would require that we pass in two strings (or two lists of strings), one with the actual results, and one with the target results. Inside the metric we'd have to tokenize, porter-stem, and calculate ROUGE.

"Wrapper around perl" sounds a bit scary to me though. That's going to generate a lot of questions on Stack Overflow and GitHub from people whose perl installation is broken. Is there a faithful reimplementation in Python that we could use?

danieldeutsch commented 3 years ago

This one is popular https://github.com/google-research/google-research/tree/master/rouge. It's included in this pip package https://pypi.org/project/rouge-score/ and used by Huggingface https://github.com/huggingface/datasets/blob/67574a8d74796bc065a8b9b49ec02f7b1200c172/metrics/rouge/rouge.py
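A small usage sketch of the rouge-score package linked above; per its README, "rougeLsum" with newline-separated sentences is the variant meant to mirror the Perl script's summary-level ROUGE-L (nltk's sent_tokenize is used here only as an illustrative sentence splitter):

```python
# Score a prediction against a reference with Porter stemming, splitting
# sentences onto separate lines so rougeLsum sees sentence boundaries.
from nltk import sent_tokenize  # requires the nltk "punkt" data
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

reference = "The cat sat on the mat. It then fell asleep."
prediction = "A cat sat on a mat. Afterwards it slept."

scores = scorer.score(
    "\n".join(sent_tokenize(reference)),   # target first
    "\n".join(sent_tokenize(prediction)),  # prediction second
)
print(scores["rougeLsum"].fmeasure)
```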

niansong1996 commented 3 years ago

Yes, I can confirm that the AllenNLP version of ROUGE differs somewhat from the original Perl version; I observe a ~1-point difference in ROUGE-1 and ROUGE-L on my task. But I wouldn't recommend using a version that wraps the Perl script, at least not for real-time evaluation. And I agree with @danieldeutsch on the rouge-score package, which yields only a ~0.1 difference from the Perl version in my experiments.

Should we open another issue to talk about ROUGE score though?

On this issue, I am going to test on #5207 and #5208 this weekend and see if they fixed the performance mismatch problem.

dirkgr commented 3 years ago

There is no point in worrying about differences in the implementation of ROUGE while one method uses stemmed tokens, and another uses word pieces. Let's get a version running that also uses stemmed tokens, and then see if the numbers are still different.

dirkgr commented 3 years ago

Oh, I see that https://pypi.org/project/rouge-score/ does all of it, tokenization, stemming, the lot. Then it should be easy to use it in an AllenNLP metric.

niansong1996 commented 3 years ago

@danieldeutsch I have just tried to use the latest AllenNLP version with #5207 and #5208 merged. I am using rouge-score as the package for computing ROUGE scores, but I still can't get the numbers reported by fairseq. My current ROUGE-1/2/L scores are 43.2/20.6/30.4. My configuration is as follows:

```
local model_name = "facebook/bart-large";
local data_base_url = "https://storage.googleapis.com/allennlp-public-data/cnndm-combined-data-2020.07.13.tar.gz";
local train_data = data_base_url + "!cnndm-combined-data-2020.07.13/url_lists/all_train.txt";
local dev_data = data_base_url + "!cnndm-combined-data-2020.07.13/url_lists/all_val.txt";

{
    "train_data_path": train_data,
    "validation_data_path": dev_data,
    "dataset_reader": {
        "type": "cnn_dm",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": model_name
        },
        "source_token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": model_name,
                "namespace": "tokens"
            }
        },
        "source_max_tokens": 1022,
        "target_max_tokens": 140,
        // "max_instances": 1000 // DEBUG setting
    },
    "model": {
        "type": "bart_new_rouge",
        "model_name": model_name,
        "max_decoding_steps": 140,
        "min_decoding_steps": 55,
        "beam_size": 4,
    },
    "data_loader": {
        "batch_size": 2,
        "shuffle": true
    },
    "trainer": {
        "num_epochs": 3,
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "correct_bias": true
        },
        "learning_rate_scheduler": {
            "type": "polynomial_decay",
        },
        "grad_norm": 1.0,
        "cuda_device": 0,
        "run_confidence_checks": false,
        "validation_metric": "+NEW_ROUGE-L"
    }
}
```

Does this look similar to the setting you were running? Also, you mentioned that sentence splitting is needed to reproduce ROUGE-L, but the rouge-score package only provides the scorer.score(tgt: str, src: str) API, so I am wondering what the correct way of running it is.

Currently, I am simply measuring the ROUGE scores by comparing outputs["predicted_text"] and self._indexer._tokenizer.batch_decode(target_ids, skip_special_tokens=True) at the end of the forward function of the BART model.

I also noticed that there is a lenpen in the fairseq version here; did you also implement this?

danieldeutsch commented 3 years ago

Yes, you need to use the length penalty of 2.0 via the LengthNormalizedSequenceLogProbabilityScorer. I think that should make the difference.

I have never used the Google ROUGE package myself, but the note in the Readme explains how to reproduce the Perl ROUGE-L.
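For reference, a hedged sketch of what wiring that scorer into AllenNLP's beam search might look like in Python; the argument names (min_steps, final_sequence_scorer, length_penalty) are assumptions based on the additions in #5207/#5208 and may differ between versions:

```python
# Sketch: length-normalized final scoring with a length penalty of 2.0, as fairseq uses.
from allennlp.nn.beam_search import BeamSearch, LengthNormalizedSequenceLogProbabilityScorer

beam_search = BeamSearch(
    end_index=2,      # BART's eos_token_id; stated here only for illustration
    max_steps=140,
    min_steps=55,
    beam_size=4,
    final_sequence_scorer=LengthNormalizedSequenceLogProbabilityScorer(length_penalty=2.0),
)
```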

dirkgr commented 3 years ago

@niansong1996, your implementation of NEW_ROUGE, it uses the Google package? Can you point me to that code? I want to see how hard it would be to make it an official AllenNLP metric.

niansong1996 commented 3 years ago

> @niansong1996, your implementation of NEW_ROUGE, it uses the Google package? Can you point me to that code? I want to see how hard it would be to make it an official AllenNLP metric.

Yes, this is using https://pypi.org/project/rouge-score/. What I was doing in my code is to have an Average metric as a wrapper and call the rouge-score package in the forward function. I think it should be super easy to write an AllenNLP metric out of it.
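To make that concrete, here is a rough sketch of the wrapper described above (my own illustration, not an official AllenNLP metric): keep an Average metric internally and feed it rougeLsum F1 values computed from decoded strings inside forward():

```python
# Rough sketch of wrapping rouge-score in AllenNLP's Average metric.
from allennlp.training.metrics import Average
from rouge_score import rouge_scorer

class AverageRougeLsum:
    """Tracks the running mean of rougeLsum F1 over decoded prediction/reference pairs."""

    def __init__(self) -> None:
        self._scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=True)
        self._average = Average()

    def __call__(self, predictions, references) -> None:
        # Both arguments are lists of plain strings, e.g. batch_decode() output.
        for prediction, reference in zip(predictions, references):
            self._average(self._scorer.score(reference, prediction)["rougeLsum"].fmeasure)

    def get_metric(self, reset: bool = False) -> float:
        return self._average.get_metric(reset)
```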