niansong1996 opened this issue 3 years ago
So I have run some extra experiments to locate the potential issue; the caveat is that the following experiments were not conducted on CNN/DM, but on another, smaller dataset.
`grad_norm` seems to be set to 0.1 in fairseq, and it is 1.0 in AllenNLP's training config. But I didn't find the performance to change much even when I changed it to 0.1.

Now my main focus is on the fact that fairseq's BART allows 2048 tokens for the input, but I believe the Huggingface version of BART used by AllenNLP only allows 1024, which could make a big difference.
Have you discovered anything else with your extra experiments? Assuming that the validation loss converges fine and doesn't immediately spike due to overfitting, warm-up shouldn't make a huge difference. The difference in number of allowed tokens could have an effect.
@ArjunSubramonian I don't have more updates than that on the experiments. But I've found that people usually extend the positional embeddings of BART from 1024 to 2048 to adapt to longer inputs (which I believe they did for CNN/DM). I haven't found the corresponding code in fairseq, but I've found this code in the Longformer repo. I hope we can do the same for BART in AllenNLP.
Also, just out of curiosity, when do we typically need warmup steps?
@niansong1996 would you be interested in implementing the sparse attention mechanism used by Longformer? I think it would be a great addition to our library.
There are a variety of reasons that people use LR warmup. You can read some of the responses here. People tend to use warmup when working with models that have attention mechanisms (to prevent your attention weights from overfitting or collapsing to just a few relations early in the training process) and when using adaptive optimizers (to improve gradient statistics).
I am not particularly familiar with Longformer and the sparse attention mechanism it uses. However, I can try to implement the positional embedding extension as an option for the BART model in AllenNLP, if such a thing hasn't been done yet. Can you or someone else confirm that this is needed?
Thanks a lot for the pointer and explanation for LR warmup!
@dirkgr thoughts on implementing the positional embedding extension as an option for the BART model in AllenNLP?
The extended positional embeddings would be great. I assume it would be relatively self-contained, just resizing the embedding matrix during model initialization?
@dirkgr Basically, yes. But it's a little bit trickier than that: some say that if we want to extend 1024 to 2048, we need to copy the first 1024 embeddings and initialize the rest, while others added that copying them twice works better. Also, BART has positions 0 and 1 reserved, so we need to account for that as well.
I will experiment a bit with different implementations and see what works.
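For concreteness, here is a rough sketch of the kind of extension I have in mind (not a final implementation; the module paths and the offset of 2 assume Huggingface's `BartLearnedPositionalEmbedding`):

```python
import torch
from transformers import BartForConditionalGeneration


def extend_bart_position_embeddings(model: BartForConditionalGeneration,
                                    new_max_positions: int = 2048) -> None:
    """Copy BART's learned 1024-position embeddings so they cover 2048 positions.

    Assumes Huggingface's BartLearnedPositionalEmbedding, which keeps an offset
    of 2 reserved positions (hence the extra 2 rows in the weight matrix).
    """
    for embed in (model.model.encoder.embed_positions,
                  model.model.decoder.embed_positions):
        old_weight = embed.weight.data               # shape: (1024 + 2, d_model)
        offset = 2
        old_positions = old_weight.size(0) - offset  # 1024
        d_model = old_weight.size(1)

        new_weight = old_weight.new_empty(new_max_positions + offset, d_model)
        new_weight[:offset] = old_weight[:offset]    # keep the two reserved rows
        # Tile the learned embeddings over the longer range; copying them twice
        # is reported to work better than randomly initializing the second half.
        for start in range(0, new_max_positions, old_positions):
            length = min(old_positions, new_max_positions - start)
            new_weight[offset + start:offset + start + length] = \
                old_weight[offset:offset + length]

        embed.weight = torch.nn.Parameter(new_weight)
        embed.num_embeddings = new_weight.size(0)

    model.config.max_position_embeddings = new_max_positions


model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
extend_bart_position_embeddings(model, 2048)
```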
@dirkgr So I managed to extend the positional embedding for BART by looking at how Longformer did it. Now I am trying to adjust the Tokenizer, and I can't seem to find a way to configure the tokenizer's `max_len` and `model_max_len` in the `.jsonnet` file to be more than 1024; I think they will always be reset to whatever is configured here.

So I am simply changing those params after the tokenizer is actually initialized, but I don't know if this is right, since I still get errors like `Token indices sequence length is longer than the specified maximum sequence length for this model (1780 > 1024). Running this sequence through the model will result in indexing errors.`

Or maybe none of this matters, since the Tokenizer doesn't have an internal cutoff mechanism?
Is this just about the warning? I think the tokenizers will tokenize just fine, but they want to print that warning. If you know of a way to silence the warning, I'd love to put it in.
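One option that might be enough, if it really is just a log message, is to raise the transformers logging level. This is only a sketch; it silences the warning and does not change any truncation behavior:

```python
import logging

import transformers

# Option 1: reduce all transformers logging to errors only.
transformers.logging.set_verbosity_error()

# Option 2: target just the module that emits the
# "Token indices sequence length is longer than ..." warning.
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)
```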
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
@niansong1996, any updates?
@dirkgr Sorry for the delay. It seems to me that the model is now accepting 2048 tokens after the positional embedding extension, despite the warnings. However, I still cannot match AllenNLP's BART to the fairseq version (on my task there is a ~2 point ROUGE-2 and ROUGE-L difference, which is quite large).

I am a little bit tied up with some paper deadlines, but I will open a PR for the positional embedding extension early next week, since it should be useful for other purposes as well.
@niansong1996 Check out the changes I made in #5207 and #5208. I was able to match the BART paper's scores using these hyperparameters for min/max sequence length and the length penalties. I actually didn't need to implement the trigram blocking 🤷♂️
Also, to reproduce the ROUGE-L result, you actually need to sentence-split the summaries before running ROUGE.
@danieldeutsch This is so cool! Thanks a lot!
About ROUGE-L: currently, the AllenNLP version directly takes the token ids in their tensor form (see here). Are you using other packages for ROUGE computation, or do you have a workaround?
The AllenNLP ROUGE is fine for training, but you should really use the original package for calculating the final scores on the test set. I think the AllenNLP version operates on the token ids, which could be problematic if your model uses subtokens for its vocabulary. ROUGE is also typically calculated using a Porter Stemmer on the tokens, which also isn't done here.
Shameless plug: I use a library that I developed to calculate ROUGE (https://github.com/danieldeutsch/sacrerouge). It has a wrapper around the original perl code. See here
Is this something we could adopt as a metric in AllenNLP? The metric API would require that we pass in two strings (or two lists of strings), one with the actual results, and one with the target results. Inside the metric we'd have to tokenize, porter-stem, and calculate ROUGE.
"Wrapper around perl" sounds a bit scary to me though. That's going to generate a lot of questions on Stack Overflow and GitHub from people whose perl installation is broken. Is there a faithful reimplementation in Python that we could use?
This one is popular https://github.com/google-research/google-research/tree/master/rouge. It's included in this pip package https://pypi.org/project/rouge-score/ and used by Huggingface https://github.com/huggingface/datasets/blob/67574a8d74796bc065a8b9b49ec02f7b1200c172/metrics/rouge/rouge.py
Yes, I can confirm that the AllenNLP version of ROUGE somewhat differs from the original perl version; I observe a ~1 point difference in ROUGE-1 and ROUGE-L on my task. But I wouldn't recommend using a version with a wrapper around perl, at least not for real-time evaluation. And I agree with @danieldeutsch on the `rouge-score` package, which yields only a ~0.1 difference from the perl version in my experiments.
Should we open another issue to talk about ROUGE score though?
On this issue, I am going to test #5207 and #5208 this weekend and see if they fix the performance mismatch problem.
There is no point in worrying about differences in the implementation of ROUGE while one method uses stemmed tokens, and another uses word pieces. Let's get a version running that also uses stemmed tokens, and then see if the numbers are still different.
Oh, I see that https://pypi.org/project/rouge-score/ does all of it, tokenization, stemming, the lot. Then it should be easy to use it in an AllenNLP metric.
@danieldeutsch I have just tried to use the latest AllenNLP version with #5207 and #5208 merged. I am using the `rouge-score` package for computing ROUGE scores, but I still can't get the numbers reported by fairseq. My current ROUGE-1/2/L scores are 43.2/20.6/30.4. My configuration is as follows:
local model_name = "facebook/bart-large";
local data_base_url = "https://storage.googleapis.com/allennlp-public-data/cnndm-combined-data-2020.07.13.tar.gz";
local train_data = data_base_url + "!cnndm-combined-data-2020.07.13/url_lists/all_train.txt";
local dev_data = data_base_url + "!cnndm-combined-data-2020.07.13/url_lists/all_val.txt";

{
    "train_data_path": train_data,
    "validation_data_path": dev_data,
    "dataset_reader": {
        "type": "cnn_dm",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": model_name
        },
        "source_token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": model_name,
                "namespace": "tokens"
            }
        },
        "source_max_tokens": 1022,
        "target_max_tokens": 140,
        // "max_instances": 1000 // DEBUG setting
    },
    "model": {
        "type": "bart_new_rouge",
        "model_name": model_name,
        "max_decoding_steps": 140,
        "min_decoding_steps": 55,
        "beam_size": 4,
    },
    "data_loader": {
        "batch_size": 2,
        "shuffle": true
    },
    "trainer": {
        "num_epochs": 3,
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "correct_bias": true
        },
        "learning_rate_scheduler": {
            "type": "polynomial_decay",
        },
        "grad_norm": 1.0,
        "cuda_device": 0,
        "run_confidence_checks": false,
        "validation_metric": "+NEW_ROUGE-L"
    }
}
Does this look similar to the setting you were running? Also, you mentioned that sentence splitting is needed to run ROUGE-L, but the `rouge-score` package only provides the API `scorer.score(tgt: str, src: str)`, so I am wondering what the correct way of running it is.

Currently, I am simply measuring the ROUGE scores by comparing `outputs["predicted_text"]` and `self._indexer._tokenizer.batch_decode(target_ids, skip_special_tokens=True)` at the end of the forward function of the BART model.

I also noticed that there is a `lenpen` in the fairseq version here; did you also implement this?
Yes, you need to use the length penalty of 2.0 via the LengthNormalizedSequenceLogProbabilityScorer. I think that should make the difference.
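In code, the decoding setup I mean looks roughly like this (a sketch; it assumes an allennlp version where `BeamSearch` accepts `min_steps` and a `final_sequence_scorer`):

```python
from allennlp.nn.beam_search import (
    BeamSearch,
    LengthNormalizedSequenceLogProbabilityScorer,
)

# BART's end-of-sequence token id is 2 for facebook/bart-large.
beam_search = BeamSearch(
    end_index=2,
    max_steps=140,  # max summary length
    min_steps=55,   # min summary length
    beam_size=4,
    final_sequence_scorer=LengthNormalizedSequenceLogProbabilityScorer(
        length_penalty=2.0
    ),
)
```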
I have never used the Google ROUGE package myself, but the note in the Readme explains how to reproduce the Perl ROUGE-L.
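For reference, that README's recipe boils down to something like the following sketch (untested here; nltk's `sent_tokenize` is just one convenient splitter and needs `nltk.download("punkt")` once):

```python
from nltk import sent_tokenize  # requires nltk.download("punkt") once
from rouge_score import rouge_scorer

# To approximate the perl script's ROUGE-L for summarization, use the
# "rougeLsum" type and put each sentence of the summary on its own line.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

predicted_summary = "The cat sat on the mat. It purred loudly."
reference_summary = "A cat was sitting on the mat. It was purring."

scores = scorer.score(
    "\n".join(sent_tokenize(reference_summary)),  # target first
    "\n".join(sent_tokenize(predicted_summary)),  # then the prediction
)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```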
@niansong1996, does your implementation of `NEW_ROUGE` use the Google package? Can you point me to that code? I want to see how hard it would be to make it an official AllenNLP metric.
Yes, this is using https://pypi.org/project/rouge-score/. What I was doing in my code is to have an `Average` metric as a wrapper and call the `rouge-score` package in the forward function. I think it should be super easy to write an AllenNLP metric out of it.
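Roughly, the wrapper looks like this (an illustrative sketch, not the exact code in my repo; the class and method names are just placeholders):

```python
from typing import Dict, List

from allennlp.training.metrics import Average
from nltk import sent_tokenize  # requires nltk.download("punkt") once
from rouge_score import rouge_scorer


class RougeScoreMetric:
    """Wrap rouge-score F-measures in allennlp Average metrics."""

    def __init__(self) -> None:
        self._rouge_types = ["rouge1", "rouge2", "rougeLsum"]
        self._scorer = rouge_scorer.RougeScorer(self._rouge_types, use_stemmer=True)
        self._averages = {t: Average() for t in self._rouge_types}

    def __call__(self, predictions: List[str], targets: List[str]) -> None:
        # Called from the model's forward() on the decoded prediction/target strings.
        for prediction, target in zip(predictions, targets):
            # Newline-separate sentences so "rougeLsum" matches the perl ROUGE-L.
            scores = self._scorer.score(
                "\n".join(sent_tokenize(target)),
                "\n".join(sent_tokenize(prediction)),
            )
            for rouge_type in self._rouge_types:
                self._averages[rouge_type](scores[rouge_type].fmeasure)

    def get_metric(self, reset: bool = False) -> Dict[str, float]:
        return {t: avg.get_metric(reset) for t, avg in self._averages.items()}
```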
Checklist

- I have verified that the issue exists against the `main` branch of AllenNLP.
- I have included in the "Environment" section below the output of `pip freeze`.
Description

I was using `allennlp-models==2.3.0` and training with the script `allennlp-models/training_scripts/generation/bart_cnn_dm.jsonnet`, and I got the following performance output:

However, according to the implementation from fairseq (and what's reported in the paper), the Rouge-1/2/L scores should be 44.16/21.28/40.90, so the Rouge-L score is 9.7 points below the reference value, while the Rouge-1/2 scores show some improvement.

I am wondering if this is expected and why the Rouge-L score is significantly worse. Is this an issue with how BART models are implemented in `allennlp-models`, or with how the Rouge-L score is computed in `allennlp.training.metrics`?

Related issues or possible duplicates
Environment
OS: Linux
Python version: 3.8.8
Output of `pip freeze`:

(I've created an environment with only `allennlp==2.3.0` and no other pkgs installed)

```
allennlp==2.3.0
attrs==20.3.0
blis==0.7.4
boto3==1.17.53
botocore==1.20.53
catalogue==2.0.3
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
configparser==5.0.2
conllu==4.4
cymem==2.0.5
docker-pycreds==0.4.0
filelock==3.0.12
ftfy==6.0.1
gitdb==4.0.7
GitPython==3.1.14
h5py==3.2.1
idna==2.10
iniconfig==1.1.1
Jinja2==2.11.3
jmespath==0.10.0
joblib==1.0.1
jsonnet==0.17.0
lmdb==1.2.0
MarkupSafe==1.1.1
more-itertools==8.7.0
murmurhash==1.0.5
nltk==3.6.1
numpy==1.20.2
overrides==3.1.0
packaging==20.9
pathtools==0.1.2
pathy==0.4.0
Pillow==8.2.0
pluggy==0.13.1
preshed==3.0.5
promise==2.3
protobuf==3.15.8
psutil==5.8.0
py==1.10.0
pydantic==1.7.3
pyparsing==2.4.7
py-rouge==1.1
pytest==6.2.3
python-dateutil==2.8.1
PyYAML==5.4.1
regex==2021.4.4
requests==2.25.1
s3transfer==0.3.7
sacremoses==0.0.45
scikit-learn==0.24.1
scipy==1.6.2
sentencepiece==0.1.95
sentry-sdk==1.0.0
shortuuid==1.0.1
six==1.15.0
smart-open==3.0.0
smmap==4.0.0
spacy==3.0.5
spacy-legacy==3.0.2
srsly==2.4.1
subprocess32==3.5.4
tensorboardX==2.2
thinc==8.0.2
threadpoolctl==2.1.0
tokenizers==0.10.2
toml==0.10.2
torch==1.8.1+cu111
torchvision==0.9.1
tqdm==4.60.0
transformers==4.5.1
typer==0.3.2
typing-extensions==3.7.4.3
urllib3==1.26.4
wandb==0.10.26
wasabi==0.8.2
wcwidth==0.2.5
word2number==1.1
```
Steps to reproduce

1. Create an environment with `python==3.8` and `allennlp==2.3.0`
2. Clone the `allennlp-models` repo
3. Run `allennlp train training_scripts/generation/bart_cnn_dm.jsonnet -s tmp-save-dir --include-package allennlp_models`