pvcastro opened this issue 1 year ago
Hey @pvcastro, a couple questions: … `(Ro|B)ertaForSequenceClassification` models?

Hi @epwalsh, thanks for the feedback! Gotcha! Oh yes, I meant `BertForTokenClassification`, not `BertForSequenceClassification` 🤦
So I think the most likely source for a bug would be in the `PretrainedTransformerMismatched(Embedder|TokenIndexer)`. And any differences between BERT and RoBERTa would probably have to do with tokenization. See, for example:
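One concrete tokenization difference is the special-token convention: BERT-style tokenizers wrap a single sequence in `[CLS]`/`[SEP]`, while RoBERTa uses `<s>`/`</s>` (and a byte-level BPE that marks leading spaces). A toy sketch of just the wrapping difference — a hypothetical helper for illustration, not AllenNLP or transformers code:

```python
# Toy illustration (not library code): the special tokens that BERT and
# RoBERTa wrap around a single sequence differ, and a mismatched
# indexer/embedder has to get this right for each architecture.
def add_special_tokens(tokens, model_type):
    if model_type == "bert":
        return ["[CLS]"] + tokens + ["[SEP]"]
    if model_type == "roberta":
        return ["<s>"] + tokens + ["</s>"]
    raise ValueError(f"unknown model type: {model_type}")
```

In practice you would compare the output of each model's actual tokenizer rather than rely on hard-coded conventions like these.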
I was assuming that just running some unit tests from the AllenNLP repository, to confirm that these embedders/tokenizers produce sequences with the special tokens the RoBERTa architecture expects, would be enough to rule these out. I ran some tests using RoBERTa and confirmed that it's not relying on CLS. Was this too superficial to reach any conclusions?
I'm not sure. I mean, I thought we did have pretty good test coverage there, but I know for a fact that's one of the most brittle pieces of code in the whole library. It would break all the time with new releases of `transformers`. So that's my best guess.
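Given that brittleness, one cheap guard is to compare the installed `transformers` version against the one an integration was last verified with. A hypothetical helper (the `4.21.3` default is simply the version appearing in the `pip freeze` output in this issue):

```python
import importlib.metadata

# Hypothetical guard: warn when the installed package version differs from
# the one the integration was last verified against.
def matches_tested_version(package="transformers", tested="4.21.3"):
    try:
        installed = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Package not installed at all.
        return False
    if installed != tested:
        print(f"Warning: {package} {installed} installed, tested with {tested}")
    return installed == tested
```

This obviously doesn't fix anything by itself, but it makes silent drift between the tested and installed `transformers` versions visible before debugging model quality.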
Do you think it makes sense for me to run additional tests for the embedder, comparing embeddings produced by a raw `RobertaModel` and the actual `PretrainedTransformerMismatchedEmbedder`? To try to see if they are somehow getting "corrupted" in the framework.
I guess I would start by looking very closely at the exact tokens that are being used for each word by the `PretrainedTransformerMismatchedEmbedder`. Maybe pick out a couple of test instances where the performance gap between the BERT and RoBERTa predictions is largest.
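As I understand it, the mismatched embedder pools each word's wordpiece vectors (e.g. by averaging over the word's offsets) back into one vector per word, so a related sanity check is to reproduce that pooling by hand and compare against the embedder's output. A dependency-free sketch of the averaging step — my own helper, not AllenNLP's implementation:

```python
# Sketch (not AllenNLP's code): average each word's subword vectors back
# into one vector per word, given inclusive (start, end) wordpiece offsets.
def pool_by_offsets(subword_vecs, offsets):
    pooled = []
    for start, end in offsets:
        span = subword_vecs[start:end + 1]   # the word's wordpiece vectors
        n, dim = len(span), len(span[0])
        pooled.append([sum(v[i] for v in span) / n for i in range(dim)])
    return pooled
```

If the hand-pooled vectors disagree with what the embedder returns for the same instance, the offsets (and hence the word-to-wordpiece alignment) are the first thing to suspect.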
Ok, thanks! I'll try testing something like this and will report back.
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
Sorry, I'll try to get back to this next week, haven't had the time yet 😞
No rush, I thought adding the "question" label would stop @github-actions bot from closing this, but I guess not.
Checklist

- I have verified that the issue exists against the `main` branch of AllenNLP.
- I have included in the "Environment" section below the output of `pip freeze`.

Description
I've been using AllenNLP since 2018, and I have already run thousands of NER benchmarks with it. Since ELMo, and later with transformers, its CrfTagger model has always yielded superior results in every benchmark for this task. However, since my research group trained several RoBERTa models for Portuguese, we have been benchmarking them against an existing BERT model, and we have been getting inconsistent results compared to other frameworks, such as Hugging Face's transformers.
Sorted results for the AllenNLP grid search on CoNLL 2003 using optuna (all the BERTs' results are better than all the RoBERTas'):

Sorted results for the Hugging Face transformers grid search on CoNLL 2003 (all the RoBERTas' results are better than all the BERTs'):
I originally opened this as a question on Stack Overflow, as suggested in the issue guidelines (additional details are already provided there), but I have failed to discover the problem by myself. I have run several unit tests from AllenNLP concerning the tokenizers and embedders and couldn't notice anything wrong, but I'm betting something is definitely wrong in the training process, since the results are so much worse for non-BERT models.
Although I'm reporting details with the current release version, I'd like to point out that I had already run this CoNLL 2003 benchmark with RoBERTa/AllenNLP a long time ago, so it's not something new. Back then the results for RoBERTa were well below bert-base, but I just assumed RoBERTa wasn't competitive for NER (which is not true at all).
It is expected that the results using AllenNLP are at least as good as those obtained using Hugging Face's framework.
Related issues or possible duplicates
Environment
OS: Linux
Python version: 3.8.13
Output of
pip freeze
:``` aiohttp==3.8.1 aiosignal==1.2.0 alembic==1.8.1 allennlp==2.10.0 allennlp-models==2.10.0 allennlp-optuna==0.1.7 asttokens==2.0.8 async-timeout==4.0.2 attrs==21.2.0 autopage==0.5.1 backcall==0.2.0 base58==2.1.1 blis==0.7.8 bokeh==2.4.3 boto3==1.24.67 botocore==1.27.67 cached-path==1.1.5 cachetools==5.2.0 catalogue==2.0.8 certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi charset-normalizer==2.1.1 click==8.1.3 cliff==4.0.0 cloudpickle==2.2.0 cmaes==0.8.2 cmd2==2.4.2 colorama==0.4.5 colorlog==6.7.0 commonmark==0.9.1 conllu==4.4.2 converters-datalawyer==0.1.10 cvxopt==1.2.7 cvxpy==1.2.1 cycler==0.11.0 cymem==2.0.6 Cython==0.29.32 datasets==2.4.0 debugpy==1.6.3 decorator==5.1.1 deprecation==2.1.0 dill==0.3.5.1 dkpro-cassis==0.7.2 docker-pycreds==0.4.0 ecos==2.0.10 elasticsearch==7.13.0 emoji==2.0.0 en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl entrypoints==0.4 executing==1.0.0 fairscale==0.4.6 filelock==3.7.1 fire==0.4.0 fonttools==4.37.1 frozenlist==1.3.1 fsspec==2022.8.2 ftfy==6.1.1 future==0.18.2 gensim==4.2.0 gitdb==4.0.9 GitPython==3.1.27 google-api-core==2.8.2 google-auth==2.11.0 google-cloud-core==2.3.2 google-cloud-storage==2.5.0 google-crc32c==1.5.0 google-resumable-media==2.3.3 googleapis-common-protos==1.56.4 greenlet==1.1.3 h5py==3.7.0 hdbscan==0.8.28 huggingface-hub==0.8.1 hyperopt==0.2.7 idna==3.3 importlib-metadata==4.12.0 importlib-resources==5.4.0 inceptalytics==0.1.0 iniconfig==1.1.1 ipykernel==6.15.2 ipython==8.5.0 jedi==0.18.1 Jinja2==3.1.2 jmespath==1.0.1 joblib==1.1.0 jsonnet==0.18.0 jupyter-core==4.11.1 jupyter_client==7.3.5 kiwisolver==1.4.4 krippendorff==0.5.1 langcodes==3.3.0 llvmlite==0.39.1 lmdb==1.3.0 lxml==4.9.1 Mako==1.2.2 MarkupSafe==2.1.1 matplotlib==3.5.3 matplotlib-inline==0.1.6 more-itertools==8.12.0 multidict==6.0.2 multiprocess==0.70.13 murmurhash==1.0.8 nest-asyncio==1.5.5 networkx==2.8.6 nltk==3.7 numba==0.56.2 
numpy==1.23.3 optuna==2.10.1 osqp==0.6.2.post5 overrides==6.2.0 packaging==21.3 pandas==1.4.4 parso==0.8.3 pathtools==0.1.2 pathy==0.6.2 pbr==5.10.0 pexpect==4.8.0 pickleshare==0.7.5 Pillow==9.2.0 pluggy==1.0.0 preshed==3.0.7 prettytable==3.4.1 promise==2.3 prompt-toolkit==3.0.31 protobuf==3.20.0 psutil==5.9.2 pt-core-news-sm @ https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.3.0/pt_core_news_sm-3.3.0-py3-none-any.whl ptyprocess==0.7.0 pure-eval==0.2.2 py==1.11.0 py-rouge==1.1 py4j==0.10.9.7 pyannote.core==4.5 pyannote.database==4.1.3 pyarrow==9.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycaprio==0.2.1 pydantic==1.8.2 pygamma-agreement==0.5.6 Pygments==2.13.0 pympi-ling==1.70.2 pyparsing==3.0.9 pyperclip==1.8.2 pytest==7.1.3 python-dateutil==2.8.2 pytz==2022.2.1 PyYAML==6.0 pyzmq==23.2.1 qdldl==0.1.5.post2 regex==2022.8.17 requests==2.28.1 requests-toolbelt==0.9.1 responses==0.18.0 rich==12.1.0 rsa==4.9 s3transfer==0.6.0 sacremoses==0.0.53 scikit-learn==1.1.2 scipy==1.9.1 scs==3.2.0 seaborn==0.12.0 sentence-transformers==2.2.2 sentencepiece==0.1.97 sentry-sdk==1.9.8 seqeval==1.2.2 setproctitle==1.3.2 shellingham==1.5.0 shortuuid==1.0.9 simplejson==3.17.6 six==1.16.0 sklearn==0.0 smart-open==5.2.1 smmap==5.0.0 sortedcontainers==2.4.0 spacy==3.3.1 spacy-legacy==3.0.10 spacy-loggers==1.0.3 split-datalawyer==0.1.80 SQLAlchemy==1.4.41 srsly==2.4.4 stack-data==0.5.0 stanza==1.4.0 stevedore==4.0.0 tensorboardX==2.5.1 termcolor==1.1.0 TextGrid==1.5 thinc==8.0.17 threadpoolctl==3.1.0 tokenizers==0.12.1 tomli==2.0.1 toposort==1.7 torch==1.13.0.dev20220911+cu117 torchvision==0.14.0.dev20220911+cu117 tornado==6.2 tqdm==4.64.1 traitlets==5.3.0 transformers==4.21.3 typer==0.4.2 typing_extensions==4.3.0 umap==0.1.1 Unidecode==1.3.4 urllib3==1.26.12 wandb==0.12.21 wasabi==0.10.1 wcwidth==0.2.5 word2number==1.1 xxhash==3.0.0 yarl==1.8.1 zipp==3.8.1 ```
Steps to reproduce
I'm attaching some parameters I used for running the CoNLL 2003 grid search.
Example source:
```
export BATCH_SIZE=8
export EPOCHS=10
export gradient_accumulation_steps=4
export dropout=0.2
export weight_decay=0
export seed=42
allennlp tune \
  optuna_conll2003.jsonnet \
  optuna-grid-search-conll2003-hparams.json \
  --optuna-param-path optuna-grid-search-conll2003.json \
  --serialization-dir /models/conll2003/benchmark_allennlp \
  --study-name benchmark-allennlp-models-conll2003 \
  --metrics test_f1-measure-overall \
  --direction maximize \
  --skip-if-exists \
  --n-trials $1
```
[optuna_conll2003.jsonnet](https://github.com/allenai/allennlp/files/9560749/optuna_conll2003.jsonnet.txt) [optuna-grid-search-conll2003.json](https://github.com/allenai/allennlp/files/9560750/optuna-grid-search-conll2003.json.txt) [optuna-grid-search-conll2003-hparams.json](https://github.com/allenai/allennlp/files/9560751/optuna-grid-search-conll2003-hparams.json.txt)