Closed dcfidalgo closed 4 years ago
The model is fine. This line adds the special tokens we need.

The pooler has `cls_is_last_token` set to `True`, which is a questionable choice for RoBERTa, but it just means we get the embedding of the `"</s>"` token instead of `"<s>"`. That's not the end of the world.
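For illustration, a minimal sketch of what that setting changes (plain PyTorch, not AllenNLP's actual `ClsPooler` implementation):

```python
import torch

# Sketch of what a CLS pooler does: it picks a single token vector out of a
# (batch, num_tokens, dim) sequence of embeddings.
embeddings = torch.randn(1, 9, 768)  # e.g. "<s>" ... "</s>" for one sentence

first_token = embeddings[:, 0]   # cls_is_last_token = False -> "<s>" embedding
last_token = embeddings[:, -1]   # cls_is_last_token = True  -> "</s>" embedding
print(first_token.shape, last_token.shape)  # torch.Size([1, 768]) twice
```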
Thanks @dirkgr for the clarification, and sorry to bother you again! Maybe I do not get the full picture, but I would be super grateful if you could point me further in the right direction.
> The model is fine. This line adds the special tokens we need.
If I understand it correctly, the `PretrainedTransformerIndexer._process_output()` method only triggers modifications if `self._max_length` is not `None`, but in the model jsonnet `max_length` is not specified and the default seems to be `None`.
Also, in the steps to reproduce the described behavior (see the issue description above) the embedder returns a `torch.Size([1, 5, 768])` tensor, and I guess the 5 vectors belong to the 5 "word tokens" returned by the `SpacyTokenizer`.
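That count is easy to check with the tokenizer on its own; a small sketch, assuming spaCy's default English model:

```python
from allennlp.data.tokenizers import SpacyTokenizer

# "Check this annoying string!" splits into 5 word tokens, matching the
# 5 vectors in the torch.Size([1, 5, 768]) output mentioned above.
tokens = SpacyTokenizer().tokenize("Check this annoying string!")
print(tokens)       # [Check, this, annoying, string, !]
print(len(tokens))  # 5
```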
> The pooler has `cls_is_last_token` set to `True`, which is a questionable choice for RoBERTa, but it just means we get the embedding of the `"</s>"` token instead of `"<s>"`. That's not the end of the world.
Is it not set to `False`? Maybe I am reading the jsonnet in the wrong way ...
Sorry again to bother you, and thank you for your time!
You are right about these things, but I see the correct tokens in the debugger. The sequences that end up in the `forward()` method all start with `0`, as they should. I'm investigating ...
This is the line where the special tokens are added: https://github.com/allenai/allennlp/blob/master/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py#L387
When I run your example, `print(tensor_dict)` also prints a tensor that starts with `"<s>"` (`0`) and ends with `"</s>"` (`2`). Looks like everything is alright?
```
In [6]: print(tensor_dict)
{'text': {'tokens': {'token_ids': tensor([[    0, 26615,  9226,  2279,  2160,   154, 20951,   328,     2]]),
 'mask': tensor([[True, True, True, True, True]]),
 'type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'wordpiece_mask': tensor([[True, True, True, True, True, True, True, True, True]]),
 'offsets': tensor([[[1, 1],
          [2, 2],
          [3, 5],
          [6, 6],
          [7, 7]]])}}}
```
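For reference, the special tokens can also be inspected directly on the wordpiece tokenizer; a small sketch, assuming the same model name as in the example:

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# The wordpiece tokenizer adds the special tokens itself, so the first and
# last ids of the tokenized sentence should be 0 ("<s>") and 2 ("</s>").
tokenizer = PretrainedTransformerTokenizer(model_name="distilroberta-base")
wordpieces = tokenizer.tokenize("Check this annoying string!")
print([t.text_id for t in wordpieces])  # starts with 0, ends with 2
```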
Thank you @dirkgr for investigating further, and sorry for my delayed answer!
You are right, the output of the indexer does contain the word piece indexes, but I think the key to the described behavior is the `offsets` and how the `PretrainedTransformerMismatchedEmbedder` uses them. If you set a breakpoint at the end of the forward method in a debugger, you can see that the first returned embedding vector does not correspond to the index 0 (`<s>`) token (compare `embeddings` with the returned `orig_embeddings`). This is due to the first offset being `[1, 1]`. In the `basic_classifier` model we then pass these returned embeddings into the `cls_pooler`.
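To make that concrete, here is a minimal sketch (plain PyTorch, not the embedder's actual code, with random vectors standing in for the transformer output) of how the offsets map wordpiece embeddings back to word tokens:

```python
import torch

# Wordpiece embeddings for one sentence: "<s>" + 7 wordpieces + "</s>".
wordpiece_embeddings = torch.randn(1, 9, 768)
# Offsets from the indexer output above: one [start, end] wordpiece span per word.
offsets = torch.tensor([[[1, 1], [2, 2], [3, 5], [6, 6], [7, 7]]])

# For every word token, pool (here: average) the wordpiece vectors in its span.
word_vectors = []
for start, end in offsets[0].tolist():
    word_vectors.append(wordpiece_embeddings[0, start : end + 1].mean(dim=0))
word_embeddings = torch.stack(word_vectors).unsqueeze(0)  # torch.Size([1, 5, 768])

# The first returned vector comes from wordpiece index 1, not index 0 ("<s>"),
# so a downstream cls_pooler sees the first "real" word instead of the CLS token.
print(torch.allclose(word_embeddings[0, 0], wordpiece_embeddings[0, 1]))  # True
```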
Thanks again for your time and please let me know if I should provide other examples!
I get it now. You are right! I put a fix at https://github.com/allenai/allennlp-models/pull/99.
Checklist

I have verified that the issue exists against the `master` branch of AllenNLP.
I have included in the "Environment" section below the output of `pip freeze`.

Description
I think the usage of the `cls_pooler` as `seq2vec_encoder` is not appropriate in this model. If I am not mistaken, the `PretrainedTransformerMismatchedIndexer`/`Embedder` get rid of the special tokens via the `offsets`, so the `cls_pooler` just takes the embedding of the first "real text" token.

Python traceback:
Related issues or possible duplicates
Environment
OS: Ubuntu 20.04
Python version: 3.7.7
Output of `pip freeze`:

```
absl-py==0.9.0 aiohttp==3.6.2 alembic==1.4.2 allennlp==1.0.0 appdirs==1.4.4 astroid==2.4.2 async-timeout==3.0.1 attrs==19.3.0 azure-core==1.7.0 azure-storage-blob==12.3.2 backcall==0.2.0 beautifulsoup4==4.9.1 -e git+git@github.com:recognai/biome-text.git@7a22136a713f634587702e096f778ea44aa94123#egg=biome_text black==19.10b0 bleach==3.1.5 blis==0.4.1 bokeh==2.0.2 boto3==1.14.7 botocore==1.17.7 cachetools==4.1.1 cachey==0.2.1 captum==0.2.0 catalogue==1.0.0 certifi==2020.6.20 cffi==1.14.0 chardet==3.0.4 click==7.1.2 cloudpickle==1.4.1 colorama==0.4.3 coverage==5.1 cryptography==2.9.2 cycler==0.10.0 cymem==2.0.3 dask==2.17.2 dask-elk==0.4.0 databricks-cli==0.11.0 decorator==4.4.2 defusedxml==0.6.0 distributed==2.19.0 docker==4.2.2 docutils==0.15.2 elasticsearch==7.8.0 en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz entrypoints==0.3 fastapi==0.55.1 filelock==3.0.12 Flask==1.1.2 Flask-Cors==3.0.8 flatdict==4.0.1 fsspec==0.7.4 future==0.18.2 gevent==1.4.0 gitdb==4.0.5 GitPython==3.1.3 google==2.0.3 google-auth==1.18.0 google-auth-oauthlib==0.4.1 gorilla==0.3.0 greenlet==0.4.16 grpcio==1.30.0 gunicorn==20.0.4 h11==0.9.0 h5py==2.10.0 HeapDict==1.0.1 httptools==0.1.1 idna==2.9 importlib-metadata==1.6.1 importlib-resources==2.0.1 ipykernel==5.3.0 ipython==7.15.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isodate==0.6.0 isort==4.3.21 itsdangerous==1.1.0 jedi==0.17.1 Jinja2==2.11.2 jmespath==0.10.0 joblib==0.15.1 json5==0.9.5 jsonnet==0.16.0 jsonpickle==1.4.1 jsonschema==3.2.0 jupyter-client==6.1.3 jupyter-core==4.6.3 jupyterlab==2.1.5 jupyterlab-server==1.1.5 kiwisolver==1.2.0 lazy-object-proxy==1.4.3 locket==0.2.0 lxml==4.5.1 Mako==1.1.3 Markdown==3.2.2 MarkupSafe==1.1.1 matplotlib==3.2.2 mccabe==0.6.1 memory-profiler==0.57.0 mistune==0.8.4 mkl-fft==1.1.0 mkl-random==1.1.1 mkl-service==2.3.0 mlflow==1.9.1 more-itertools==8.4.0 msgpack==0.6.2 msrest==0.6.17 multidict==4.7.6 murmurhash==1.0.2 nbconvert==5.6.1 nbdime==2.0.0 nbformat==5.0.7 nltk==3.5 notebook==6.0.3 numpy==1.18.1 oauthlib==3.1.0 olefile==0.46 overrides==3.0.0 packaging==20.4 pandas==1.0.5 pandocfilters==1.4.2 parso==0.7.0 partd==1.1.0 pathspec==0.8.0 pdoc3==0.8.1 pexpect==4.8.0 pickleshare==0.7.5 Pillow==7.1.2 plac==1.1.3 pluggy==0.13.1 preshed==3.0.2 prometheus-client==0.8.0 prometheus-flask-exporter==0.14.1 prompt-toolkit==3.0.5 protobuf==3.12.2 psutil==5.7.0 ptyprocess==0.6.0 py==1.8.2 py-spy==0.3.3 pyarrow==0.17.1 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser==2.20 pydantic==1.5.1 Pygments==2.6.1 pygraphviz==1.3 pylint==2.5.3 pyparsing==2.4.7 pyrsistent==0.16.0 pytest==5.4.3 pytest-cov==2.10.0 pytest-notebook==0.6.0 pytest-pylint==0.14.1 python-dateutil==2.8.1 python-editor==1.0.4 pytz==2020.1 PyYAML==5.3.1 pyzmq==19.0.1 querystring-parser==1.2.4 ray==0.8.6 redis==3.4.1 regex==2020.6.8 requests==2.24.0 requests-oauthlib==1.3.0 rsa==4.6 s3fs==0.4.2 s3transfer==0.3.3 sacremoses==0.0.43 scikit-learn==0.23.1 scipy==1.5.0 Send2Trash==1.5.0 sentencepiece==0.1.91 six==1.15.0 smmap==3.0.4 sortedcontainers==2.2.2 soupsieve==2.0.1 spacy==2.2.4 SQLAlchemy==1.3.13 sqlparse==0.3.1 srsly==1.0.2 starlette==0.13.2 tabulate==0.8.7 tblib==1.6.0 tensorboard==2.2.2 tensorboard-plugin-wit==1.7.0 tensorboardX==2.0 terminado==0.8.3 testpath==0.4.4 thinc==7.4.0 threadpoolctl==2.1.0 tokenizers==0.7.0 toml==0.10.1 toolz==0.10.0 torch==1.5.1 torchvision==0.6.0a0+35d732a tornado==6.0.4 tqdm==4.46.1 traitlets==4.3.3 transformers==2.11.0 typed-ast==1.4.1 typing-extensions==3.7.4.2
ujson==2.0.3 urllib3==1.25.9 uvicorn==0.11.5 uvloop==0.14.0 wasabi==0.7.0 wcwidth==0.2.4 webencodings==0.5.1 websocket-client==0.57.0 websockets==8.1 Werkzeug==1.0.1 widgetsnbextension==3.5.1 wrapt==1.12.1 xlrd==1.2.0 yarl==1.4.2 zict==2.0.0 zipp==3.1.0
```
Steps to reproduce
Example source:
```python
from allennlp.data.tokenizers import SpacyTokenizer
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.data.fields import TextField
from allennlp.data.vocabulary import Vocabulary
from allennlp.data.instance import Instance
from allennlp.data import Batch
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

input_str = "Check this annoying string!"

tokenizer = SpacyTokenizer()
token_indexer = {
    "tokens": PretrainedTransformerMismatchedIndexer(model_name="distilroberta-base")
}

tf = TextField(tokenizer.tokenize(input_str), token_indexer)
instance = Instance({"text": tf})

vocab = Vocabulary.from_instances([instance])
batch = Batch([instance])
batch.index_instances(vocab)
padding_length = batch.get_padding_lengths()

embedder = PretrainedTransformerMismatchedEmbedder(model_name="distilroberta-base")
tf_embedder = BasicTextFieldEmbedder({"tokens": embedder})

tensor_dict = batch.as_tensor_dict(padding_length)
embeddings = tf_embedder(tensor_dict["text"])

print(tf)
print(tensor_dict)
print(embeddings)
```