allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

PretrainedTransformerTokenizer doesn't work with seq2seq dataset reader #4798

Closed JohnGiorgi closed 3 years ago

JohnGiorgi commented 3 years ago

Description

As far as I can tell, PretrainedTransformerTokenizer is not compatible with the seq2seq dataset reader of allennlp-models when it is used as the source_tokenizer. The same error, raised in the try/except block here, is triggered in multiple cases:

  1. When allennlp.common.util.START_SYMBOL and allennlp.common.util.END_SYMBOL are not in the pretrained transformer's vocabulary. I was able to solve this in the config as follows:

    "dataset_reader": {
        "type": "copynet_seq2seq",
        "target_namespace": "target_tokens",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": "distilroberta-base",
            "tokenizer_kwargs": {
                "additional_special_tokens": ["@start@", "@end@"]
            }
        }
    },

  2. When PretrainedTransformerTokenizer.add_special_tokens is True (the default) for wordpiece-based tokenizers.
  3. For any BPE-based tokenizer I tried.

The error arises because, in all of the cases listed above, self._source_tokenizer.tokenize returns a list of more than two tokens in this try/except block:

try:
    self._start_token, self._end_token = self._source_tokenizer.tokenize(
        start_symbol + " " + end_symbol
    )
except ValueError:
    raise ValueError(
        f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
        f"for tokenizer {self._source_tokenizer}"
    )
Python traceback:

```
2020-11-16 16:53:24,760 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 98, in __init__
    self._start_token, self._end_token = self._source_tokenizer.tokenize(
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/johnmg/seq2rel/bin/allennlp", line 33, in <module>
    sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
  File "/project/6006286/johnmg/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/project/6006286/johnmg/allennlp/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 110, in train_model_from_args
    train_model_from_file(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 170, in train_model_from_file
    return train_model(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 236, in train_model
    model = _train_worker(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 453, in _train_worker
    train_loop = TrainModel.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 627, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 198, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 305, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 339, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 629, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 102, in __init__
    raise ValueError(
ValueError: Bad start or end symbol ('@start@', '@end@') for tokenizer
```

Environment

OS: Linux

Python version: 3.8.0

Output of pip freeze:

```
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/avx2
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/generic
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
-e git+https://github.com/allenai/allennlp.git@0d8873cfef628eaf0457bee02422bbf8dae475a2#egg=allennlp
-e git+https://github.com/allenai/allennlp-models.git@236034ff54ac3197ec4d710438cebdfa919c5a45#egg=allennlp_models
attrs==20.2.0
blis==0.4.1
boto3==1.16.2
botocore==1.19.2
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.2.1
cymem==2.0.2
filelock==3.0.12
ftfy==5.5.1
future==0.18.2
h5py==2.10.0
idna==2.10
importlib-metadata==2.0.0
iniconfig==1.0.1
jmespath==0.10.0
joblib==0.17.0
jsonnet==0.14.0
jsonpickle==1.4.1
more-itertools==8.5.0
murmurhash==1.0.2
nltk==3.5
numpy==1.19.1
overrides==3.1.0
packaging==20.4
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
protobuf==3.13.0
py==1.9.0
py-rouge==1.1
pyparsing==2.4.7
pytest==6.0.1
python-dateutil==2.8.1
regex==2019.11.1
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.0
scipy==1.5.2
sentencepiece==0.1.91
six==1.15.0
spacy==2.2.2
srsly==0.2.0
tensorboardX==2.1
thinc==7.3.1
threadpoolctl==2.1.0
tokenizers==0.9.3
toml==0.10.1
torch==1.7.0
tqdm==4.51.0
transformers==3.5.1
typing-extensions==3.7.4.3
urllib3==1.25.10
wasabi==0.6.0
wcwidth==0.2.5
word2number==1.1
zipp==3.4.0
```

Steps to reproduce

Example source:

The proximate cause of the error can be reproduced as follows:

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}

# Case 1, don't add AllenNLP's start/end symbols to the vocabulary
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 2, set add_special_tokens=True (the default) in PretrainedTransformerTokenizer
# for a wordpiece-based tokenizer

# this WON'T fail
tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# this WILL fail
tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 3, BPE-based tokenizers fail regardless

# this WILL fail
tokenizer = PretrainedTransformerTokenizer(
    "distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# this WILL fail
tokenizer = PretrainedTransformerTokenizer(
    "distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
```

epwalsh commented 3 years ago

What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67

If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].
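Roughly, the first option would look something like this in the config (an untested sketch; the parameter names come from the Seq2SeqDatasetReader linked above, so adjust if you're using a different reader):

```
"dataset_reader": {
    "type": "seq2seq",
    "source_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": "distilroberta-base"
    },
    "source_add_start_token": false,
    "source_add_end_token": false,
    "target_add_start_token": false,
    "target_add_end_token": false
}
```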

JohnGiorgi commented 3 years ago

What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67

At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right? If that's correct, then the ValueError is still raised even if source_add_start_token and source_add_end_token are False (I confirmed this).

If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].

Is there a clean way to change AllenNLP's START_SYMBOL and END_SYMBOL, e.g. in a config or otherwise? I went looking but couldn't find one.

epwalsh commented 3 years ago

At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right?

Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.
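For example (a rough illustration, output not verified here), a pretrained transformer tokenizer with add_special_tokens=True already wraps the text in the model's own boundary tokens, which the decoder can learn to stop on:

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# distilroberta-base uses <s> and </s> as its boundary tokens
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", add_special_tokens=True)
print(tokenizer.tokenize("a target sentence"))
# expected to look roughly like: [<s>, a, Ġtarget, Ġsentence, </s>]
```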

Is there a clean way to change AllenNLP's START_SYMBOL and END_SYMBOL, e.g. in a config or otherwise?

Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69
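For instance (an untested sketch, assuming a BERT-style vocabulary where [CLS] and [SEP] exist):

```
"dataset_reader": {
    "type": "seq2seq",
    "source_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": "bert-base-uncased"
    },
    "start_symbol": "[CLS]",
    "end_symbol": "[SEP]"
}
```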

JohnGiorgi commented 3 years ago

Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.

Gotcha. That is probably the best solution for me.

Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69

Thanks. Not sure how I missed that.


It still seems like there is a bug here. BPE-based tokenizers will throw an error if at least one of source_add_start_token, source_add_end_token, target_add_start_token, or target_add_end_token is True, even if you take care to add AllenNLP's start and end symbols to the tokenizer's vocabulary and disable the tokenizer's own special tokens with add_special_tokens=False:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# >> [@start@, Ġ, @end@]

This is because the try/except block of Seq2SeqDatasetReader expects exactly two tokens, but the tokenizer returns three here (the space between the symbols is encoded as "Ġ").

What if the check were more explicit? Something like:

tokens = self._source_tokenizer.tokenize(start_symbol + " " + end_symbol)
# Check that the symbols survive tokenization at the start and end,
# rather than requiring exactly two tokens.
if tokens[0].text != start_symbol or tokens[-1].text != end_symbol:
    raise ValueError(
        f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
        f"for tokenizer {self._source_tokenizer}"
    )
self._start_token, self._end_token = tokens[0], tokens[-1]
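With the distilroberta-base example above, tokens comes back as [@start@, Ġ, @end@], so tokens[0] and tokens[-1] are still the intended symbols and no error would be raised, while a genuinely missing or mangled symbol would still be caught.
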
epwalsh commented 3 years ago

I think that's a good idea. Want to make a PR? Feel free to tag me as a reviewer. I also just realized the start/end_symbol parameters are missing from the docstring. Would be good to update the docstring as well.

JohnGiorgi commented 3 years ago

Sounds good, I'll do both!

JohnGiorgi commented 3 years ago

@epwalsh I took a closer look at the start_symbol and end_symbol arguments of Seq2SeqDatasetReader, and it looks like they are never used outside of the try/except. Is this a bug, or is their purpose not to override the default START_SYMBOL and END_SYMBOL?

epwalsh commented 3 years ago

Hmm, it's a weird way of doing it, but the result of https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L95-L97 is essentially the same as

self._start_symbol = start_symbol
self._end_symbol = end_symbol