What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67
If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like `[SEP]` or `[CLS]`.
> What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67
At the very least, you would need `target_add_end_token` to be `True` so that the decoder can stop before it hits the `max_decoding_length`, right? If that's correct, then the `ValueError` is still raised even if `source_add_start_token` and `source_add_end_token` are `False` (I confirmed this).
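For reference, the failing construction looks like this (a minimal sketch; the import path for `Seq2SeqDatasetReader` is an assumption):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp_models.generation.dataset_readers import Seq2SeqDatasetReader

# Even with both source-side flags disabled, the start/end-symbol check in
# __init__ still runs, so this raises "ValueError: Bad start or end symbol"
# because @start@/@end@ are not in BERT's vocabulary.
reader = Seq2SeqDatasetReader(
    source_tokenizer=PretrainedTransformerTokenizer("bert-base-uncased"),
    source_add_start_token=False,
    source_add_end_token=False,
)
```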
> If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like `[SEP]` or `[CLS]`.
Is there a clean way to change AllenNLP's `START_SYMBOL` and `END_SYMBOL`? i.e. in a config or otherwise? I went looking but couldn't find it.
> At the very least, you would need `target_add_end_token` to be `True` so that the decoder can stop before it hits the `max_decoding_length`, right?
Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.
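For example (a quick check with a BERT tokenizer; `add_special_tokens` defaults to `True`, as discussed further down):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# With add_special_tokens left at its default of True, the tokenizer brackets
# the input with its own special tokens, so [SEP] can serve as the end symbol.
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
print([t.text for t in tokenizer.tokenize("hello world")])
# ['[CLS]', 'hello', 'world', '[SEP]']
```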
> Is there a clean way to change AllenNLP's `START_SYMBOL` and `END_SYMBOL`? i.e. in a config or otherwise?
Yes, just set `start_symbol` and `end_symbol` in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69
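For concreteness, a minimal sketch of that in Python (the import path and the `[CLS]`/`[SEP]` choice are assumptions; `add_special_tokens=False` keeps the tokenizer from wrapping the probe text with extra tokens):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp_models.generation.dataset_readers import Seq2SeqDatasetReader

# Use symbols that are already in BERT's vocab instead of @start@/@end@.
reader = Seq2SeqDatasetReader(
    source_tokenizer=PretrainedTransformerTokenizer(
        "bert-base-uncased", add_special_tokens=False
    ),
    start_symbol="[CLS]",
    end_symbol="[SEP]",
)
```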
> Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.
Gotcha. That is probably the best solution for me.
> Yes, just set `start_symbol` and `end_symbol` in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69
Thanks. Not sure how I missed that.
It still seems like there is a bug here. BPE-based tokenizers will throw an error if at least one of `source_add_start_token`, `source_add_end_token`, `target_add_start_token`, or `target_add_end_token` is `True`, even if you took care to add AllenNLP's start and end symbols to the tokenizer's vocabulary and you disable the tokenizer from adding its own special tokens with `add_special_tokens=False`:
```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}
tokenizer = PretrainedTransformerTokenizer(
    "distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False
)
tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# >> [@start@, Ġ, @end@]
```
This is because the try/except block of `Seq2SeqDatasetReader` will fail, as the tokenizer does not return a list of length two (it encodes the space as `"Ġ"`).
What if the check were more explicit, comparing only the first and last tokens (so a stray artifact token like the `Ġ` above wouldn't trip it up)? Something like:
```python
tokens = self._source_tokenizer.tokenize(start_symbol + " " + end_symbol)
if tokens[0] != self._start_token or tokens[-1] != self._end_token:
    raise ValueError(
        f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
        f"for tokenizer {self._source_tokenizer}"
    )
```
I think that's a good idea. Want to make a PR? Feel free to tag me as a reviewer. I also just realized the `start_symbol`/`end_symbol` parameters are missing from the docstring. Would be good to update the docstring as well.
Sounds good, I'll do both!
@epwalsh I took a closer look at the `start_symbol` and `end_symbol` arguments in `Seq2SeqDatasetReader`, and it looks like they are never used (besides in the try/except)? Either this is a bug, or their purpose is not to update the default choices of `START_SYMBOL` and `END_SYMBOL`?
Hmm, it's a weird way of doing it, but the result of https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L95-L97 is essentially the same as

```python
self._start_symbol = start_symbol
self._end_symbol = end_symbol
```
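To see what those linked lines actually store, here is a standalone sketch reconstructed from the traceback in the issue body below (the symbols must already be single tokens in the vocab for the unpack to succeed):

```python
from allennlp.common.util import START_SYMBOL, END_SYMBOL
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}
tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False
)
# This mirrors seq2seq.py line 98: unpack exactly two tokens, or raise.
# The stored values are Token objects whose text is the symbol string.
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
assert start_token.text == START_SYMBOL and end_token.text == END_SYMBOL
```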
Checklist

- [x] I have verified that the issue exists against the `master` branch of AllenNLP.
- [x] I have included the output of `pip freeze`.

Description

As far as I can tell, `PretrainedTransformerTokenizer` is not compatible with the `seq2seq` dataset reader of `allennlp-models` when it is used as the `source_tokenizer`. The same error, contained in this try/except block here, is triggered in multiple cases:

1. `allennlp.common.util.START_SYMBOL` and `allennlp.common.util.END_SYMBOL` are not in the pretrained transformer's vocabulary. I was able to solve this in the config by adding them as `additional_special_tokens` (see the reproduction below).
2. `PretrainedTransformerTokenizer.add_special_tokens` is `True` (the default) for wordpiece-based tokenizers.
3. The tokenizer is BPE-based, in which case the error occurs regardless of the above (see the reproduction below).

The error arises because there are more than two tokens in the list returned by `self._source_tokenizer.tokenize` in the try/except block for all cases listed above.

Python traceback:
```
2020-11-16 16:53:24,760 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 98, in __init__
    self._start_token, self._end_token = self._source_tokenizer.tokenize(
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/johnmg/seq2rel/bin/allennlp", line 33, in <module>
    sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
  File "/project/6006286/johnmg/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/project/6006286/johnmg/allennlp/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 110, in train_model_from_args
    train_model_from_file(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 170, in train_model_from_file
    return train_model(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 236, in train_model
    model = _train_worker(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 453, in _train_worker
    train_loop = TrainModel.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 627, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 198, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 305, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 339, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 629, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 102, in __init__
    raise ValueError(
ValueError: Bad start or end symbol ('@start@', '@end@') for tokenizer
```
Related issues or possible duplicates
Environment
OS: Linux
Python version: 3.8.0
Output of `pip freeze`:

```bash
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/avx2
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/generic
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
-e git+https://github.com/allenai/allennlp.git@0d8873cfef628eaf0457bee02422bbf8dae475a2#egg=allennlp
-e git+https://github.com/allenai/allennlp-models.git@236034ff54ac3197ec4d710438cebdfa919c5a45#egg=allennlp_models
attrs==20.2.0
blis==0.4.1
boto3==1.16.2
botocore==1.19.2
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.2.1
cymem==2.0.2
filelock==3.0.12
ftfy==5.5.1
future==0.18.2
h5py==2.10.0
idna==2.10
importlib-metadata==2.0.0
iniconfig==1.0.1
jmespath==0.10.0
joblib==0.17.0
jsonnet==0.14.0
jsonpickle==1.4.1
more-itertools==8.5.0
murmurhash==1.0.2
nltk==3.5
numpy==1.19.1
overrides==3.1.0
packaging==20.4
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
protobuf==3.13.0
py==1.9.0
py-rouge==1.1
pyparsing==2.4.7
pytest==6.0.1
python-dateutil==2.8.1
regex==2019.11.1
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.0
scipy==1.5.2
sentencepiece==0.1.91
six==1.15.0
spacy==2.2.2
srsly==0.2.0
tensorboardX==2.1
thinc==7.3.1
threadpoolctl==2.1.0
tokenizers==0.9.3
toml==0.10.1
torch==1.7.0
tqdm==4.51.0
transformers==3.5.1
typing-extensions==3.7.4.3
urllib3==1.25.10
wasabi==0.6.0
wcwidth==0.2.5
word2number==1.1
zipp==3.4.0
```
Steps to reproduce
Example source:
The proximate cause of the error can be reproduced as follows:

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}

# Case 1, don't add AllenNLP's start/end symbols to the vocabulary
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 2, set add_special_tokens=True (the default) in
# PretrainedTransformerTokenizer for a wordpiece-based tokenizer

# this WON'T fail
tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# this WILL fail
tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 3, BPE-based tokenizers fail regardless

# this WILL fail
tokenizer = PretrainedTransformerTokenizer(
    "distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# this WILL fail
tokenizer = PretrainedTransformerTokenizer(
    "distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True
)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
```