allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Train a model with transformer embeddings and additional_special_tokens #4690

Closed pvcastro closed 3 years ago

pvcastro commented 3 years ago

Description

Hi there! I'm trying to train a transformer-based text classifier model in AllenNLP, but I need to add 5 additional special tokens, in a way compatible with tokenizers lib. I tried adding them to the jsonnet AllenNLP config file and then to the transformer's model path, but neither worked, with each approach having a different problem, which will be described below.

Python traceback:

```
2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Epoch 0/9
2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 10065.304
2020-09-30 23:56:17,484 - WARNING - allennlp.common.util - unable to check gpu_memory_mb() due to occasional failure, continuing
Traceback (most recent call last):
  File "/media/discoD/repositorios/allennlp/allennlp/common/util.py", line 415, in gpu_memory_mb
    encoding="utf-8",
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_monkey.py", line 526, in new_fork_exec
    return getattr(_posixsubprocess, original_name)(args, *patch_fork_exec_executable_list(args, other_args))
OSError: [Errno 12] Cannot allocate memory
2020-09-30 23:56:17,489 - INFO - allennlp.training.trainer - Training 0%| | 0/11817 [00:00
```

Environment

OS: Linux

Python version: 3.7.7

Output of pip freeze:

```
allennlp==1.1.0
allennlp-models==1.1.0
-e git+git@github.com:allenai/allennlp-server.git@bc56288b9295391051f7b7b042fe34219bfe33ab#egg=allennlp_server
attrs==19.3.0
backcall==0.2.0
bleach==3.1.5
blis==0.4.1
boto3==1.14.31
botocore==1.17.31
cachetools==4.1.1
catalogue==1.0.0
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.1
cycler==0.10.0
cymem==2.0.3
cytoolz==0.10.1
decorator==4.4.2
defusedxml==0.6.0
docutils==0.15.2
eland==7.7.0a1
elasticsearch-dsl==7.2.1
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
filelock==3.0.12
fire==0.3.1
Flask==1.1.2
Flask-Cors==3.0.8
ftfy==5.8
future==0.18.2
gevent==20.6.2
greenlet==0.4.16
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
iniconfig==1.0.1
ipykernel==5.3.4
ipython==7.16.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.17.2
jellyfish==0.8.2
Jinja2==2.11.2
jmespath==0.10.0
joblib==0.16.0
jsonnet==0.16.0
jsonpickle==1.4.1
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
Keras==2.4.3
kiwisolver==1.2.0
MarkupSafe==1.1.1
matplotlib==3.3.0
mistune==0.8.4
mkl-fft==1.1.0
mkl-random==1.1.1
mkl-service==2.3.0
more-itertools==8.4.0
murmurhash==1.0.2
nbconvert==5.6.1
nbformat==5.0.7
networkx==2.4
nltk==3.5
notebook==6.0.3
numpy==1.18.5
olefile==0.46
overrides==3.1.0
packaging==20.4
pandas==1.1.0
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.2.0
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
prometheus-client==0.8.0
prompt-toolkit==3.0.5
protobuf==3.12.4
ptyprocess==0.6.0
py==1.9.0
py-rouge==1.1
pydot==1.4.1
pyemd==0.5.1
Pygments==2.6.1
pyparsing==2.4.7
Pyphen==0.9.5
pyrsistent==0.16.0
pytest==6.0.1
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
pyzmq==19.0.1
regex==2020.7.14
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.1
scipy==1.5.2
seaborn==0.11.0
Send2Trash==1.5.0
sentencepiece==0.1.91
seqeval==0.0.12
six==1.15.0
spacy==2.3.2
srsly==1.0.2
tensorboardX==2.1
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
thinc==7.4.1
threadpoolctl==2.1.0
tokenizers==0.8.1rc1
toml==0.10.1
toolz==0.10.0
torch==1.6.0+cu101
torchvision==0.7.0+cu101
tornado==6.0.4
tqdm==4.48.0
traitlets==4.3.3
transformers==3.0.2
urllib3==1.25.10
visualise-spacy-tree==0.0.6
wasabi==0.7.1
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
word2number==1.1
zipp==3.1.0
zope.event==4.4
zope.interface==5.1.0
```

Steps to reproduce

First I tried adding the 5 additional special tokens directly in the jsonnet model config, like this:

    "token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": transformer_model,
                "max_length": transformer_dim,
                "tokenizer_kwargs": {"additional_special_tokens": [['<REL_SEP>'], ['[['], [']]'], ['<<'], ['>>']], "max_len": transformer_dim}
            }
     },

But I ran into a problem at allennlp.common.cached_transformers.get_tokenizer, because cache_key = (model_name, frozenset(kwargs.items())) uses the "tokenizer_kwargs" values as part of the cache key, and the additional_special_tokens list cannot be hashed, so it throws the following exception:

TypeError: unhashable type: 'list'

```
Traceback (most recent call last):
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/pydevd.py", line 1465, in _exec
    runpy._run_module_as_main(module_name, alter_argv=False)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 38, in <module>
    run()
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/media/discoD/repositorios/allennlp/allennlp/commands/__init__.py", line 94, in main
    args.func(args)
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 238, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 433, in _train_worker
    local_rank=process_rank,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 387, in construct_arg
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 628, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/media/discoD/repositorios/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
    model_name, tokenizer_kwargs=tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 71, in __init__
    model_name, add_special_tokens=False, **tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/common/cached_transformers.py", line 101, in get_tokenizer
    cache_key = (model_name, frozenset(kwargs.items()))
TypeError: unhashable type: 'list'
```

I couldn't find a way to make this work by passing the tokens through the config, so I ended up downloading the BERT model to my local disk and adding the tokenizer config files to the same path (the vocab size of my BERT model is 29794, so the last index is 29793). The file contents I changed are in the "Example source" section below.

After debugging, it looks like this config was at least enough to get the BERT tokenizer to recognize the 5 tokens and tokenize the training data accordingly, but then I ran into another issue once training actually began (the one pasted in the "Python traceback" section of this issue).

It looks like this error is due to the transformer model's embedding layer not being resized to the new vocabulary size, which would normally be accomplished with code like model.resize_token_embeddings(len(tokenizer)). I didn't find any code in the AllenNLP library that does something like this, so I'm thinking this is the cause of the issue.
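For reference, outside AllenNLP the usual Hugging Face pattern is roughly the following sketch (model name and tokens are placeholders, not my actual setup):

```
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Register the extra special tokens with the tokenizer...
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
)
# ...then grow the embedding matrix to match the new vocabulary size.
model.resize_token_embeddings(len(tokenizer))
```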

Is there another way to accomplish this using AllenNLP that I'm not aware of? Looks like both ways to expand the vocab size should be possible.

Example source:

`added_tokens.json`:

```
{"<REL_SEP>": 29794, "[[": 29795, "]]": 29796, "<<": 29797, ">>": 29798}
```

`special_tokens_map.json`:

```
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
```

`tokenizer_config.json`:

```
{"do_lower_case": false, "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
```

Thanks!

github-actions[bot] commented 3 years ago

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

pvcastro commented 3 years ago

@dirkgr ? :sweat_smile:

tomsherborne commented 3 years ago

@pvcastro I've managed to get something approximately working by writing a wrapper Model class for the Hugging Face model that calls the HF function model.resize_token_embeddings(new_vocab_size) at the end of the constructor, as well as forcibly creating a custom AllenNLP vocab and a Hugging Face vocab. My use case is BART, so I've modified the bart.py file in allennlp-models. I'm now verifying that this doesn't remove/break the pretrained weight initialisation. I'm not sure if this is useful for you, but let me know if it helps!

pvcastro commented 3 years ago

Hi @tomsherborne! So you had to hardcode tokenizer_kwargs directly in a .py file to pass additional_special_tokens?

tomsherborne commented 3 years ago

@pvcastro yes, you could do that. I found it more stable/manageable to manually create an instance of the relevant HF tokenizer, use add_tokens to add each extra token I needed, and save this to disk. Then, in the AllenNLP config file, I reference a local path under the "model_name" argument for the tokenizer/indexer. I also converted this HF vocabulary into an AllenNLP vocabulary and reference it in the config as:

    "vocabulary": {
        "type": "from_files", 
        "directory": "./path/to/allennlp/version/of/extended/vocab",
        "oov_token": "<unk>",
        "padding_token": "<pad>"
    }
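
For concreteness, the offline step looks roughly like the sketch below (model name, tokens, and path are only placeholders):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
tokenizer.add_tokens(["<REL_SEP>", "[[", "]]", "<<", ">>"])
# Save to disk so the "model_name" argument of the tokenizer/indexer in the
# AllenNLP config can point at this local path instead of the hub name.
tokenizer.save_pretrained("./extended-tokenizer")
```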
pvcastro commented 3 years ago

I see. I guess I'll do the same then, until this is supported by the library. Thanks! @tomsherborne !

epwalsh commented 3 years ago

@pvcastro just FYI, Dirk is currently on vacation and won't be back for another week.

@tomsherborne it seems like it would be useful to have a from_pretrained_transformer vocab type? Then you could just give it the model name or path:

"vocabulary": {
    "type": "from_pretrained_transformer",
    "model_name": "bert-base-cased"
}

This feature request has come up elsewhere as well (see here, for example).

tomsherborne commented 3 years ago

@epwalsh I may have some aspects of the vocabulary API wrong here, but I understood that a specific type was unnecessary right now because the PretrainedTransformerIndexer will force a copy of the pretrained vocabulary into the AllenNLP vocab namespace here. Would you also need a way of forcing the correct OOV/pad tokens from the pretrained version when they aren't the default AllenNLP strings?

epwalsh commented 3 years ago

@tomsherborne thanks for pointing that out, I actually wasn't aware of that.

Would you also need a way of forcing the correct OOV/Pad tokens from the pretrained version when they aren't the default AllenNLP strings?

Hmm do you need that for your use case? Right now the Vocabulary object just has a single padding_token field, but we would really need to be able to specify a different padding_token for each namespace in the Vocabulary. I think that's very doable. More generally, we could have optional namespace-specific settings which could override the default. I'm just wondering if there's a need for that.

NicolasAG commented 3 years ago

Hi, I just want to add my interest in seeing this issue resolved :)

I have the exact same issue as originally stated: trying to add special tokens to the T5 pretrained tokenizer with this extra argument in my tokenizer config:

"tokenizer_kwargs": { "additional_special_tokens": ["##START##", "##END##", ...] }

but that fails because the list type is unhashable.

However, I am not sure I understand the trick of creating a Model wrapper, a new vocab file, etc., as suggested by @tomsherborne.

I guess another solution could be to: (1) manually add the special tokens when creating the data reader:

special_tokens_dict = {'additional_special_tokens': ['##START##', '##END##', ...]}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
logger.info(f'We have added {num_added_toks} tokens')

and then (2) when creating the model we resize the BasicTextFieldEmbedder.token_embedder dimension like this:

text_embedder.t5.resize_token_embeddings(len(vocab))

this is because I have the following config for my model in which I use the pretrained t5 encoder as an embedder:

"source_text_embedder": {
    "type": "basic",
    "token_embedders": {
        "t5": { "type": "pretrained_transformer", "model_name": model_name, .... }
    }
}

Would that work? Will the vocab passed to the model constructor have the additional tokens? Anyway, I'll try it out and see how I can work around this issue, but I'm also really looking forward to this being properly supported by the AllenNLP library :)

pvcastro commented 3 years ago

Hi @NicolasAG. I ended up doing something similar. In my custom reader class I did this in `__init__`:

        self._tokenizer = PretrainedTransformerTokenizer(
            model_name=transformer_model_name,
            tokenizer_kwargs=tokenizer_kwargs,
        )
        self._token_indexers = {
            "tokens": PretrainedTransformerIndexer(
                transformer_model_name, tokenizer_kwargs=tokenizer_kwargs
            )
        }
        # Register the extra special token with both the tokenizer and the
        # indexer's copy of the tokenizer so they stay in sync.
        special_tokens_dict = {'additional_special_tokens': ['<REL_SEP>']}
        self._tokenizer.tokenizer.add_special_tokens(special_tokens_dict)
        self._token_indexers["tokens"]._allennlp_tokenizer.tokenizer.add_special_tokens(special_tokens_dict)

In the model I did this:

        # Only one special token (<REL_SEP>) was added in the reader, hence the +1.
        default_vocab_size = self._embedder.token_embedder_tokens.config.vocab_size
        self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings(default_vocab_size + 1)

NicolasAG commented 3 years ago

Thanks for sharing! I'll try it out :)

dirkgr commented 3 years ago

Can you see if #4781 fixes your original issue?

pvcastro commented 3 years ago

Hi @dirkgr , thanks. I'll give it a try and get back to you.

pvcastro commented 3 years ago

@dirkgr I think it would still be necessary to do something like what was done here:

self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings

Otherwise we'll end up with other errors, because the transformer model's embedding matrix has a dimension that isn't adjusted to the new vocabulary size.

pvcastro commented 3 years ago

I did confirm that #4781 fixes the parsing of additional tokens from config, but, as I stated in the previous comment, it's still necessary to resize the embedding layer to accommodate the new vocab.

tomsherborne commented 3 years ago

@dirkgr would a PR which handles the special case of extending a pretrained Hugging Face model embedder be useful? The AllenNLP model class already does this (here, I think), but it doesn't handle the Hugging Face case, since the HF embeddings don't have the "extend_vocab" attribute. Could this function call pretrained_model.resize_token_embeddings(new_vocab_size) as a special case?
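Something like this rough sketch is what I have in mind (function and attribute names are made up for illustration, not the real AllenNLP API):

```
def resize_or_extend_embedder(embedder, new_vocab_size: int) -> None:
    # Embedders that implement the existing hook keep using it...
    if hasattr(embedder, "extend_vocab"):
        embedder.extend_vocab(new_vocab_size)
    # ...while a wrapped Hugging Face model would need resize_token_embeddings.
    elif hasattr(embedder, "transformer_model"):
        embedder.transformer_model.resize_token_embeddings(new_vocab_size)
```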

dirkgr commented 3 years ago

@tomsherborne, yes, that would be a welcome addition. I'd be happy to review that PR. I think that's not enough by itself, since we still have to call those new methods then, but it would be easy to do so.

dirkgr commented 3 years ago

This is not closed yet. #4781 only goes part of the way.

tomsherborne commented 3 years ago

@dirkgr just to let you know that I will have time for this after EMNLP this week. To get ahead of things - where should I be writing tests for an addition to the Model class?

dirkgr commented 3 years ago

I think the best place is here: https://github.com/allenai/allennlp/blob/master/tests/models/model_test.py

There is already one test in there about extending vocabs!

github-actions[bot] commented 3 years ago

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

dirkgr commented 3 years ago

@tomsherborne, did you get anywhere with this?

tomsherborne commented 3 years ago

@dirkgr Yes, apologies - it's in the works and I'll open a PR when it's done.

NicolasAG commented 3 years ago

Hi, I am also trying to figure out a way to work around this issue. I recently updated my allennlp library to version 1.3, and I can also confirm that #4781 fixes the parsing of additional tokens from config, but I am not sure whether we need to call the pretrained_model.resize_token_embeddings(new_vocab_size) method. Indeed, when I print (from my model class constructor) text_field_embedder.token_embedder_pretrained_transformer.config.vocab_size, I get the correct number of tokens, that is 32102 (from t5-base) + n, with n being the number of tokens in my "additional_special_tokens" config param. So the text field embedder seems to know that I have extra tokens ✔️

~However, the issue I find is the following:~ ~After initializing my model (initializer(self) in the constructor) I have the following debug lines to make sure everything is ok:~

logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2": probably just the oov_token and pad_token
test1 = "a test string that uses some additional_special_tokens..."
tokenized_test1 = tokenizer.tokenize(test1)
logger.info([t.text for t in tokenized_test1])  # shows the correct tokenization
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2" probably just the oov_token and pad_token
tokenids_test1 = indexers["pretrained_transformer"].tokens_to_indices(tokenized_test1, vocab)
logger.info(tokenids_test1)  # shows the correct tokenization with token_ids > 32102 (the initial vocab size of t5-base)
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows 32102 --> why not 32102+n ??
detokenized1 = indexers["pretrained_transformer"].indices_to_tokens(tokenids_test1, vocab)  # throws a KeyError on special tokens that have an id >= 32102 :(

~This last line throws a KeyError on token_ids >= 32102. It seems like the tokens_to_indices() function works fine but not indices_to_tokens(); is this behavior expected?~ [EDIT]: this was because I didn't specify any tokenizer_kwargs for the TokenIndexer or the TokenEmbedder.

I'm curious to know where you are at @tomsherborne and if you observe something similar? Thanks

pvcastro commented 3 years ago

@NicolasAG doesn't this work if you do the resize? In my scenario I used the tokenizer_kwargs for both the tokenizer and indexer in the dataset reader and for the embedder in the actual model. In the model, only using tokenizer_kwargs wasn't enough, I had to do the resize_token_embeddings as well.

NicolasAG commented 3 years ago

@pvcastro Thanks! You reminded me that I forgot to use tokenizer_kwargs in the TokenIndexer and the TokenEmbedder! In my config file I only specified the extra tokens for the Tokenizer, but not the indexer or the embedder... This is my config now:

    "source_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": "t5-base",
        "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, },
    },
    "source_token_indexers": {
        "pretrained_transformer": {
            "type": "pretrained_transformer",
            "model_name": "t5-base",
            "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, }
        }
    },
    "source_text_embedder": {
            "type": "basic",
            "token_embedders": {
                "pretrained_transformer": {
                    "type": "pretrained_transformer",
                    "model_name": "t5-base",
                    "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, }
                },
            },
        },

and it works fine without doing the resize ✔️

pvcastro commented 3 years ago

Now that #4946 got merged, even though I was the one that opened this issue, I'd like your input before closing it. What do you think @dirkgr and @tomsherborne ?

dirkgr commented 3 years ago

I think both sub-issues have been resolved, but I never had a local repro of the problem. Does it work now end-to-end?

pvcastro commented 3 years ago

For me it's all good now. I was just wondering because @tomsherborne was preparing an additional contribution.

tomsherborne commented 3 years ago

@pvcastro I think your solution is good. I was working on something but other commitments have ended up taking higher priority. Apologies that this never came to fruition.

pvcastro commented 3 years ago

Thanks for the effort @tomsherborne !