ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

eos_token_id is None #28

Closed mikhovr closed 3 years ago

mikhovr commented 3 years ago

The tokenizer has a special token eos_token = '<|endoftext|>'. However, this token seems to be missing from the vocabulary. Hence, when I try to encode a sequence like so:

tokenizer.encode("привет" + tokenizer.eos_token"), I get [960, 577, None] which follows to

Traceback (most recent call last):
  File "/home/superuser/khovrichev/gpt2bot/run_console_bot.py", line 14, in <module>
    run_bot(**config)
  File "/home/superuser/khovrichev/gpt2bot/gpt2bot/console_bot.py", line 66, in run_bot
    bot_messages = generate_text(prompt, pipeline, **generator_kwargs)
  File "/home/superuser/khovrichev/gpt2bot/gpt2bot/utils.py", line 301, in generate_text
    responses = generator.generate_next(prompt)
  File "/home/superuser/khovrichev/gpt2bot/gpt2bot/utils.py", line 103, in generate_next
    encoded_prompt = self.tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
  File "/home/superuser/khovrichev/ru-gpts/gpt_env/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 919, in encode
    **kwargs,
  File "/home/superuser/khovrichev/ru-gpts/gpt_env/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 1069, in encode_plus
    return_special_tokens_mask=return_special_tokens_mask,
  File "/home/superuser/khovrichev/ru-gpts/gpt_env/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 1463, in prepare_for_model
    encoded_inputs["input_ids"] = torch.tensor([encoded_inputs["input_ids"]])
RuntimeError: Could not infer dtype of NoneType

Why is eos_token present in the tokenizer but absent from the vocab?
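
For reference, a minimal check along these lines (a sketch; the tokenizer is loaded by the hub name that appears in the 3.4.0 dump below) shows the mismatch:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")

print(tokenizer.eos_token)                        # '<|endoftext|>'
print(tokenizer.eos_token in tokenizer.encoder)   # False: the string is not in vocab.json
print(tokenizer.eos_token_id)                     # None, hence the crash above
print(tokenizer.convert_tokens_to_ids("</s>"))    # 2, a token that is actually in the vocab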

king-menin commented 3 years ago

We trained the models without this token (only plain text was used), but the first tokens in the vocab are special tokens. What version of transformers do you use?

mikhovr commented 3 years ago

To reproduce the behaviour above, I used transformers==2.8.0 (from requirements.txt). My overall goal is to make the model generate sentences separated from the context, not just continue the context sentence. For that I use the text-generation pipeline from 3.4.0.

Moreover, the raw model tries to generate max_length tokens even if the output is an incomplete sentence. It seems that it doesn't consider </s> during beam search at all. I don't know if this behaviour is related.
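
For the 3.4.0 pipeline case, one possible workaround is to pass the id of '</s>' to generate() explicitly. The sketch below assumes the small model from the dump further down and only helps if the model ever emits that token during decoding:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "sberbank-ai/rugpt3small_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

eos_id = tokenizer.convert_tokens_to_ids("</s>")             # 2 in this vocab
input_ids = tokenizer.encode("привет", return_tensors="pt")

output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    eos_token_id=eos_id,   # lets beam search stop at '</s>' instead of always running to max_length
    pad_token_id=eos_id,   # pads finished beams, since pad_token_id is also None for this tokenizer
)
print(tokenizer.decode(output[0], skip_special_tokens=True))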

The tokenizer attributes in 2.8.0, when using generate_transformers.py, are:

<transformers.tokenization_gpt2.GPT2Tokenizer object at 0x7f11f646d9d0>
NO_PAD_TOKEN_FOR_BATCH_MSG = {str} 'No padding token is set for this model, therefore no batch can be made with uneven sequences. Set a padding token or adjust the lengths of the sequences building the batch so that every sequence is of the same length.'
SPECIAL_TOKENS_ATTRIBUTES = {list: 8} ['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
UNEVEN_SEQUENCES_FOR_BATCH_MSG = {str} 'The sequences building the batch are not of the same size, no tensor can be built. Set `pad_to_max_length=True` to pad the smaller sequencesup to the larger sequence\'s length.'
added_tokens_decoder = {dict: 0} {}
added_tokens_encoder = {dict: 0} {}
additional_special_tokens = {list: 0} []
additional_special_tokens_ids = {list: 0} []
all_special_ids = {list: 1} [None]
all_special_tokens = {list: 1} ['<|endoftext|>']
bos_token = {str} '<|endoftext|>'
bos_token_id = {NoneType} None
bpe_ranks = {dict: 49996} 
byte_encoder = {dict: 256} 
cache = {dict: 0} {}
cls_token = {NoneType} None
cls_token_id = {NoneType} None
decoder = {dict: 50257} {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>', 5: '!', 6: '"', ...
eos_token = {str} '<|endoftext|>'
eos_token_id = {NoneType} None
errors = {str} 'replace'
init_inputs = {tuple: 0} ()
init_kwargs = {dict: 2} {'vocab_file': '/home/superuser/khovrichev/ru-gpts/ckpt/gpt3_medium/vocab.json', 'merges_file': '/home/superuser/khovrichev/ru-gpts/ckpt/gpt3_medium/merges.txt'}
mask_token = {NoneType} None
mask_token_id = {NoneType} None
max_len = {int} 1000000000000
max_len_sentences_pair = {int} 1000000000000
max_len_single_sentence = {int} 1000000000000
max_model_input_sizes = {dict: 5} {'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
model_input_names = {list: 2} ['token_type_ids', 'attention_mask']
pad_token = {NoneType} None
pad_token_id = {NoneType} None
pad_token_type_id = {int} 0
padding_side = {str} 'right'
pat = {Pattern} regex.Regex("'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
pretrained_init_configuration = {dict: 0} {}
pretrained_vocab_files_map = {dict: 2} {'vocab_file': {'gpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json', 'gpt2-medium': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json', 'gpt2-large': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json', 'gpt2-xl': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json', 'distilgpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json'}, 'merges_file': {'gpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt', 'gpt2-medium': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt', 'gpt2-large': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt', 'gpt2-xl': 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt', 'distilgpt2': 'https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt'}}
sep_token = {NoneType} None
sep_token_id = {NoneType} None
special_tokens_map = {dict: 3} {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
unique_added_tokens_encoder = {set: 1} {'<|endoftext|>'}
unk_token = {str} '<|endoftext|>'
unk_token_id = {NoneType} None
vocab_files_names = {dict: 2} {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
vocab_size = {int} 50257

The tokenizer attributes in 3.4.0, when using the pipeline, are:


PreTrainedTokenizer(name_or_path='sberbank-ai/rugpt3small_based_on_gpt2', vocab_size=50257, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)})

SPECIAL_TOKENS_ATTRIBUTES = {list: 8} ['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
add_prefix_space = {bool} False
added_tokens_decoder = {dict: 0} {}
added_tokens_encoder = {dict: 0} {}
additional_special_tokens = {list: 0} []
additional_special_tokens_ids = {list: 0} []
all_special_ids = {list: 3} [None, None, None]
all_special_tokens = {list: 3} ['<|endoftext|>', '<|endoftext|>', '<|endoftext|>']
all_special_tokens_extended = {list: 3} [AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)]
bos_token = {str} '<|endoftext|>'
bos_token_id = {NoneType} None
bpe_ranks = {dict: 49996} 
byte_decoder = {dict: 256} 
byte_encoder = {dict: 256} 
cache = {dict: 2} 
cls_token = {NoneType} None
cls_token_id = {NoneType} None
decoder = {dict: 50257} {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>', 5: '!', 6: '"',  ...
deprecation_warnings = {dict: 0} {}
encoder = {dict: 50257} {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, '<mask>': 4, '!': 5, '"': 6, '#': ...
eos_token = {str} '<|endoftext|>'
eos_token_id = {NoneType} None
errors = {str} 'replace'
init_inputs = {tuple: 0} ()
init_kwargs = {dict: 8} {'errors': 'replace', 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'add_prefix_space': False, 'special_tokens_map_file': None, 'tokenizer_file': None, 'name_or_path': 'sberbank-ai/rugpt3small_based_on_gpt2'}
is_fast = {bool} False
mask_token = {NoneType} None
mask_token_id = {NoneType} None
max_len = {int} 1000000000000000019884624838656
max_len_sentences_pair = {int} 1000000000000000019884624838656
max_len_single_sentence = {int} 1000000000000000019884624838656
max_model_input_sizes = {dict: 5} {'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
model_input_names = {list: 1} ['attention_mask']
model_max_length = {int} 1000000000000000019884624838656
name_or_path = {str} 'sberbank-ai/rugpt3small_based_on_gpt2'
pad_token = {NoneType} None
pad_token_id = {NoneType} None
pad_token_type_id = {int} 0
padding_side = {str} 'right'
pat = {Pattern} regex.Regex("'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
pretrained_init_configuration = {dict: 0} {}
pretrained_vocab_files_map = {dict: 2} {'vocab_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/vocab.json', 'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/vocab.json', 'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/vocab.json', 'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/vocab.json', 'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/vocab.json'}, 'merges_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/merges.txt', 'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/merges.txt', 'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/merges.txt', 'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/merges.txt', 'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/merges.txt'}}
sep_token = {NoneType} None
sep_token_id = {NoneType} None
slow_tokenizer_class = {NoneType} None
special_tokens_map = {dict: 3} {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
special_tokens_map_extended = {dict: 3} {'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}
unique_no_split_tokens = {list: 1} ['<|endoftext|>']
unk_token = {str} '<|endoftext|>'
unk_token_id = {NoneType} None
verbose = {bool} True
vocab_files_names = {dict: 2} {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
vocab_size = {int} 50257

king-menin commented 3 years ago

The model was trained without this token; you should add an eos token and fine-tune the model. Also, we will release models trained with an eos token.

mikhovr commented 3 years ago

Thank you! I'll add the eos token manually.
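
Something along these lines, as a rough sketch against the 3.4.0 API (untested against this checkpoint); it reuses '</s>', which is already in vocab.json at id 2, so no embedding resize is needed:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "sberbank-ai/rugpt3small_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

num_added = tokenizer.add_special_tokens({"eos_token": "</s>"})
print(num_added)                # 0, because '</s>' was already in the vocab
print(tokenizer.eos_token_id)   # 2

# If a genuinely new eos token were added instead, the embeddings would need resizing,
# e.g. model.resize_token_embeddings(len(tokenizer)), followed by fine-tuning as suggested above.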