Hi!
I used the pretrained model from GDrive. An AssertionError is raised unless special_tokens_map.json is removed from the pretrained model directory: the assertion expects bos_token to be a string, but it receives a dict loaded from that JSON file.
bash ./scripts/generate_ruGPT3Large.sh
2020-11-04 22:38:45.094336: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils - Model name '/storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming '/storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2' is a path, a model identifier, or url to a directory containing tokenizer files.
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils - Didn't find file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/added_tokens.json. We won't load it.
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils - loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/vocab.json
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils - loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/merges.txt
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils - loading file None
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils - loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/special_tokens_map.json
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils - loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/tokenizer_config.json
Traceback (most recent call last):
  File "generate_transformers.py", line 269, in <module>
    main()
  File "generate_transformers.py", line 203, in main
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 545, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_gpt2.py", line 149, in __init__
    super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 337, in __init__
    assert isinstance(value, str), f'key: {key}, value: {value}'
AssertionError: key: bos_token, value: {'content': '<|endoftext|>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}
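As an alternative to deleting special_tokens_map.json, here is a minimal, untested sketch of a workaround, assuming each dict-style entry carries the actual token string under the "content" key (as the error message suggests). It flattens the entries to plain strings, which is the form this version of transformers' tokenization_utils.py expects:

import json
from pathlib import Path

# Hypothetical path; adjust to your checkpoint directory.
model_dir = Path("/storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2")
path = model_dir / "special_tokens_map.json"

special_tokens = json.loads(path.read_text(encoding="utf-8"))

# Replace dict-style entries like
# {"content": "<|endoftext|>", "single_word": false, ...}
# with the plain string the older tokenizer code expects.
flattened = {
    key: (value["content"] if isinstance(value, dict) else value)
    for key, value in special_tokens.items()
}

path.write_text(json.dumps(flattened, ensure_ascii=False), encoding="utf-8")
print(flattened)

After rewriting the file this way, tokenizer_class.from_pretrained(...) should no longer hit the isinstance(value, str) assertion.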