ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0
2.08k stars 444 forks

AssertionError raised when generating with GPT3Large #26

Closed pavelbatyr closed 3 years ago

pavelbatyr commented 3 years ago

Hi! I used the pretrained model from Google Drive. An AssertionError is raised unless special_tokens_map.json is removed from the pretrained model directory: the tokenizer's assertion expects bos_token to be a string, but in that JSON file it is stored as a dict.

bash ./scripts/generate_ruGPT3Large.sh
2020-11-04 22:38:45.094336: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   Model name '/storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming '/storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2' is a path, a model identifier, or url to a directory containing tokenizer files.
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   Didn't find file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/added_tokens.json. We won't load it.
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/vocab.json
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/merges.txt
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file None
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/special_tokens_map.json
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/tokenizer_config.json
Traceback (most recent call last):
  File "generate_transformers.py", line 269, in <module>
    main()
  File "generate_transformers.py", line 203, in main
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 545, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_gpt2.py", line 149, in __init__
    super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 337, in __init__
    assert isinstance(value, str), f'key: {key}, value: {value}'
AssertionError: key: bos_token, value: {'content': '<|endoftext|>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}
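An alternative to deleting special_tokens_map.json is to rewrite it so that dict-valued entries (serialized AddedToken objects) become plain strings, which is the shape this version of transformers asserts on. This is a hedged sketch; the helper name is made up for illustration:

```python
import json

def flatten_special_tokens_map(path):
    """Rewrite special_tokens_map.json in place so that dict-valued
    entries such as {'content': '<|endoftext|>', ...} become the plain
    string '<|endoftext|>', as older transformers tokenizers expect."""
    with open(path, encoding="utf-8") as f:
        tokens = json.load(f)
    flat = {key: value["content"] if isinstance(value, dict) else value
            for key, value in tokens.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(flat, f, ensure_ascii=False, indent=2)
    return flat
```

After running this on the checkpoint directory's special_tokens_map.json, the from_pretrained call should no longer hit the assertion.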
king-menin commented 3 years ago

Thank you for the issue. We will remove this file from the archive. Alternatively, you can load the models via Hugging Face without using the archive.