ai-forever / ru-gpts

Russian GPT3 models.

Tokenizer AssertionError while loading data #32

Closed Kepler-Br closed 3 years ago

Kepler-Br commented 3 years ago

Hello! I was trying to fine-tune your model on Google Colab, but I stumbled upon a "curious" error:

Traceback (most recent call last):
  File "pretrain_megatron.py", line 714, in <module>
    main()
  File "pretrain_megatron.py", line 659, in main
    args.eod_token = get_train_val_test_data(args)
  File "pretrain_megatron.py", line 594, in get_train_val_test_data
    args)
  File "/content/ru-gpts/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/content/ru-gpts/configure_data.py", line 170, in make_loaders
    train, tokenizer = data_utils.make_dataset(**data_set_args)
  File "/content/ru-gpts/data_utils/__init__.py", line 101, in make_dataset
    pad_token, character_converage, **kwargs)
  File "/content/ru-gpts/data_utils/tokenization.py", line 43, in make_tokenizer
    return GPT2BPETokenizer(model_path=model_path, **kwargs)
  File "/content/ru-gpts/data_utils/tokenization.py", line 823, in __init__
    self.text_tokenizer = GPT2Tokenizer.from_pretrained(model_path, cache_dir=cache_dir)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 544, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_gpt2.py", line 149, in __init__
    super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 337, in __init__
    assert isinstance(value, str)
AssertionError

Here are my arguments:

       --train-data /content/ru-gpts/data/train1.jsonl \
       --valid-data /content/ru-gpts/data/valid.jsonl \
       --test-data /content/ru-gpts/data/valid.jsonl \
       --save /content/ru-gpts/checkpoints/checkpoints_${now}_${host} \
       --load /content/ru-gpts/gpt3model \
       --save-interval 500 \
       --eval-interval 500 \
       --log-interval 100 \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 24 \
       --hidden-size 1536 \
       --num-attention-heads 16 \
       --seq-length 2048 \
       --max-position-embeddings 2048 \
       --vocab-size 50257 \
       --batch-size 1 \
       --train-iters 200000 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --lazy-loader \
       --checkpoint-activations \
       --loose-json \
       --text-key text \
       --tokenizer-path /content/ru-gpts/gpt3model \
       --tokenizer-type GPT2BPETokenizer \
       --finetune

I got your model from the archive on Google Drive. What's wrong with the tokenizer? It's GPT3Large.
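
For reference, the failure should be reproducible without starting a training run by loading the tokenizer in isolation with the same transformers version the notebook installs. A minimal check (the path is just the --tokenizer-path from the arguments above):

    from transformers import GPT2Tokenizer

    # This is the same GPT2Tokenizer.from_pretrained call the traceback goes
    # through; if it raises the AssertionError here as well, the problem is in
    # the files inside the model archive, not in the training flags.
    tokenizer = GPT2Tokenizer.from_pretrained("/content/ru-gpts/gpt3model")
    print(tokenizer.encode("Проверка токенизатора"))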

Kepler-Br commented 3 years ago

I've used unidecode to convert non-English/Russian words into ASCII.
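
Roughly what I mean, as a sketch: it assumes the --loose-json / --text-key text layout from the arguments above (one JSON object per line, document under "text"); the output file name is just an example.

    import json
    from unidecode import unidecode

    # Keep ASCII and Cyrillic characters as they are; transliterate everything
    # else to ASCII with unidecode.
    def keep(ch):
        return ord(ch) < 128 or "а" <= ch.lower() <= "я" or ch.lower() == "ё"

    with open("/content/ru-gpts/data/train1.jsonl") as src, \
         open("/content/ru-gpts/data/train1.clean.jsonl", "w") as dst:
        for line in src:
            record = json.loads(line)
            record["text"] = "".join(
                ch if keep(ch) else unidecode(ch) for ch in record["text"]
            )
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")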

Kepler-Br commented 3 years ago

That's funny. You should remove special_tokens to prevent this error from happening.
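
One way to "remove special_tokens", assuming the non-string special-token values come from special_tokens_map.json (or tokenizer_config.json) inside the downloaded model directory, which is what the assert isinstance(value, str) in transformers' tokenization_utils.py rejects, is to keep only plain-string entries before loading. A rough sketch, file name and path assumed:

    import json
    import os

    # Assumption: the AssertionError comes from special-token entries in the
    # model directory that are not plain strings (e.g. serialized as dicts by
    # a newer transformers release). Dropping them lets the
    # assert isinstance(value, str) in tokenization_utils.py pass.
    model_dir = "/content/ru-gpts/gpt3model"  # same directory as --tokenizer-path
    path = os.path.join(model_dir, "special_tokens_map.json")

    if os.path.exists(path):
        with open(path) as f:
            tokens = json.load(f)
        cleaned = {k: v for k, v in tokens.items() if isinstance(v, str)}
        with open(path, "w") as f:
            json.dump(cleaned, f, ensure_ascii=False)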