VinAIResearch / PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
MIT License

Non-consecutive added token '{token}' found. #21

Closed: chicuong209 closed this issue 4 years ago

chicuong209 commented 4 years ago

As the title says, I get the error below when using PhoBertTokenizer for a Vietnamese question answering task. Could you please help me fix it? Thank you.

f"Non-consecutive added token '{token}' found. "
AssertionError: Non-consecutive added token '<mask>' found. Should have index 5 but has index 64000 in saved vocabulary.

By the way, I tried setting self.encoder[self.mask_token] = 4, and training then runs normally, but that doesn't seem like the right fix.
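For context, the wording of the assertion suggests a check that each added token's saved index continues directly from the current vocabulary size. The sketch below is a simplified, hypothetical re-creation of that kind of check, not the actual transformers code; `check_added_tokens` and its arguments are names invented for illustration, and the numbers are taken from the error message above.

```python
def check_added_tokens(base_vocab_size, added_tokens):
    """Verify that added-token indices continue directly after the base vocabulary.

    Hypothetical helper for illustration only; not part of transformers.
    """
    next_index = base_vocab_size
    # Walk the added tokens in order of their saved index.
    for token, index in sorted(added_tokens.items(), key=lambda item: item[1]):
        assert index == next_index, (
            f"Non-consecutive added token '{token}' found. "
            f"Should have index {next_index} but has index {index} in saved vocabulary."
        )
        next_index += 1
    return next_index


# Consistent case: <mask> is saved at the index right after the base vocabulary.
print(check_added_tokens(64000, {"<mask>": 64000}))

# Mismatch resembling the report: only 5 entries are known when <mask>
# (saved at index 64000) is checked.
try:
    check_added_tokens(5, {"<mask>": 64000})
except AssertionError as err:
    print(err)
```

Under this reading, the failure means the reloaded base vocabulary was much smaller than the one the added token was saved against, which would also explain why hard-coding an index only hides the mismatch.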

datquocnguyen commented 4 years ago

Could you please provide more details (data, scripts, and so on, as much as you can)? Did you perhaps save a dictionary and then try to reload it?

chicuong209 commented 4 years ago

  1. The data is the SQuAD v1.1 dataset translated into Vietnamese. I use run_squad.py from the Hugging Face examples, but I call PhoBertConfig, PhoBertTokenizer and PhoBertModelForQuestionAnswering directly instead of using AutoConfig, AutoTokenizer and AutoModelForQuestionAnswering.
  2. Yes, the dictionary is saved and reloaded.

datquocnguyen commented 4 years ago

Then you should skip step 2 (saving and reloading the dictionary). Download the dictionary and BPE files from https://huggingface.co/vinai/phobert-base#list-files and load the tokenizer using:

tokenizer = PhoBertTokenizer(path-to-dictionary-file, path-to-bpe-file)
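A minimal sketch of that loading step, under stated assumptions: `load_tokenizer_from_files` is a hypothetical helper, and the default file names `vocab.txt` and `bpe.codes` are assumptions about what the download contains. The tokenizer class is passed in as an argument because its exact name may differ between transformers versions.

```python
from pathlib import Path


def load_tokenizer_from_files(tokenizer_cls, model_dir,
                              vocab_name="vocab.txt", bpe_name="bpe.codes"):
    """Instantiate tokenizer_cls(vocab_path, bpe_path) from files in model_dir.

    vocab_name and bpe_name are assumed file names; adjust them to your download.
    """
    vocab_path = Path(model_dir) / vocab_name
    bpe_path = Path(model_dir) / bpe_name
    # Fail early with a clear message instead of a tokenizer-internal error.
    for path in (vocab_path, bpe_path):
        if not path.is_file():
            raise FileNotFoundError(f"Expected file not found: {path}")
    return tokenizer_cls(str(vocab_path), str(bpe_path))


# Usage (requires transformers and the downloaded files; the class is called
# PhoBertTokenizer in this thread and may be named differently in your version):
# from transformers import PhobertTokenizer
# tokenizer = load_tokenizer_from_files(PhobertTokenizer, "path/to/phobert-base")
```

Building the tokenizer from the original files each time avoids the save-and-reload round trip that triggered the assertion.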

datquocnguyen commented 4 years ago

Why would you need to save and reload the dictionary? That's rather unusual :|

chicuong209 commented 4 years ago

OK, I'll try it now.

kvt0012 commented 4 years ago

I have a similar issue. The PhoBERT model loads fine, but the tokenizer is not found. The error is below:

OSError: Model name 'vinai/phobert-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/phobert-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

I think many people will run into this issue, so I'm posting it here :D Thanks for your kind response :D

datquocnguyen commented 4 years ago

Please install transformers from its latest source:

git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install --upgrade .

Also remove your transformers folder in ~/.cache/torch so that PhoBERT is automatically re-downloaded properly. It should then work.
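The cache-clearing step can also be scripted. The sketch below is an assumption-laden illustration: `clear_transformers_cache` is a hypothetical helper, and the default location `~/.cache/torch/transformers` is inferred from the comment above; other transformers versions may cache elsewhere.

```python
import shutil
from pathlib import Path


def clear_transformers_cache(cache_root=None):
    """Delete the cached transformers folder so models are downloaded again.

    Defaults to ~/.cache/torch, the location mentioned in this thread;
    returns True if a folder was removed, False if there was nothing to remove.
    """
    if cache_root is None:
        cache_root = Path.home() / ".cache" / "torch"
    target = Path(cache_root) / "transformers"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Usage (deletes cached files, so it is left commented out here):
# clear_transformers_cache()
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
```

After the folder is removed, the next `from_pretrained` call has to fetch the files again, which is the re-download the comment above relies on.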

datquocnguyen commented 4 years ago

@chicuong209, if you run into any problem, you might want to follow the instructions above. I'm pretty sure PhoBERT will then work without any loading issue.

kvt0012 commented 4 years ago

Thank you, it works for me.

kvt0012 commented 4 years ago

Thank you, it works for me. The problem was that I assumed PhoBERT would be downloaded automatically when I installed transformers with pip.