VinAIResearch / PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
MIT License

word_ids() is not available when using Python-based tokenizers #40

Closed Nguyendat-bit closed 2 years ago

Nguyendat-bit commented 2 years ago

Hi, I'm getting this error now. Is there any way to fix it? [screenshot of the error attached]

datquocnguyen commented 2 years ago

"tokenizers" likely makes use of a fast variant. And unfortunately, PhoBERTtokenizer currently does not support it atm. See: https://github.com/huggingface/transformers/pull/13788#pullrequestreview-771521131

datquocnguyen commented 2 years ago

@Nguyendat-bit Merging a fast tokenizer for PhoBERT is under discussion, as detailed in https://github.com/huggingface/transformers/pull/17254#issuecomment-1133932067. While waiting for that pull request to be approved, if you would like to experiment with the fast tokenizer, you can install transformers as follows:

git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
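For context on what `word_ids()` is expected to return once a fast tokenizer is available: it maps each subword token back to the index of the word it came from, with special tokens mapping to None. A minimal pure-Python sketch of that mapping (a hypothetical helper for illustration, not the transformers API), assuming PhoBERT's BPE convention where a token ending in "@@" is continued by the next token:

```python
def word_ids_from_subwords(subwords):
    """Map each subword token to the index of its source word.

    Assumes BPE continuation markers '@@' (as used by PhoBERT):
    a token ending in '@@' is continued by the following token.
    Special tokens <s> and </s> map to None.
    """
    ids = []
    word_idx = -1       # index of the current word
    continuing = False  # True while inside a split word
    for tok in subwords:
        if tok in ("<s>", "</s>"):
            ids.append(None)
            continuing = False
            continue
        if not continuing:
            word_idx += 1  # this token starts a new word
        ids.append(word_idx)
        continuing = tok.endswith("@@")
    return ids

# Example: "Tô@@" and "i" are subwords of one word, so both map to 0.
tokens = ["<s>", "Tô@@", "i", "là", "sinh_viên", "</s>"]
print(word_ids_from_subwords(tokens))  # [None, 0, 0, 1, 2, None]
```

With a slow (Python-based) tokenizer this mapping is not tracked during encoding, which is why `BatchEncoding.word_ids()` raises the error in the screenshot; the fast (Rust-backed) tokenizer records it natively.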
Nguyendat-bit commented 2 years ago

@datquocnguyen Great, thank you so much