Closed: KoichiYasuoka closed this issue 3 years ago
Thank you very much @KoichiYasuoka for reporting the bug!
I have just confirmed the wrong behavior when loading the tokenizer with AutoTokenizer.
I will update the config file shortly.
I have just updated the config files for the newly-released models.
Now the tokenizers can be loaded with AutoTokenizer correctly.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')
>>> tokenizer.tokenize("神奈川県民が選ぶ県内の“住みたい街ランキング”において、2020年は海老名が横浜に次ぐ第2位にランクイン。")
['神奈川', '県民', 'が', '選ぶ', '県', '内', 'の', '“', '住み', 'たい', '街', 'ランキング', '”', 'に', 'おい', 'て', '、', '2020', '年', 'は', '海老', '##名', ' が', '横浜', 'に', '次ぐ', '第', '2', '位', 'に', 'ランク', 'イン', '。']
>>> tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-v2')
>>> tokenizer.tokenize("神奈川県民が選ぶ県内の“住みたい街ランキング”において、2020年は海老名が横浜に次ぐ第2位にランクイン。")
['神', '奈', '川', '県', '民', 'が', '選', 'ぶ', '県', '内', 'の', '“', '住', 'み', 'た', 'い', '街', 'ラ', 'ン', 'キ', 'ン', 'グ', '”', 'に', 'お', 'い', 'て', '、', '2', '0', '2', '0', '年', 'は', '海', '老', '名', 'が', '横', '浜', 'に', '次', 'ぐ', '第', '2', '位', 'に', 'ラ', 'ン', 'ク', 'イ', 'ン', '。']
Thank you again for your comments, @KoichiYasuoka!
Thank you @singletongue. I've just confirmed that the four models (bert-large-japanese, bert-large-japanese-char, bert-base-japanese-v2, and bert-base-japanese-char-v2) work well. Thank you again for your quick response; I am closing this issue.
Thank you @singletongue for releasing new BERT models at Hugging Face, but their config.json does not specify the tokenizer class, so Transformers' AutoTokenizer will use BertTokenizerFast. Please compare the new config.json with the old one, and please see the blog post here (written in Japanese).
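For anyone comparing the old and new config.json by hand, here is a minimal sketch of the check. It assumes the relevant key is named "tokenizer_class" and that the expected value is "BertJapaneseTokenizer"; both are my assumptions, not quoted from the actual files. The two config strings and the helper name declared_tokenizer_class are hypothetical and trimmed down for illustration only.

```python
import json

# Assumption: "tokenizer_class" is the config.json key that names the tokenizer class.
TOKENIZER_KEY = "tokenizer_class"


def declared_tokenizer_class(config_text):
    """Return the tokenizer class named in a config.json string, or None if absent."""
    config = json.loads(config_text)
    return config.get(TOKENIZER_KEY)


# Hypothetical, trimmed-down configs (not the real files from the model repositories).
config_without = '{"model_type": "bert"}'
config_with = '{"model_type": "bert", "tokenizer_class": "BertJapaneseTokenizer"}'

for label, text in [("without key", config_without), ("with key", config_with)]:
    found = declared_tokenizer_class(text)
    if found is None:
        print(label + ": no tokenizer class declared, a default would be chosen")
    else:
        print(label + ": declares " + found)
```

Running the same helper on the downloaded config.json of each model should show whether the entry is present.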