AutoTokenizer.from_pretrained doesn't work on newer models

KoichiYasuoka commented 3 years ago

Thank you @singletongue for releasing the new BERT models on Hugging Face, but their config.json files do not include

  "tokenizer_class": "BertJapaneseTokenizer",

so Transformers' AutoTokenizer will use BertTokenizerFast instead. Please compare the new config.json with the old one, and see the blog post here (written in Japanese).
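The missing key described above can be checked without loading any model. The following is a minimal sketch using only the standard library; the two config strings are hypothetical, trimmed-down stand-ins for the real config.json files, and the helper names `declared_tokenizer_class` and `check_config` are illustrative and not part of Transformers.

```python
import json

# Hypothetical, trimmed-down config.json contents for illustration only;
# the real files contain many more keys.
OLD_STYLE_CONFIG = '{"model_type": "bert", "tokenizer_class": "BertJapaneseTokenizer"}'
NEW_STYLE_CONFIG = '{"model_type": "bert"}'


def declared_tokenizer_class(config_text):
    """Return the "tokenizer_class" value from config.json text, or None if absent."""
    config = json.loads(config_text)
    return config.get("tokenizer_class")


def check_config(config_text, expected="BertJapaneseTokenizer"):
    """Return True when the config explicitly names the expected tokenizer class."""
    return declared_tokenizer_class(config_text) == expected


if __name__ == "__main__":
    for label, text in [("old", OLD_STYLE_CONFIG), ("new", NEW_STYLE_CONFIG)]:
        print(label, declared_tokenizer_class(text), check_config(text))
```

A config that fails this check is one where AutoTokenizer has no explicit tokenizer class to go on. Loading the class directly, for example with `BertJapaneseTokenizer.from_pretrained(...)`, should sidestep the lookup, though that workaround was not tested in this thread.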

singletongue commented 3 years ago

Thank you very much @KoichiYasuoka for reporting the bug! I have just confirmed the incorrect behavior when loading the tokenizer with AutoTokenizer. I will update the config files shortly.

singletongue commented 3 years ago

I have just updated the config files for the newly released models. The tokenizers can now be loaded correctly with AutoTokenizer.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')
>>> tokenizer.tokenize("神奈川県民が選ぶ県内の“住みたい街ランキング”において、2020年は海老名が横浜に次ぐ第2位にランクイン。")
['神奈川', '県民', 'が', '選ぶ', '県', '内', 'の', '“', '住み', 'たい', '街', 'ランキング', '”', 'に', 'おい', 'て', '、', '2020', '年', 'は', '海老', '##名', 'が', '横浜', 'に', '次ぐ', '第', '2', '位', 'に', 'ランク', 'イン', '。']
>>> tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-v2')
>>> tokenizer.tokenize("神奈川県民が選ぶ県内の“住みたい街ランキング”において、2020年は海老名が横浜に次ぐ第2位にランクイン。")
['神', '奈', '川', '県', '民', 'が', '選', 'ぶ', '県', '内', 'の', '“', '住', 'み', 'た', 'い', '街', 'ラ', 'ン', 'キ', 'ン', 'グ', '”', 'に', 'お', 'い', 'て', '、', '2', '0', '2', '0', '年', 'は', '海', '老', '名', 'が', '横', '浜', 'に', '次', 'ぐ', '第', '2', '位', 'に', 'ラ', 'ン', 'ク', 'イ', 'ン', '。']

Thank you again for your comments, @KoichiYasuoka!

KoichiYasuoka commented 3 years ago

Thank you @singletongue. I've just confirmed that the four models (bert-large-japanese, bert-large-japanese-char, bert-base-japanese-v2, and bert-base-japanese-char-v2) work well. Thank you again for your quick response; I'll close this issue.

cl-tohoku / bert-japanese

AutoTokenizer.from_pretrained doesn't work on newer models #24