Ethan-yt / guwenbert

GuwenBERT: 古文预训练语言模型(古文BERT) A Pre-trained Language Model for Classical Chinese (Literary Chinese)

Tokenizer behaving unexpectedly #25

Open Lizi-12 opened 4 months ago

Lizi-12 commented 4 months ago

I loaded guwenbert from Hugging Face, but the tokenization result simply splits a sentence into individual Chinese characters. I'd like to know whether this is expected behavior. Thanks!

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同,独坐闲居意颇慵。入夏驱驰巢树鹊,经春劳役探花蜂。石炉香尽寒灰薄,铁磬声微古锈浓。寂寂虚怀无一念,任从苍藓没行踪。'

# Tokenize, then map each token to its vocabulary id
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```

Result:

```
['贪', '生', '养', '命', '事', '皆', '同', ',', '独', '坐', '闲', '居', '意', '颇', '慵', '。', '入', '夏', '驱', '驰', '巢', '树', '鹊', ',', '经', '春', '劳', '役', '探', '花', '蜂', '。', '石', '炉', '香', '尽', '寒', '灰', '薄', ',', '铁', '磬', '声', '微', '古', '锈', '浓', '。', '寂', '寂', '虚', '怀', '无', '一', '念', ',', '任', '从', '苍', '藓', '没', '行', '踪', '。']

[1225, 38, 546, 190, 42, 94, 105, 5, 427, 424, 819, 231, 181, 1251, 4388, 4, 106, 452, 1571, 1367, 1779, 666, 2659, 5, 124, 224, 771, 980, 1806, 278, 2740, 4, 198, 2090, 389, 255, 353, 1864, 965, 5, 1148, 2761, 243, 547, 202, 7507, 2072, 4, 1185, 1185, 373, 843, 18, 10, 480, 5, 347, 122, 1155, 4338, 833, 49, 2353, 4]
```
Ethan-yt commented 1 week ago

Hi, this is normal. GuwenBERT's tokenizer works at the character level, so each Chinese character becomes one token.
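
As a quick sanity check (a minimal sketch, assuming the same `transformers` environment as in the question), the high-level `tokenizer(...)` call gives the same per-character ids, plus whatever special tokens the model adds at the start and end of the sequence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同,独坐闲居意颇慵。'

# The __call__ API encodes in one step and adds the model's special tokens,
# so the ids match convert_tokens_to_ids except at the sequence boundaries.
encoding = tokenizer(text)
print(encoding['input_ids'])
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
```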