fxsjy / jieba

Jieba Chinese text segmentation

English words are segmented incorrectly after adding custom dictionary entries #959

Open shenchenguang opened 2 years ago

shenchenguang commented 2 years ago

Hello! In cut_all=True mode, after adding the custom dictionary entries acarbose and arb, the word acarbose is segmented as acarboseose.

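import jieba

# Register two custom words; 'arb' also occurs inside 'acarbose'.
jieba.add_word('acarbose')
jieba.add_word('arb')
text = "rosiglitazone orlistat and acarbose have significant effects on the anthropometric indices in women with PCOS"
# Full mode, with the HMM for unknown words disabled.
r = jieba.lcut(text.lower(), cut_all=True, HMM=False)
# Deduplicate the tokens (order is not preserved).
r = list(set(r))
print(r)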

Segmentation result: ['indices', 'women', 'anthropometric', 'arb', ' ', 'with', 'acarboseose', 'orlistat', 'rosiglitazone', 'the', 'in', 'effects', 'have', 'pcos', 'on', 'significant', 'and']
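
The bad token can be caught mechanically, because every token a segmenter returns should be a contiguous substring of the input. A minimal sketch of that check, run against the text and result posted above; `tokens_not_in_text` is a hypothetical helper name, not part of jieba:

```python
def tokens_not_in_text(text, tokens):
    """Return the tokens that never occur as a contiguous substring of text."""
    return [tok for tok in tokens if tok not in text]


text = ("rosiglitazone orlistat and acarbose have significant effects "
        "on the anthropometric indices in women with PCOS").lower()

# Result reported in this issue (cut_all=True, HMM=False, deduplicated).
reported = ['indices', 'women', 'anthropometric', 'arb', ' ', 'with',
            'acarboseose', 'orlistat', 'rosiglitazone', 'the', 'in',
            'effects', 'have', 'pcos', 'on', 'significant', 'and']

bad_tokens = tokens_not_in_text(text, reported)
print(bad_tokens)  # ['acarboseose']
```

Only acarboseose is flagged. arb passes because it really does occur inside acarbose, which is what full mode is expected to report.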

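Also, in cut_all=False mode, after adding the custom dictionary entry arni, the word learning is broken up so that arni comes out as a separate token (le, arni, ng).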

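import jieba

# Register a custom word that also occurs inside 'learning'.
jieba.add_word('arni')
text = "Further learning of the hypoglycemic mechanism of SGLT2i besides the kidney can provide a new understanding for its application in the treatment of diabetes."
# Precise mode, with the HMM for unknown words disabled.
r = jieba.lcut(text.lower(), cut_all=False, HMM=False)
# Deduplicate the tokens (order is not preserved), as in the result shown.
r = list(set(r))
print(r)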

Segmentation result: ['besides', ' ', 'treatment', 'kidney', 'can', 'for', 'in', 'a', 'le', 'understanding', 'hypoglycemic', 'ng', 'arni', 'its', 'of', 'mechanism', 'application', 'provide', 'the', '.', 'sglt2i', 'new', 'diabetes', 'further']
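
For English-only input in precise mode, one would reasonably expect each alphanumeric token to be a whole word of the input; that expectation is an assumption of this sketch, not documented jieba behavior. Under that assumption, the fragments can be listed as follows; `split_word_fragments` is a hypothetical helper name:

```python
import re


def split_word_fragments(text, tokens):
    """Return alphanumeric tokens that are not whole words of text."""
    words = set(re.findall(r"[a-z0-9]+", text))
    return sorted(
        tok for tok in set(tokens)
        if re.fullmatch(r"[a-z0-9]+", tok) and tok not in words
    )


text = ("Further learning of the hypoglycemic mechanism of SGLT2i besides "
        "the kidney can provide a new understanding for its application "
        "in the treatment of diabetes.").lower()

# Result reported in this issue (cut_all=False, HMM=False).
reported = ['besides', ' ', 'treatment', 'kidney', 'can', 'for', 'in', 'a',
            'le', 'understanding', 'hypoglycemic', 'ng', 'arni', 'its', 'of',
            'mechanism', 'application', 'provide', 'the', '.', 'sglt2i',
            'new', 'diabetes', 'further']

fragments = split_word_fragments(text, reported)
print(fragments)  # ['arni', 'le', 'ng']
```

The three flagged tokens are exactly the pieces of learning, which confirms that the custom entry arni caused the word to be split.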

manother commented 2 years ago

Email received~

shouldsee commented 2 years ago

It looks like cut_all mode needs a careful analysis.

Hexa4C commented 2 years ago

Same issue here, waiting for an update.