KoichiYasuoka / UD-Chinese

Tokenizer POS-tagger and Dependency-parser for Chinese (简体/繁體/文言文)
MIT License

By-pass the tokenizer #1

Open wongtaksum opened 2 years ago

wongtaksum commented 2 years ago

Thank you for creating the tool for public use! I found that the tokenizer does not work well on some occasions. Is there any way to give pre-tokenized (delimited) input directly to your POS tagger and dependency parser and by-pass your tokenizer?

KoichiYasuoka commented 2 years ago

By-pass the tokenizer... Well, you can do that by using the ufal.udpipe module directly and feeding it space-delimited tokens as CoNLL-U input:

>>> import os,ufal.udpipe,udchinese.udchinese
>>> m=ufal.udpipe.Model.load(os.path.join(udchinese.udchinese.PACKAGE_DIR,"ud-chinese.udpipe"))
>>> udpipe=ufal.udpipe.Pipeline(m,"conllu","","","")
>>> nlp=lambda x:udpipe.process("\n".join("\t".join([str(i+1),j]+["_"]*8) for i,j in enumerate(x.split()))+"\n\n")
>>> doc=nlp("不 入 虎穴 不 得 虎子")
>>> print(doc)
1   不   不   ADV v,副詞,否定,無界  Polarity=Neg    2   advmod  _   _
2   入   入   VERB    v,動詞,行為,移動  _   0   root    _   _
3   虎穴  虎穴  NOUN    n,名詞,固定物,地形 _   2   obj _   _
4   不   不   ADV v,副詞,否定,無界  Polarity=Neg    5   advmod  _   _
5   得   得   VERB    v,動詞,行為,得失  _   2   parataxis   _   _
6   虎子  虎子  NOUN    n,名詞,人,人    _   5   obj _   _

But by-passing the tokenizer this way seems to cause several bad side effects...
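For readers following the snippet above, the dense part is the lambda that turns space-delimited text into CoNLL-U input. The sketch below spells out only that formatting step, without loading any model. The function name `to_conllu` is hypothetical and not part of udchinese or ufal.udpipe; it assumes one sentence per call and tokens separated by whitespace, as in the original example.

```python
def to_conllu(text):
    """Format whitespace-delimited tokens as a minimal one-sentence CoNLL-U block.

    Hypothetical helper for illustration; it mirrors the lambda in the
    comment above and is not an API of udchinese or ufal.udpipe.
    """
    lines = []
    for index, token in enumerate(text.split(), start=1):
        # Column 1 is the token ID, column 2 the surface form; the remaining
        # eight columns (LEMMA through MISC) are left as "_" placeholders.
        lines.append("\t".join([str(index), token] + ["_"] * 8))
    # A blank line terminates the sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Show the raw input that would be handed to the pipeline.
    print(to_conllu("不 入 虎穴 不 得 虎子"), end="")
```

The returned string should be usable wherever the original lambda built its argument, for example `udpipe.process(to_conllu("不 入 虎穴 不 得 虎子"))` with the `udpipe` pipeline object created in the session above.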