SamuraiT / mecab-python3

mecab-python. You can find the original version here: http://taku910.github.io/mecab/
https://pypi.python.org/pypi/mecab-python3

Different output formats possible via constructor arguments? #99

Closed jzohrab closed 1 year ago

jzohrab commented 1 year ago

Hello, thank you very much for your work on this project. I'm using MeCab for a language-learning program, and would like to use this library if possible.

The mecab binary accepts arguments that change its output format. For example:

$ mecab -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n
太郎はこの本を女性に渡した。
太郎  2   44
は   6   16
この  6   68
本   2   38
を   6   13
女性  2   38
に   6   13
渡し  2   31
た   6   25
。   3   7
EOP 3   7

Is there a way to get the same output with this Python library? I tried the obvious approaches, e.g.

import MeCab
t = MeCab.Tagger('-F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n -r ./mecabrc_dummy.txt -d ./.venv/lib/python3.11/site-packages/unidic_lite/dicdir')   # also tried single \ instead of \\
sentence = "太郎はこの本を女性に渡した。"
print(t.parse(sentence))

but this still produces the default Tagger output:

$ python main.py 
太郎  タロー タロウ タロウ 名詞-固有名詞-人名-名            1
は   ワ   ハ   は   助詞-係助詞          
この  コノ  コノ  此の  連体詞         0
...
渡し  ワタシ ワタス 渡す  動詞-一般   五段-サ行   連用形-一般  0
た   タ   タ   た   助動詞 助動詞-タ   終止形-一般  
。           。   補助記号-句点         
EOS

I edited unidic_lite/dicdir/dicrc:

output-format-type = custom

; output custom - new three-column output
node-format-custom = %m\t%t\t%h\n
unk-format-custom  = %m\t%t\t%h\n
bos-format-custom  =
eos-format-custom  = EOP\t3\t7\n

With that, the output was more or less what I expected (the third column is different, but that doesn't matter):

$ python main.py 
太郎  2   1
は   6   1
この  6   1
本   2   1
...
た   6   1
。   3   1
EOP 3   7

I also tried unidic instead of unidic_lite:

t = MeCab.Tagger('-r ./mecabrc_dummy.txt -d ./.venv/lib/python3.11/site-packages/unidic/dicdir -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n')

and got the default unidic output:

太郎  名詞,固有名詞,人名,名,,,タロウ,タロウ,太郎,タロー,太郎,タロー,固,"","","","","","",名,タロウ,タロウ,タロウ,タロウ,"1","","",6252931250790912,22748
は   助詞,係助詞,,,,,ハ,は,は,ワ,は,ワ,和,"","","","","","",係助,ハ,ハ,ハ,ハ,"","動詞%F2@0,名詞%F1,形容詞%F2@-1","",8059703733133824,29321
この  連体詞,,,,,,コノ,此の,この,コノ,この,コノ,和,"","","","","","",相,コノ,コノ,コノ,コノ,"0","","",3547308012741120,12905
...
。   補助記号,句点,,,,,,。,。,,。,,記号,"","","","","","",補助,,,,,"","","",6880571302400,25
EOS

Thank you again!

polm commented 1 year ago

This is not possible because of a long-standing issue in MeCab that causes the UniDic config file to take precedence over the format arguments. Your command-line version only works because your config (presumably IPAdic) doesn't specify a default format. I made a PR to fix it six years ago but never received a response.

https://github.com/taku910/mecab/pull/38

However, instead of using MeCab's rather arcane format syntax, I suggest you use fugashi's structured Node objects to build the formatted output yourself; it should be much easier.

https://github.com/polm/fugashi
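
A minimal sketch of that suggestion, assuming fugashi and a UniDic dictionary such as unidic-lite are installed. The helper names `format_nodes` and `tokenize_and_format` are hypothetical, made up for illustration. It relies on the node attributes `surface`, `char_type` and `posid`, which should correspond to `%m`, `%t` and `%h` in MeCab's format syntax. The stand-in nodes at the bottom reuse two rows from the command-line output in the question, so the formatting can be checked without a dictionary.

```python
from types import SimpleNamespace


def format_nodes(nodes, eos_line="EOP\t3\t7"):
    """Render nodes as surface<TAB>char_type<TAB>posid lines, then a closing line."""
    lines = [f"{n.surface}\t{n.char_type}\t{n.posid}" for n in nodes]
    lines.append(eos_line)
    return "\n".join(lines)


def tokenize_and_format(text):
    """Tokenize with fugashi and format the result (hypothetical helper)."""
    # Imported here so format_nodes stays usable without fugashi installed.
    from fugashi import Tagger

    tagger = Tagger()  # assumption: picks up an installed UniDic dictionary
    return format_nodes(tagger(text))  # calling the tagger returns a list of nodes


# Stand-in nodes; the values are copied from the mecab output in the question.
demo_nodes = [
    SimpleNamespace(surface="太郎", char_type=2, posid=44),
    SimpleNamespace(surface="は", char_type=6, posid=16),
]
print(format_nodes(demo_nodes))
```

With a real dictionary you would call `tokenize_and_format("太郎はこの本を女性に渡した。")` instead. The ids in the third column depend on the dictionary, so they may differ from the IPAdic values shown in the question.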