delph-in / jacy

The Jacy Japanese Grammar
http://moin.delph-in.net/JacyTop

jpn2yy.py issues? #52

Closed goodmami closed 7 years ago

goodmami commented 7 years ago

I'm getting errors when trying to run jpn2yy.py:

I installed mecab-ipadic and libmecab-dev, then ran pip install mecab-python. The first thing I see is this:

goodmami@tpy:~/grammars/jacy$ python utils/jpn2yy.py < ex
Traceback (most recent call last):
  File "utils/jpn2yy.py", line 54, in <module>
    print(''.join(jp2yy(line.rstrip())).encode('utf-8'))
  File "utils/jpn2yy.py", line 28, in jp2yy
    (form, p, lemma, p1, p2, p3) = tok.decode('utf-8').split('\t')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 18-19: invalid continuation byte

Then if I adjust the file to avoid unicode errors, I get this:

goodmami@tpy:~/grammars/jacy$ python3 utils/jpn2yy.py < ../jaen/utils/extr-rules/jaen/ex
Traceback (most recent call last):
  File "utils/jpn2yy.py", line 44, in <module>
    print(''.join(jp2yy(line.rstrip())).encode('utf-8'))
  File "utils/jpn2yy.py", line 22, in jp2yy
    for tok in m.parse(sent.encode('utf-8')).split('\n'):
  File "/home/goodmami/.local/lib/python3.5/site-packages/MeCab.py", line 281, in parse
    def parse(self, *args): return _MeCab.Tagger_parse(self, *args)
NotImplementedError: Wrong number or type of arguments for overloaded function 'Tagger_parse'.
  Possible C/C++ prototypes are:
    MeCab::Tagger::parse(MeCab::Model const &,MeCab::Lattice *)
    MeCab::Tagger::parse(MeCab::Lattice *) const
    MeCab::Tagger::parse(char const *)

Did the MeCab API change?
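One plausible reading of the second traceback (an assumption, not checked against the package source) is that under Python 3 the SWIG wrapper wants a `str` for the `char const *` overload, while `sent.encode('utf-8')` hands it `bytes`. A minimal sketch of a version-aware helper follows; `mecab_input` is a hypothetical name, not something in jpn2yy.py:

```python
import sys

def mecab_input(sent):
    """Return `sent` in the form the tagger is assumed to accept.

    Assumption: Python 3 bindings take text (str), Python 2 bindings
    take UTF-8 encoded bytes.
    """
    if sys.version_info[0] >= 3:
        return sent                  # Python 3: pass str through unchanged
    return sent.encode('utf-8')      # Python 2: encode unicode to bytes

# Hypothetical usage, assuming `m` is a MeCab.Tagger instance:
#   for tok in m.parse(mecab_input(sent)).split('\n'):
#       ...
arg = mecab_input(u'バククが勉強する')
```

On Python 3 the output of `parse` would then also be `str`, so the later `tok.decode('utf-8')` call would need the same kind of guard.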

goodmami commented 7 years ago

Update: I got it working with Janome as the back-end, which is a Python reimplementation (no calling out to C++ code), but uses MeCab's IPADIC database. It's a lot nicer to work with, too. If jpn2yy.py doesn't work for you currently, let me know if I should check in this version.

If you want to keep using bindings to the original MeCab, there are a couple of alternatives to the mecab-python package: mecab-python3 and natto-py.

Considering the maintenance status of these packages, mecab-python was last updated about 2 years ago, mecab-python3 about 3 years ago, natto-py about 1 year ago, and janome about 1 month ago.

goodmami commented 7 years ago

Hmm... it looks like my original problem came from having mecab-ipadic installed and not mecab-ipadic-utf8. With the latter, it works fine with Python2, but not Python3.
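For illustration, here is a small Python 3 sketch of why a non-UTF-8 dictionary would produce a decode error. It assumes the plain mecab-ipadic package emits EUC-JP bytes, which is my understanding of that package but is not confirmed in this thread:

```python
# Assumption: the non-UTF-8 dictionary yields EUC-JP encoded output.
raw = 'が'.encode('euc_jp')    # two bytes, neither a valid UTF-8 start byte

try:
    raw.decode('utf-8')        # what jpn2yy.py effectively does with tagger output
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding with the matching codec recovers the text.
round_trip = raw.decode('euc_jp')
print(decoded_ok, round_trip)
```

The exact error message depends on which bytes are involved, so it need not match the traceback above word for word.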

So while we're on the subject, there are two remaining issues:

  • Should we switch to or add Python3 support?
  • The format of the pos info in jpn2yy.py is different from JaEn's ja2yy.py. Compare:

    jpn2yy: "名詞-固有名詞-地域-国:n-n"
    ja2yy: "名詞-固有名詞-地域-国+n-n"

    It's basically the use of : or + before the last two fields, but the logic for constructing that string is slightly different, too. Do you know how this should be done?

fcbond commented 7 years ago

It would be good to get Python 3 support. On my not so old Ubuntu (16.04.3), there is no mecab-python3 or janome package, which is why I had not done anything. I guess we could move to janome, ...

":" is correct, not "+", so ja2yy needs to be fixed (or perhaps removed; better to just have one script, in Jacy).

What do you think?

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 7 years ago

Janome is not available via apt; it is available on pip, which should work for 16.04 as well. The website says performance is "similar" to MeCab, since it uses the same dictionary, but I haven't compared.

Speaking of which, I noticed a number of differences between the tokenization from a current MeCab and what's stored in the Tanaka corpus. Some differences looked better, but there were some regressions as well. It makes me wonder how hard it is to tune MeCab (or whatever)'s model to the Jacy lexicon.

And finally, I agree that we should get rid of one of the scripts (probably the one in JaEn, since it's a Jacy-specific thing).

goodmami commented 7 years ago

Also, I found that Janome is quite a bit slower than the mecab-python package, especially with startup time.

Faster yet is calling mecab directly, though we can't get everything:

goodmami@tpy:~/grammars/jacy/utils$ echo -e "バククが勉強する" | mecab --node-format='(-1, -1, -1, <%ps:%pe>, 1, "%m", 0, "null", %F-[0,1,2]:%f[4]-%f[5] 1.0)\n'
(-1, -1, -1, <0:9>, 1, "バクク", 0, "null", 名詞-一般:- 1.0)
(-1, -1, -1, <9:12>, 1, "が", 0, "null", 助詞-格助詞-一般:- 1.0)
(-1, -1, -1, <12:18>, 1, "勉強", 0, "null", 名詞-サ変接続:- 1.0)
(-1, -1, -1, <18:24>, 1, "する", 0, "null", 動詞-自立:サ変・スル-基本形 1.0)
EOS

We could populate the ID, start, and end fields with awk or something. The cfrom/cto values here are byte positions, not characters. And the trailing part of the POS tag is :- when there are no values. Or we could have a script jpn2yy.sh that calls mecab and pipes the output to a Python script to clean up the rest. Then we don't need a Python package, and it should be plenty fast.
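A rough sketch of what that clean-up step might look like in Python, with no MeCab package needed. The function name `fix_tokens` and the numbering scheme (id = token index, start/end = consecutive lattice vertices) are assumptions for illustration, not what any checked-in script does; the POS field is passed through untouched:

```python
import re

# Matches the placeholder ids and the byte span produced by the
# --node-format template shown above.
TOKEN = re.compile(r'^\(-1, -1, -1, <(\d+):(\d+)>')

def fix_tokens(sentence, lines):
    """Fill in id/start/end and convert UTF-8 byte spans to character spans."""
    raw = sentence.encode('utf-8')
    out = []
    for line in lines:
        m = TOKEN.match(line)
        if m is None:              # skip non-token lines such as "EOS"
            continue
        bfrom, bto = int(m.group(1)), int(m.group(2))
        # The number of characters before a byte offset is the length
        # of the decoded byte prefix.
        cfrom = len(raw[:bfrom].decode('utf-8'))
        cto = len(raw[:bto].decode('utf-8'))
        i = len(out)               # assumed numbering: 0, 1, 2, ...
        head = '(%d, %d, %d, <%d:%d>' % (i, i, i + 1, cfrom, cto)
        out.append(head + line[m.end():])
    return out

sentence = 'バククが勉強する'
mecab_lines = [
    '(-1, -1, -1, <0:9>, 1, "バクク", 0, "null", 名詞-一般:- 1.0)',
    '(-1, -1, -1, <9:12>, 1, "が", 0, "null", 助詞-格助詞-一般:- 1.0)',
    '(-1, -1, -1, <12:18>, 1, "勉強", 0, "null", 名詞-サ変接続:- 1.0)',
    '(-1, -1, -1, <18:24>, 1, "する", 0, "null", 動詞-自立:サ変・スル-基本形 1.0)',
    'EOS',
]
fixed = fix_tokens(sentence, mecab_lines)
print('\n'.join(fixed))
```

With the example above, the byte spans 0:9, 9:12, 12:18, and 18:24 become the character spans 0:3, 3:4, 4:6, and 6:8.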

fcbond commented 7 years ago

I like the idea of not needing a python package. If mecab+script is faster than mecab-python then I am happy to go with that (but I am not happy with bytes instead of characters).

Speed has not been a big issue for me so far, but faster is generally better, and fewer dependencies is definitely better.

goodmami commented 7 years ago

Do you care about accurate cfrom/cto values to the original string? E.g. should the following be given the same or different values?

バククが勉強する
   バククが  勉強する

awk, for instance, is not unicode-aware (gawk is, but it's not included in Ubuntu by default... i.e. another dependency), so we would need something like Python minimally for deciding that "が" is length 1 and not 3. But Python without any mecab dependencies should be simple and quick.

fcbond commented 7 years ago

It should definitely be to the original string (that was passed to jpn2yy).
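As a sketch of what spans relative to the original string could look like, the helper below walks the original sentence and locates each token's surface form in order. `align_tokens` is a hypothetical name, and it assumes the tokenizer returns surface forms exactly as they appear in the input (no normalization):

```python
def align_tokens(original, forms):
    """Return (cfrom, cto) character spans of each form within `original`."""
    spans = []
    pos = 0
    for form in forms:
        start = original.find(form, pos)   # skips any intervening whitespace
        if start < 0:
            raise ValueError('token %r not found after position %d' % (form, pos))
        end = start + len(form)
        spans.append((start, end))
        pos = end
    return spans

forms = ['バクク', 'が', '勉強', 'する']
plain = 'バククが勉強する'
# Same sentence with three leading spaces and two spaces in the middle.
spaced = '   ' + 'バククが' + '  ' + '勉強する'

plain_spans = align_tokens(plain, forms)
spaced_spans = align_tokens(spaced, forms)
print(plain_spans)
print(spaced_spans)
```

Under this scheme the two example strings get different values: the spaced variant shifts every span by the whitespace that precedes it.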

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 7 years ago

OK, here are some timings:

# mecab and minimal python
goodmami@tpy:~/grammars/jacy$ time cat tc.ja | utils/jpn2yy >/dev/null
real    0m0.282s
user    0m0.384s
sys     0m0.032s
# python with mecab-python
goodmami@tpy:~/grammars/jacy$ time cat tc.ja | python utils/jpn2yy.py >/dev/null
real    0m0.427s
user    0m0.396s
sys     0m0.036s
# python with janome
goodmami@tpy:~/grammars/jacy$ time cat tc.ja | python3 utils/jpn2yy.janome.py >/dev/null
real    0m8.028s
user    0m7.892s
sys     0m0.124s
# mecab only
goodmami@tpy:~/grammars/jacy$ time cat tc.ja | mecab >/dev/null
real    0m0.085s
user    0m0.076s
sys     0m0.012s

The tc.ja file is 4421 sentences from the Tanaka corpus. We can't beat pure mecab, but the minimal python version is about 1/3 faster than the current version (0.282s vs. 0.427s). The Janome one is about 19x slower than the current one. The minimal one has no python dependencies, but it still, of course, requires the mecab binary to be installed. Shall I check in the minimal-python script?
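For anyone who wants to recompute the comparisons, here is a small sketch that derives the ratios from the `real` times listed above, using the mecab-python version as the baseline. The dictionary keys are just labels chosen for this example:

```python
# Wall-clock ("real") times in seconds, copied from the runs above.
timings = {
    'mecab + minimal python': 0.282,
    'mecab-python (current)': 0.427,
    'janome': 8.028,
    'mecab only': 0.085,
}

baseline = timings['mecab-python (current)']
# Ratio below 1.0 means faster than the current script, above 1.0 slower.
ratios = {name: secs / baseline for name, secs in timings.items()}

for name, secs in sorted(timings.items(), key=lambda kv: kv[1]):
    print('%-24s %6.3fs  %6.2fx' % (name, secs, ratios[name]))
```

These are single runs on one machine, so the ratios are only a rough guide.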

goodmami commented 7 years ago

I added the jpn2yy script but didn't delete the old one yet. I will close the issue, but please try it out and make sure it works for you. If not, reopen the issue with a description of the problem.