Kimtaro / ve

A linguistic framework that's easy to use.
MIT License
215 stars 25 forks

いじめっ子 parses as two instead of one #18

Closed vietqhoang closed 10 years ago

vietqhoang commented 10 years ago
2.1.1 :001 > Ve.in(:ja).words("いじめっ子").collect(&:word)
 => ["いじめ", "っ子"] 
2.1.1 :002 > Ve.in(:ja).words("苛めっ子").collect(&:word)
 => ["苛め", "っ子"] 
Kimtaro commented 10 years ago

Oops, that looks like a bug. Will investigate.

fasiha commented 9 years ago

Interesting! I see that this happens only with IPADIC: Jumandic parses both strings as one word. I don't know enough to say whether there are words that IPADIC parses correctly while Juman splits incorrectly, though.

Interestingly, the JDepP bunsetsu chunker trained on IPADIC will correctly assign both "words according to IPADIC" to the same bunsetsu:

$ echo "いじめっ子" | mecab | jdepp -m jdepp-ipa/model/knbc 2>/dev/null | to_chunk.py 
# S-ID: 1; J.DepP
いじめ っ子 EOS

(Sorry, I realize that this output may be meaningless without some exposure to how JDepP works. to_chunk.py is a JDepP helper that puts | between bunsetsu.)

Question: Is the words function in MecabIpadic basically implementing a bunsetsu chunker like JDepP? (I ask especially because of the comment on line 183: "This is becoming very big.")

(As an aside, I only thought to check this with Jumandic because JDepP uses it by default (no idea why). But I retrained it on IPADIC to see what it did when its underlying dictionary was incorrectly splitting things up. JDepP is also handy because it is a dependency parser: it computes the dependencies between the bunsetsu it chunks. And finally, sorry if you know all this :).)

Kimtaro commented 9 years ago

How MeCab parses a sentence depends a lot on the dictionary it was trained on: IPADIC, Naist-Jdic, Unidic, etc. So it's hard to make comparisons between them. On beta.jisho.org I amend the training data with my own data, so I get different results from the standard MeCab+IPADIC setup.

I'll comment on JDepP in #12.

Ve is not doing bunsetsu chunking; the idea is quite different. Ve is a frontend to parsers like MeCab that makes their output useful to people without linguistic training.

MeCab is a morphological parser, but most people don't know what morphemes are, so Ve tries to reanalyze the MeCab output into "words" and a handful of commonly known parts of speech. The idea is to do the same for any language, so that there's a unified interface to simple linguistic parsing. That's its vague mission statement :)
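To make the reanalysis idea concrete, here is a minimal Ruby sketch (not Ve's actual code) of the kind of merging involved: gluing a hypothetical IPADIC-style suffix (接尾) morpheme onto the word before it, so いじめ + っ子 comes out as one word:

```ruby
# Toy token: surface form plus IPADIC-style part-of-speech fields.
Token = Struct.new(:surface, :pos, :pos2)

# Merge MeCab-style morphemes into reader-friendly "words": a suffix
# morpheme (pos2 == "接尾") is attached to the preceding word.
def merge_words(tokens)
  words = []
  tokens.each do |t|
    if t.pos2 == "接尾" && words.any?
      words[-1] += t.surface   # attach suffix to previous word
    else
      words << t.surface       # otherwise start a new word
    end
  end
  words
end

# Hypothetical IPADIC-style analysis of いじめっ子: verb stem + suffix noun.
tokens = [
  Token.new("いじめ", "動詞", nil),
  Token.new("っ子", "名詞", "接尾"),
]
p merge_words(tokens)  # => ["いじめっ子"]
```

Real rules would need far more cases (inflections, auxiliaries, particles), which is presumably why that line-183 comment says it is becoming very big.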