Kimtaro / ve

A linguistic framework that's easy to use.
MIT License

Ve for Rust #57

Open jannisbecker opened 1 year ago

jannisbecker commented 1 year ago

Hi,

first of all, thank you for this wonderful work! Since MeCab by itself does a fairly mediocre job of splitting text into actual words, I had been wondering how sites like jisho.org do their sentence splitting, and eventually landed here 👋

Since I needed this tech in an upcoming desktop app for Japanese learners (which includes OCR, sentence splitting, dictionary lookups and more), I took it upon myself to port Ve's ipadic parser to Rust: https://github.com/jannisbecker/ve-rs. So far it seems to work great, down to having the same reported bugs as Ve 😄

I'm still fairly new to Rust so this was a pleasant learning experience as well. I'd like to share my experience here diving into Ve's sentence post-processing code and things I changed or wondered about while porting the code:

While not necessary for my own project, I might take it upon myself to port Ve's other parsers (and its general structure) as well, in order to provide feature parity. For anyone looking for MeCab+IPADIC sentence splitting in Rust right now, though, you can use it as is.

Kimtaro commented 1 year ago

This is fantastic, thank you @jannisbecker! I have updated the Ve Readme to include a link to your fork.

It's been a long time since I wrote the Ve MeCab parser, but I believe that I got the possible values for each POS level from the IPADIC user's manual: https://ja.osdn.net/projects/ipadic/docs/ipadic-2.7.0-manual-en.pdf/en/1/ipadic-2.7.0-manual-en.pdf.pdf

You can also download IPADIC (https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM, via https://taku910.github.io/mecab/#download) and then parse through the CSV files. That should give you a comprehensive set of all the possible values.
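As a rough sketch of that CSV approach, the Rust snippet below collects the distinct values seen at each of the four POS levels. It assumes the usual IPADIC row layout (surface, left id, right id, cost, then the four POS levels in columns 5 to 8) and that the lines have already been decoded to UTF-8, since the stock IPADIC files ship as EUC-JP. The function name `collect_pos_values` is made up for illustration, and the naive comma split is a shortcut, not a full CSV parser.

```rust
use std::collections::BTreeSet;

/// Collects the distinct values seen at each of the four POS levels.
///
/// Assumptions (not taken from Ve itself): every line is one IPADIC
/// dictionary row, already decoded to UTF-8, with the POS levels at
/// 0-based column indices 4 through 7.
fn collect_pos_values<'a, I>(lines: I) -> [BTreeSet<String>; 4]
where
    I: IntoIterator<Item = &'a str>,
{
    let mut levels: [BTreeSet<String>; 4] = Default::default();
    for line in lines {
        // Rows whose surface form is quoted (for example a literal comma)
        // would break a naive split, so this sketch simply skips them.
        // A real CSV parser would handle them properly.
        if line.starts_with('"') {
            continue;
        }
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() < 8 {
            continue; // blank or malformed row
        }
        for (level, value) in fields[4..8].iter().enumerate() {
            levels[level].insert((*value).to_string());
        }
    }
    levels
}
```

Feeding it every line of every `.csv` file in the dictionary directory and printing the four sets should give the list of values per level; unused levels show up as the `*` placeholder.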

Thank you for the kind words, I'm glad that the code was easy to comprehend 😊

I just tested ハハ in MeCab with IPADIC, and got the same result without lemma and hatsuon. Ruby Ve handles this by just giving nil for those properties, but I should think about how to handle it better.
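For a Rust port, one way to mirror the nil behaviour is to model those properties as `Option`. The sketch below is only an illustration, not code from Ve or ve-rs: the `Features` struct and both function names are hypothetical, and it assumes the feature string is the comma-separated part of a MeCab output line, with lemma, reading and hatsuon at 0-based indices 6, 7 and 8. A field that is missing or holds the `*` placeholder becomes `None`.

```rust
/// The few feature fields this sketch looks at (hypothetical type).
#[derive(Debug, PartialEq)]
struct Features {
    pos: String,
    lemma: Option<String>,
    reading: Option<String>,
    hatsuon: Option<String>,
}

/// Returns `None` when the field is absent, empty, or the "*" placeholder.
fn optional_field(fields: &[&str], index: usize) -> Option<String> {
    fields
        .get(index)
        .filter(|value| !value.is_empty() && **value != "*")
        .map(|value| (*value).to_string())
}

/// Parses a comma-separated MeCab feature string.
/// Assumed layout: pos levels at 0..=3, inflection at 4..=5,
/// then lemma (6), reading (7), hatsuon (8).
fn parse_features(feature: &str) -> Features {
    let fields: Vec<&str> = feature.split(',').collect();
    Features {
        pos: fields.first().copied().unwrap_or("").to_string(),
        lemma: optional_field(&fields, 6),
        reading: optional_field(&fields, 7),
        hatsuon: optional_field(&fields, 8),
    }
}
```

With this shape, a token like ハハ that comes back without lemma and hatsuon parses to `None` for those fields instead of an empty string or a panic, and callers have to decide explicitly what to fall back to (for example the surface form).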

Unless you need it for your own purposes, I wouldn't stress about porting the rest of Ve. It's really only the MeCab+IPADIC part that anyone uses 😄