jannisbecker opened this issue 1 year ago
This is fantastic, thank you @jannisbecker! I have updated the Ve README to include a link to your fork.
It's been a long time since I wrote the Ve MeCab parser, but I believe that I got the possible values for each POS level from the IPADIC user's manual: https://ja.osdn.net/projects/ipadic/docs/ipadic-2.7.0-manual-en.pdf/en/1/ipadic-2.7.0-manual-en.pdf.pdf
You can also download IPADIC (https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM, via https://taku910.github.io/mecab/#download) and then parse through the CSV files. That should give you a comprehensive set of all the possible values.
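For reference, something along these lines could collect the distinct values per feature column — a minimal Rust sketch (Rust since that's what the port targets), assuming the CSVs have been converted from EUC-JP to UTF-8 beforehand and that no field contains an embedded comma:

```rust
use std::collections::BTreeSet;
use std::fs;

// Sketch: collect every distinct value per feature column across the
// IPADIC CSV files. Assumes the files were converted from EUC-JP to
// UTF-8 first (e.g. with iconv) and contain no embedded commas; the
// directory name is illustrative.
fn main() -> std::io::Result<()> {
    // Nine feature columns: pos1..pos4, inflection type/form,
    // lemma, reading, hatsuon.
    let mut seen: Vec<BTreeSet<String>> = vec![BTreeSet::new(); 9];

    for entry in fs::read_dir("ipadic-2.7.0")? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("csv") {
            continue;
        }
        for line in fs::read_to_string(&path)?.lines() {
            // Each row is: surface,left_id,right_id,cost, then the
            // nine feature columns.
            for (i, field) in line.splitn(13, ',').skip(4).enumerate() {
                seen[i].insert(field.to_string());
            }
        }
    }

    for (i, values) in seen.iter().enumerate() {
        println!("feature column {}: {} distinct values", i, values.len());
    }
    Ok(())
}
```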
Thank you for the kind words, I'm glad that the code was easy to comprehend 😊
I just tested ハハ in MeCab with IPADIC and got the same result, without lemma and hatsuon. Ruby Ve handles this by just giving nil for those properties, but I should think about how to handle it better.
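In Rust, the natural counterpart to Ve's nil would be Option — a minimal sketch, not taken from either codebase, with the three trailing features modeled as Option<String>:

```rust
// Illustrative sketch: model the three trailing features as Option so a
// six-field feature string maps to None, like Ruby Ve's nil. The names
// here are mine, not ve-rs's actual types.
struct Features {
    pos1: String,
    pos2: String,
    pos3: String,
    pos4: String,
    inflection_type: String,
    inflection_form: String,
    lemma: Option<String>,
    reading: Option<String>,
    hatsuon: Option<String>,
}

fn parse_features(raw: &str) -> Option<Features> {
    let f: Vec<&str> = raw.split(',').collect();
    if f.len() < 6 {
        return None; // malformed feature string
    }
    // Treat both a missing field and "*" as absent.
    let opt = |i: usize| f.get(i).filter(|s| **s != "*").map(|s| s.to_string());
    Some(Features {
        pos1: f[0].to_string(),
        pos2: f[1].to_string(),
        pos3: f[2].to_string(),
        pos4: f[3].to_string(),
        inflection_type: f[4].to_string(),
        inflection_form: f[5].to_string(),
        lemma: opt(6),
        reading: opt(7),
        hatsuon: opt(8),
    })
}
```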
Unless you need it for your own purposes, I wouldn't stress about porting the rest of Ve. It's really only the MeCab+IPADIC part that anyone uses 😄
Hi,
First of all, thank you for this wonderful work! Given that MeCab by itself does a mediocre job of splitting into actual words, I've been wondering how sites like jisho.org do their sentence splitting, and eventually landed here 👋
Since I needed this tech in an upcoming desktop app for Japanese learners (which includes OCR, sentence splitting, dictionary lookups and more), I took it upon myself to port Ve's IPADIC parser to Rust: https://github.com/jannisbecker/ve-rs. So far it seems to work great, down to having the same reported bugs as Ve 😄
I'm still fairly new to Rust, so this was a pleasant learning experience as well. I'd like to share my experience diving into Ve's sentence post-processing code, along with things I changed or wondered about while porting it:
Rust requires strict definition of data structures, so I turned the POS variants, grammar variants, etc. into enum definitions. The first thing I noticed was that Ve's codebase treats all POS levels the same way: essentially, pos, pos2, pos3, pos4, inflection_type and inflection_form can theoretically hold any POS variant, which I assume doesn't really happen in practice with MeCab (say, pos2 or even pos4 being classified as Meishi). I played around with dividing the variants into separate enums for POS1-4, InflectionType and InflectionForm, but I lacked knowledge of exactly which field can hold which variant, so I rolled back the change (a sketch of what this could look like follows below). If MeCab provided an exact list of which field can contain which variant, it might be an idea to split them up in code as well, for clarity and ease of development.
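For illustration, the split could look like the following — variant lists abbreviated, taken loosely from the IPADIC manual, and not a verified field-to-variant mapping:

```rust
// Hypothetical split into per-field enums; variants abbreviated and
// not a complete or verified mapping.
enum Pos1 {
    Meishi, // 名詞 (noun)
    Doushi, // 動詞 (verb)
    Joshi,  // 助詞 (particle)
    Kigou,  // 記号 (symbol)
    // ...
}

enum Pos2 {
    Ippan,          // 一般 (general)
    KoyuuMeishi,    // 固有名詞 (proper noun)
    Setsuzokujoshi, // 接続助詞 (conjunctive particle)
    // ...
}

enum InflectionForm {
    Kihonkei,  // 基本形 (plain form)
    Renyoukei, // 連用形 (continuative form)
    // ...
}
```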
Of course, understanding the whole set of implemented rules that alter tokens into words was not possible, particularly since I'm not versed in MeCab's classification. Other than that, the code made a lot of sense (things like eating up the next token or merging with the previous one, merging fields, altering the resulting POS based on rules, etc.). All in all it was a very smooth process porting it, even though I had never seen a line of Ruby before 👍
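To give a flavor of what such a rule looks like, here's a rough, illustrative sketch (types and the rule itself are simplified examples, not ve-rs's actual code) of a word "eating" the token that follows it:

```rust
#[derive(Clone)]
struct Token {
    surface: String,
    pos: String,  // e.g. 名詞
    pos2: String, // e.g. 接尾
}

// Simplified example of the token-to-word merging pass: a noun "eats"
// a following suffix token by concatenating surfaces. The real rules
// also merge other fields and may rewrite the resulting POS.
fn merge_words(tokens: &[Token]) -> Vec<Token> {
    let mut words = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let mut word = tokens[i].clone();
        i += 1;
        while i < tokens.len() && word.pos == "名詞" && tokens[i].pos2 == "接尾" {
            word.surface.push_str(&tokens[i].surface);
            i += 1;
        }
        words.push(word);
    }
    words
}
```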
I noticed that with my tokenizer, using a normal IPADIC 2.7.0 dictionary, there are rare cases where the feature string of a token didn't split into nine features but only six, leaving out lemma, reading and hatsuon completely instead of marking them with an asterisk. One token where this happened was ハハ. It might just be a bug in the tokenizer I used, but I had to account for it when destructuring the feature string.
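One defensive way to account for this is to pad the split fields back to nine with "*" before destructuring — a sketch under that assumption, not necessarily how ve-rs actually handles it:

```rust
// Sketch: pad a short feature list to nine entries with "*" so a
// six-field string (as seen for ハハ) destructures like any other.
// pad_features is an illustrative name, not the actual ve-rs function.
fn pad_features(raw: &str) -> [String; 9] {
    let mut fields: Vec<String> = raw.split(',').map(str::to_string).collect();
    fields.resize(9, "*".to_string());
    fields.try_into().expect("exactly nine feature fields")
}

fn main() {
    // Illustrative six-field input; the real ハハ entry may differ.
    let [pos1, _pos2, _pos3, _pos4, _itype, _iform, lemma, reading, hatsuon] =
        pad_features("名詞,一般,*,*,*,*");
    println!("{pos1} lemma={lemma} reading={reading} hatsuon={hatsuon}");
}
```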
While not necessary for my own project, I might take it upon myself to port the other parsers (and the general structure) of Ve as well, to provide feature parity. For anyone looking for MeCab+IPADIC sentence splitting in Rust right now, though, you can use it as is.