Documentation of output needed

buruzaemon commented 9 years ago

Suppose I parse a sentence, and then I look at one of the tokens from the sentence. I can get some information from that token, but what precisely does the information mean?

The word "彼女" has char_type of 2. What does 2 mean? What do the other constants for char_type indicate? I assume there's a list of this somewhere.

For "彼女", stat is 0. Is this important? What is this stat field?

Let's look at the very useful feature attribute. For "彼女" it is "名詞,代名詞,一般,*,*,*,彼女,カノジョ,カノジョ".

If we split by comma, we get something like grammatical terms for the first two or three fields and some kind of written or phonetic readings for the last three. What precisely goes where, and when? I assume this is generated by MeCab, but of course the MeCab documentation, if it exists, is in Japanese.
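To make the comma split concrete, here is a minimal Ruby sketch. It assumes the IPADIC dictionary, whose feature string lists part of speech, three part-of-speech subcategories, conjugation type, conjugation form, base form, reading, and pronunciation, in that order. The symbol names and the parse_feature helper are illustrative labels only, not part of natto or MeCab, and other dictionaries may use a different layout.

```ruby
# Illustrative labels for the nine comma-separated fields, assuming IPADIC.
# These names are NOT defined by natto or MeCab.
IPADIC_FIELDS = %i[
  pos pos_sub1 pos_sub2 pos_sub3
  conj_type conj_form base_form reading pronunciation
].freeze

# Hypothetical helper: turn a feature string into a Hash keyed by the labels above.
# If a token carries fewer fields, the missing ones come back as nil.
def parse_feature(feature)
  IPADIC_FIELDS.zip(feature.split(",")).to_h
end

fields = parse_feature("名詞,代名詞,一般,*,*,*,彼女,カノジョ,カノジョ")
puts fields[:pos]       # 名詞
puts fields[:pos_sub1]  # 代名詞
puts fields[:reading]   # カノジョ
```

An asterisk means the field does not apply to that token, which is why a pronoun such as 彼女 has no conjugation type or form.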

If nothing else, a list of the possible values for these fields, and a table showing what integers indicate what, would be of great value. Of course anyone using this code should have some idea what the grammar terms themselves mean, but on the other hand, we cannot guess beforehand which terms will show up.

Originally reported by Douglas Perkins on 2014-02-16. This issue was ported from Bitbucket and is archived for historical reasons.

buruzaemon commented 9 years ago

Hello, Douglas:

Sorry for the delay in responding, but thank you for raising this issue.

To clarify, it looks like you raise 2 issues...

1. There is a need to document the attributes of Natto::MeCabNode.
2. There is a need to document the output format of MeCab.

Let me know if I understand you correctly. Further, it would really help if you could explain what you would like to do with natto and MeCab. I browsed your GitHub repo, but couldn't quite understand what you are trying to do. It would be very helpful if you could provide some context.

btw, since your request involves only documentation, I am going to downgrade priority from major to minor.

buruzaemon commented 9 years ago

Dear Brooke,

Regarding the issues I was hoping to raise, exactly as you say.

I've been writing scripts to do various things. One thing I'd like to do is have a list of sentences in Japanese and from it produce a list of the same sentences but with kanji converted to kana. I can do that already, because there are some great examples on the project Wiki here. :-) Of course there will be issues with words that can be read in multiple ways, but that's unavoidable.

Merely producing kana, though, is not so great because it's hard to parse by eye. So I'd like to make kana but put spaces between words. As you know, determining where such spaces would go involves looking at some fairly specific grammar. For example, Natto parses １９８５ as four separate nouns. I wouldn't want to put blank spaces between them, but I would between other nouns. And I personally find that keeping verb endings together with the verb stem is useful. I would rather see たべている than たべ て いる.

This is somewhat of a digression, but perhaps you have some insight, so I'll write just one more paragraph. What I'd like to do with that data, at least right now, is to make tab-separated text files where each line has the structure JAPANESE [tab] PHONETIC READING [tab] ENGLISH. I have the Japanese and English all done and am working on the phonetic reading. The end goal is to put that information into Anki to make flash cards. Anki supports HTML and CSS, so it might be better to use furigana (with small kana characters sitting above kanji). I haven't tried doing this kind of furigana work before, but it would alleviate the need to figure out spacing between words.
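For illustration, the two output shapes described here could be sketched in Ruby as follows. Both helpers (tsv_row and ruby_markup) are hypothetical names, the sample sentence is made up, and the furigana variant simply uses the standard HTML ruby and rt elements; whether that suits a given Anki card template is an assumption.

```ruby
# Hypothetical helper: build one JAPANESE<tab>READING<tab>ENGLISH line.
# Tabs and newlines inside a field are collapsed to a space so they cannot break the row.
def tsv_row(japanese, reading, english)
  [japanese, reading, english].map { |field| field.gsub(/[\t\r\n]+/, " ") }.join("\t")
end

HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Hypothetical helper: wrap one word in HTML ruby markup so the reading sits above it.
def ruby_markup(surface, reading)
  esc = ->(text) { text.gsub(/[&<>"]/, HTML_ESCAPES) }
  "<ruby>#{esc.call(surface)}<rt>#{esc.call(reading)}</rt></ruby>"
end

puts tsv_row("彼女は学生です。", "かのじょ は がくせい です。", "She is a student.")
puts ruby_markup("彼女", "かのじょ")  # <ruby>彼女<rt>かのじょ</rt></ruby>
```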

Also, I'm scared of magic numbers.

buruzaemon commented 9 years ago

Hi there, Douglas:

> Regarding the issues I was hoping to raise, exactly as you say.

Fair enough.

What I will do is this: for Natto::MeCabNode, I can add comments into the source file, which will eventually end up in the documentation for MeCabNode. Until I do another release, you will have to refer to the comments in the source code only, as that documentation is generated automagically with each gem version release. I am trying to keep the gem version in sync with the MeCab version.

For the formatting options of MeCab, I can put up another page on the project Wiki. Come to think of it, perhaps the only mention in English of the formatting rules in MeCab is an old SO question I answered. I always meant to write up something in English eventually. And thanks for the context. Keep in mind that natto is merely a wrapper, and so it will only do what is possible in MeCab. Your examples for １９８５ and たべている illustrate this, as that is precisely what MeCab is doing.

If what you are interested in is some form of segmentation, I would guess that it might be possible to use the posid (part-of-speech ID) values to determine which segments should be joined and which shouldn't, and post-process the output of -Owakati or something. If you are curious, you could start by having a look at pos-id.def in your ipadic install dir.
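As a rough sketch of that kind of post-processing, the Ruby below joins adjacent tokens by part of speech. It keys on the part-of-speech strings from the feature field rather than on posid numbers, to avoid hard-coding ID values here; the same rules could be expressed with the IDs listed in pos-id.def. The Token struct, the joining rules, and the hand-written token list are all assumptions standing in for real parser output, and the rules are only a starting heuristic.

```ruby
# Stand-in for a parsed token: surface form plus the first two feature fields.
Token = Struct.new(:surface, :pos, :pos_sub1)

# Hypothetical rule set: should `cur` be glued onto the token before it?
def attach_to_previous?(prev, cur)
  return false if prev.nil?

  number = ->(t) { t.pos == "名詞" && t.pos_sub1 == "数" }
  return true if number.call(prev) && number.call(cur)          # digit followed by digit
  return true if cur.pos == "助動詞"                             # auxiliary verb
  return true if cur.pos == "助詞" && cur.pos_sub1 == "接続助詞"  # conjunctive particle such as て
  return true if cur.pos == "動詞" && cur.pos_sub1 == "非自立"    # dependent verb such as いる after て

  false
end

# Walk the tokens once, appending each surface to the previous group when the rule fires.
def join_segments(tokens)
  groups = []
  prev = nil
  tokens.each do |tok|
    if attach_to_previous?(prev, tok)
      groups[-1] += tok.surface
    else
      groups << tok.surface
    end
    prev = tok
  end
  groups
end

# Hand-written tokens, assumed to mirror how the two examples are split.
tokens = [
  Token.new("１", "名詞", "数"), Token.new("９", "名詞", "数"),
  Token.new("８", "名詞", "数"), Token.new("５", "名詞", "数"),
  Token.new("食べ", "動詞", "自立"), Token.new("て", "助詞", "接続助詞"),
  Token.new("いる", "動詞", "非自立")
]
puts join_segments(tokens).join(" ")  # １９８５ 食べている
```

The same grouping could be applied to readings instead of surfaces to get spaced kana.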

buruzaemon commented 9 years ago

That sounds marvelous. This gem is very neat, and it's great to have you maintaining it. I'd be very happy to look on the Wiki, in the source code, or both. I'll take a look at the pos-id.def.

Ah, I was looking at the wiki yesterday, specifically the 振り仮名変換 (furigana conversion) example. The code there, when I ran it, didn't produce kana for お願い -- it left it as kanji. Apparently in some cases, but only a few, the char_type of 6 needs to be converted to kana as well. That's when I started to wonder what those magic numbers meant.
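One way to sidestep the char_type number for this particular case is to test the surface text itself for kanji and convert the katakana reading to hiragana. This is a different technique from the wiki example, not a description of it. The sketch below uses plain Ruby string operations; the three helper names are hypothetical, and the reading オネガイ for お願い is assumed to be what the dictionary returns.

```ruby
# Map the katakana block onto the hiragana block, character by character.
# The long-vowel mark ー lies outside the range and is left as it is.
def katakana_to_hiragana(text)
  text.tr("ァ-ヶ", "ぁ-ゖ")
end

# True if the surface contains at least one kanji (Unicode Han script).
def contains_kanji?(surface)
  surface.match?(/\p{Han}/)
end

# Hypothetical helper: use the reading only when the surface has kanji and a reading exists.
def to_kana(surface, reading)
  return surface unless reading && contains_kanji?(surface)

  katakana_to_hiragana(reading)
end

puts to_kana("彼女", "カノジョ")   # かのじょ
puts to_kana("お願い", "オネガイ")  # おねがい
puts to_kana("は", "ハ")           # は
```

Because the check looks at the characters rather than the token's category, a word that mixes kana and kanji, such as お願い, is converted like any other.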

buruzaemon commented 9 years ago

Added descriptions of members to code comments in Natto::MeCabNode and Natto::DictionaryInfo.

buruzaemon commented 9 years ago

Added new wiki page on stat values in Natto::MeCabNode.

buruzaemon commented 9 years ago

Added another new wiki page on char_type in Natto::MeCabNode.

buruzaemon commented 9 years ago

Finally added the new wiki page on posid in Natto::MeCabNode.

buruzaemon commented 9 years ago

Also added Appendix F: Output Formatting and a new section 出力フォーマットの指定 (Specifying the output format) under the Usage section.

This should cover the detailed documentation for using MeCab and natto.

buruzaemon commented 9 years ago

changed status to resolved

Resolved 2014-05-21.

buruzaemon / natto

Documentation of output needed #13