LaurensWeyn / Spark-Reader

A tool to assist non-native speakers in reading Japanese
GNU General Public License v3.0

Question: kuromoji #4

Open wareya opened 7 years ago

wareya commented 7 years ago

I recently figured out how to use kuromoji, and, theoretically speaking, I could see what happens if I force the segmenter to only create a split in places where kuromoji put one. This would still allow the deconjugator to work, and it would fix cases like Nはこいつ, which my fork of Spark Reader (and presumably the original) currently splits wrong. Basically, it would prevent the parser from making segments that are inconsistent with where kuromoji draws the boundaries between lexemes.
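As a rough sketch of what I mean (assuming the atilika kuromoji-ipadic API for tokenize() and getSurface(); the class and method names are otherwise made up):

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.HashSet;
import java.util.Set;

class KuromojiBoundaries
{
    // Collect the character offsets where kuromoji starts a token;
    // the segmenter would then refuse to cut the line anywhere else.
    static Set<Integer> boundaries(Tokenizer tokenizer, String line)
    {
        Set<Integer> out = new HashSet<>();
        int offset = 0;
        for (Token token : tokenizer.tokenize(line))
        {
            out.add(offset);
            offset += token.getSurface().length();
        }
        out.add(offset); // the end of the line is always a legal boundary
        return out;
    }
}
```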

If I wanted to just fix this one example, I would probably add a way to say "if はこ is followed by いつ, rewrite them into は followed by こいつ", which is really awful. Either that, or make the segmenter much smarter than it is right now, which would basically mean reimplementing half of mecab or kuromoji.

The problem is that kuromoji is really big. The smallest version of it, kuromoji-ipadic, comes in at 13MB. If kuromoji were added in the way described above, it would probably be easy to make two builds of Spark Reader: one with kuromoji compiled in, one without it. If it were added in an invasive way that lets the segmenter rely on kuromoji more, you couldn't do that unless you encapsulated and abstracted kuromoji's segmenting behavior and provided a "dummy" version of it for when kuromoji is compiled out or not available.
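The encapsulation could be as thin as an interface with a no-op fallback, so the kuromoji-free build never touches the real class; all names here are just illustrative:

```java
interface SegmentHints
{
    // true if the analyzer allows a word boundary at this character offset
    boolean canSplitAt(String line, int position);
}

class DummyHints implements SegmentHints
{
    public boolean canSplitAt(String line, int position)
    {
        return true; // no analyzer available: allow every split, old behavior
    }
}
```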

Spark Reader's "dumb" segmenter is very good at separating terms that don't inflect and don't have confusing overlap, but kuromoji is good at syntactically complex stuff like strings of hiragana and conjugations. There are a few cases where kuromoji might separate terms wrongly, like complicated katakana names, and I don't know all the ways it might go wrong. Presumably, the fact that Spark Reader lets the user "fix" parsing mistakes would gloss over the problem.

There's also the fact that kuromoji is, itself, more code, so if it were added in a simple way, it would probably just slow the segmenting process down. If kuromoji were added in an invasive way, it might actually make the segmenting process faster, since the only time you would have to look forward in kuromoji's segment list is when you're looking at a segment that can inflect (<-- big deal!)

LaurensWeyn commented 7 years ago

I'm alright with you adding alternative modes, but as you said, perhaps make them all optional. I wouldn't know how to set up the build to only include some libraries sometimes and not cause compiler errors when those libraries are excluded. Right now, if you look at the code, JNA is always included and there's code in place that essentially disables it on Linux (even though JNA isn't used for anything useful yet, even on Windows). The new Epwing support works in a similar way.

Perhaps don't initialise kuromoji if the user decides to use the internal splitter, or no splitter at all, but do still have it there.
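Something like a lazily initialised holder, just to illustrate (class and field names are made up):

```java
import com.atilika.kuromoji.ipadic.Tokenizer;

class KuromojiHolder
{
    private static Tokenizer instance;

    // Only pay kuromoji's startup time and memory cost on first use.
    static synchronized Tokenizer get()
    {
        if (instance == null)
            instance = new Tokenizer();
        return instance;
    }
}
```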

On a related note, is it possible to add your new deconjugator alongside my 'legacy' one, and let the user choose which one? Yours would probably become the default, but I'd like to keep the old one even if only for testing purposes (unless that becomes hard to maintain due to other changes, in which case I'll have to drop it, of course).

wareya commented 7 years ago

Thanks for the response.

I'll polish up the deconjugator changes and make a pull request before looking at kuromoji. Making it coexist with the old one is a good idea, and the one thing Java's invasive OOP is good for.

For kuromoji I might look for a way to make it a separate artifact, though I have no idea if that in particular is a good way to do it. In the meantime I wrote a frequency list generator with it: https://github.com/wareya/analyzer

wareya commented 7 years ago

Added it to a branch. It's ready for me to start testing it. https://github.com/wareya/Spark-Reader/tree/kuromoji


Since it still has the same text splitter underlying it (trying to piece segments together instead of characters), things like 私はそう still segment as 私|はそう instead of 私|は|そう. That's just how it's going to be unless the splitter is allowed to be aware of the morphological information kuromoji gives for each segment. This is a lot easier to fix with a blacklist than the general "bad parsing of hiragana sequences" problem would be, though.

LaurensWeyn commented 7 years ago

That's very cool.

Can you still override it? For example, if it were actually ことし but kuromoji forced the parser to see a split there, can you still force it to read ことし with manual splits?

The vague initial idea I had was to have kuromoji place 'manual splits' (perhaps a different color to avoid confusion) for the old parser that the user could remove or move elsewhere to fix errors, but if this method can be overridden as well it may be a neater solution.

wareya commented 7 years ago

Kuromoji is only invoked after split() has already done its job, inside splitSegment(), so manual splits still work. I'd never add anything if it meant manual splits wouldn't work.

Kuromoji's segmentation is too granular for the parser to make a split everywhere kuromoji makes one; even compound nouns would be split in half most of the time. Though I am considering a way for the segmenter to say, "no, definitely don't combine this segment with the one right after it".
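Roughly, with invented names, that would be a hard-boundary flag that the recombination step has to respect:

```java
class Segment
{
    String text;
    boolean hardBreakAfter; // set when kuromoji is confident about the cut

    // The real merge logic would also check the dictionary etc.;
    // this only shows where the flag would be consulted.
    boolean canMergeWith(Segment next)
    {
        return !hardBreakAfter;
    }
}
```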

wareya commented 7 years ago

The kuromoji branch is starting to feel mature. I use heuristics to coerce the word splitter to avoid making certain known bad splits, ones that aren't handled well by a blacklist, and I seem to have the right pieces in place to make it work well. The word splitter is still kind of complicated. There's a lot I want to do, but it mostly has to do with improving the heuristics or fixing possible code quality problems in the word splitter.

If there's anything you want to say about the current word splitter in the kuromoji branch, I'd be happy to hear it. Even I think it's kind of nasty right now, but I can't see a straightforward way to simplify it without making it too abstract or breaking some of my regression tests.

LaurensWeyn commented 7 years ago

I've had a quick look at the code.

wareya commented 7 years ago

> For the current blacklist system: if a word is blacklisted, can it currently be removed from the blacklist through the UI? If it's never matched, I'm not sure it will show up to let you untick that option.

Not yet. If a way gets added, it would probably be through the settings window. Right now, you can use the dropdown menu to un-blacklist something you blacklisted by accident, as long as the parser doesn't rerun in the meantime. I've had to delete things from the blacklist two or three times already, so I know a general way to delete things from the blacklist is necessary.

> Perhaps the blacklist can be taken into account right after loading the dictionary, where blacklisted words are removed from the lookup data structure, and perhaps moved to a separate data structure for some UI to allow removing them, or for some 'forced' lookup mode if it's ever added.

The reason I didn't do this is to keep it simple for when I add a way to remove blacklist entries through the settings menu etc. I haven't really thought about the implications of doing the blacklist any other way. It just seems bloated to rebuild the definition list, compared to having the word splitter check the blacklist whenever it checks a definition. Maybe I could make a dictionary "cache" and only use it as long as the blacklist hasn't been changed.
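That cache could be as simple as a filtered copy plus a version counter; Definition and Blacklist here are hypothetical stand-ins for whatever the real code uses:

```java
import java.util.ArrayList;
import java.util.List;

class Definition {} // hypothetical stand-in

interface Blacklist // hypothetical stand-in
{
    int version(); // bumped every time the blacklist changes
    boolean contains(Definition d);
}

class FilteredDictionary
{
    private List<Definition> cache;
    private int builtAgainstVersion = -1;

    // Rebuild the filtered list only when the blacklist actually changed.
    List<Definition> lookup(Blacklist blacklist, List<Definition> all)
    {
        if (builtAgainstVersion != blacklist.version())
        {
            cache = new ArrayList<>();
            for (Definition d : all)
                if (!blacklist.contains(d))
                    cache.add(d);
            builtAgainstVersion = blacklist.version();
        }
        return cache;
    }
}
```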

> The project seems to still be set up with a sub-project for 'heavy' mode, with 2 entry points. Obviously there will need to be some way to switch to the Kuromoji parser through the UI.

The idea is that the non-heavy project is for people who want to use Spark Reader as a rikai replacement, so they don't have to worry about memory usage. The heavy one has the ability to turn kuromoji off completely in the settings and then acts the same way as the non-heavy project.

LaurensWeyn commented 7 years ago

I've now tried using the kuromoji branch for actual reading.

First off, it's certainly quite an improvement in immediate accuracy over the old parser, which is great. I find myself rarely having to correct it. However, when it does get things wrong, I have trouble getting it to match words that I know it should be able to match.

This happens a lot on names like 沙夜 and 暁斗. These are both in the user dictionary, but most of the time neither of them will match as one word; they instead match as individual characters, even when placing manual splits before and after them. They do work sometimes, however; it seems to depend on the sentence. This isn't just user dictionary words: もふもふ refuses to match when written in hiragana, but in katakana it matches just fine.

Though you could fix this with heuristics, there should be some way to force matching, perhaps falling back to the non-Kuromoji parser for the first word after a user split (or the word just before, or both) and parsing the rest with Kuromoji. I'm not sure how well that would work, since Kuromoji likely needs the whole sentence to parse the grammar properly. There are also some cases that I can't possibly expect Kuromoji to deal with well, like さをとめ, which should match as one word (it's the name of some store, and appears quite often). I doubt there's an easy way to tell Kuromoji that the を is not grammatical there, even with manual splits, if it's still trying to understand the whole sentence.

(On the topic of failed forced matches, 天ノ川 is the spelling of 沙夜's surname, and even without Kuromoji being active it hasn't been matching, presumably due to the 'split on writing systems' change. Quite a minor issue since it doesn't come up often, and I did spot some comments mentioning that turning off splitting on writing systems should be an option, so you've likely considered this already)

Other than the minor force matching issue, it's pretty impressive stuff, both your work and Kuromoji itself. I wish I understood Japanese grammar well enough to try and make something similar.

wareya commented 7 years ago

If a name is refusing to be parsed as a whole, either heuristics are enabled and there's a problematic heuristic, or heuristics are disabled and there's a bug in the word splitter. Please let me know about cases like this, with the full sentence spark reader is working on.

Heuristics are disabled by default because, even though they fix parsing errors, they can make it so that you really can't fix parsing errors, since the heuristics work by forcing segmentations.

Forcing matches would be interesting, but basically a second layer of heuristic imo. If the word splitter isn't bugging out and heuristics are disabled, nothing should stop the parser from identifying a particular word.

As a side note, I make a distinction between "heuristics" (which only work one way right now and happen in a specific part of the program logic) and "hacks" (which are arbitrary, happen in multiple places, and are enabled by default). The tests reflect this.

LaurensWeyn commented 7 years ago

Heuristics were on (this is actually the default at the moment); turning them off fixed some issues, but I'm still having problems matching words even with heuristics off. I've tried to show the issue here; the example text in this case is

その一つを見て、沙夜の目つきが険しくなる。

if you want to try to debug it. Add 沙夜 as a user dictionary entry.

Interestingly, 沙夜 on its own as the input also fails, but 沙夜の will parse. It doesn't do this on most other words, and turning off Kuromoji fixes it.

wareya commented 7 years ago

This is happening because 沙 and 夜の目 are both valid words, and that's how kuromoji is segmenting them. This is still very unintuitive. We need a way to tell kuromoji that custom dictionary entries exist.

When you place a manual split after 夜, it doesn't work even though it should, but that's not a kuromoji problem; there must be something else I broke somewhere. I'll have to find it and make a test.

When the segments are all marked as valid words and the word splitter can find each of them in the dictionary, there's no way to know when kuromoji's segments are actually bogus and it should only work on the per-character level. So to really fix this, it would be necessary to fork kuromoji so that we can teach it about custom words at runtime. I want to do this anyways for my text analyzer, so it won't be too much of a problem, but it'll take a while for me to get around to it.

wareya commented 7 years ago

Fixed the issue where manual splits wouldn't let it work. Improving the initial segmentation is up to either blacklisting 沙 or 夜の目, or me forking kuromoji like I plan on doing.

LaurensWeyn commented 7 years ago

Expecting it to work automatically with all special case user dictionary words, especially ones like さをとめ (which, interestingly enough, parses correctly fairly often as long as heuristics are off) is unreasonable for any non-sentient program. If it's fixable by manual splits it's fine.

I don't/didn't know that Kuromoji actually had a vocabulary; I thought it just understood grammatical structure. Good luck if you attempt to fork it.

wareya commented 7 years ago

Kuromoji has a lexeme surface form database that looks like this:

[...]
小野原,4790,4790,9770,名詞,固有名詞,人名,姓,*,*,[...]
おのぶ,4789,4789,10366,名詞,固有名詞,人名,名,*,*,[...]
お信,4789,4789,10366,名詞,固有名詞,人名,名,*,*,[...]
お延,4789,4789,10366,名詞,固有名詞,人名,名,*,*,[...]
尾登,4790,4790,9770,名詞,固有名詞,人名,姓,*,*,[...]
[...]

The third number in the value list is some kind of weight parameter. Kuromoji works by finding every possible surface token at every point in the sequence and making a graph of their possible weights and possible connections, then running the Viterbi algorithm over that graph. You can visualize it by going to their demo page, putting in something short like 我が名は, and selecting "Viterbi" mode.

I think kuromoji contains heuristics about what kinds of tokens can connect to what other kinds of tokens and how likely those connections are, and that's why it can pick between different versions of identical tokens. This might be part of the Viterbi algorithm; I don't know how Viterbi even works, just that it's what kuromoji uses.
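To make the lattice idea concrete, here's a toy shortest-path version of it, which is the essence of Viterbi over a lattice, not kuromoji's actual code. The lexicon and costs are made up, and the real thing also adds connection costs between adjacent parts of speech on top of the per-word weights:

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class LatticeSketch
{
    static final String[] WORDS = {"我が", "我", "が", "名", "は"};
    static final int[]    COSTS = { 100,   300,  200,  150,  50 };

    static List<String> bestPath(String text)
    {
        int n = text.length();
        int[] best = new int[n + 1];      // cheapest total cost to reach offset i
        int[] prev = new int[n + 1];      // where the token ending at offset i started
        String[] tok = new String[n + 1]; // the token ending at offset i
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int i = 0; i < n; i++)
        {
            if (best[i] == Integer.MAX_VALUE) continue; // unreachable offset
            for (int w = 0; w < WORDS.length; w++)
            {
                if (!text.startsWith(WORDS[w], i)) continue;
                int j = i + WORDS[w].length();
                if (best[i] + COSTS[w] < best[j])
                {
                    best[j] = best[i] + COSTS[w];
                    prev[j] = i;
                    tok[j] = WORDS[w];
                }
            }
        }
        // Walk the back-pointers to recover the cheapest segmentation.
        LinkedList<String> out = new LinkedList<>();
        for (int i = n; i > 0; i = prev[i])
            out.addFirst(tok[i]);
        return out; // [我が, 名, は] for "我が名は"
    }
}
```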

To make kuromoji recognize a particular word, it needs to have a surface form of it (as written in the text) present in the lexeme list. For my analyzer I have several kilobytes of proper names stuffed into that lexeme list, with maximally unlikely weights. The main problem is that the lexeme list is a binary blob generated when kuromoji is compiled; you can't really extend it at runtime. That's why I have to fork it, so I can add a way to extend the lexeme list at runtime. The dictionary my analyzer uses lacks a lexeme for 自転車, for example, so I have to make it so users can add custom entries.

FWIW, the Viterbi algorithm isn't really applicable to what the current word splitter is capable of doing, since the Viterbi algorithm is about "hidden Markov information", and kuromoji is already doing the best you can do with that type of information. It's wrong or misleading about readings and grammatical behavior often enough that its accuracy level is unsuitable for user-facing parsing, which is why I limit its involvement to creating "segments". Its accuracy is still good enough for corpus linguistics, since corpus linguistics is inherently probabilistic and a 1% or 2% error rate is just noise. Mecab has a worse error rate and was still considered usable for corpus linguistics. Mecab uses a Viterbi algorithm too, which is why I think kuromoji does more than just use the weights from the Viterbi graph directly.

wareya commented 7 years ago

It turns out kuromoji supports custom dictionaries, it's just not documented properly. It can load them from an InputStream while initializing the tokenizer. They have to be in the same format as whatever dictionary kuromoji is using internally, but for our purposes we can fill it with dummy data aside from the surface form and weight.
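Roughly, with the builder API from the atilika kuromoji builds (treat the exact method names as an assumption for other versions):

```java
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class UserDictLoader
{
    // "entries" is the user dictionary already serialized into whatever
    // format this kuromoji version expects.
    static Tokenizer withUserDictionary(String entries) throws IOException
    {
        InputStream in = new ByteArrayInputStream(
                entries.getBytes(StandardCharsets.UTF_8));
        return new Tokenizer.Builder()
                .userDictionary(in)
                .build();
    }
}
```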

wareya commented 7 years ago

Kuromoji now understands the user dictionary at a basic level. This should fix some parses. There are some caveats to how I format the dictionary for it right now (I treat everything as a noun with maximal cost), but it's a good start, especially considering most custom dictionary entries will act as nouns.

wareya commented 7 years ago

I might change Kuromoji over to the Unidic version. The reason I didn't do so before is that it's bigger, but the Unidic parsing is a lot better than the IPAdic parsing. And the whole point of Kuromoji is to make Spark Reader better for people who aren't just going to use it the same way as Rikai.

LaurensWeyn commented 7 years ago

I wonder if we could get it loading from the included Edict files. That way the matches would be more relevant (using another dictionary likely matches words that don't show up in Edict, or show up as multiple words instead of one phrase, or vice versa) and the file size would be smaller. It seems to load other data as well, like readings, which are discarded anyway.

I kind of want to try it myself; Viterbi sounds like fun. But university work is piling up and I haven't found the time to work on Spark Reader at all.

On another note, running Kuromoji with user dictionary items causes an indexOutOfBounds exception within Kuromoji in some cases, perhaps because some of those fields are left blank when loading them.

wareya commented 7 years ago

The dictionary format kuromoji needs is pre-deconjugated chunks with weights to distinguish likely words, so loading Edict stuff into it is out.

> On another note, running Kuromoji with user dictionary items causes an indexOutOfBounds exception within Kuromoji in some cases, perhaps because some of those fields are left blank when loading them.

I'll look into that, though I'll have to figure out how to reproduce it. I'm sure there's a deterministic way for me to do it.

wareya commented 7 years ago

I rewrote the word splitter again several days ago, and I've used it for long enough that I'm convinced it works the same way it used to.

https://github.com/wareya/Spark-Reader/blob/kuromoji/src/language/splitter/WordSplitter.java#L86

The user dictionary problem was because of how badly documented kuromoji is. In the latest stable version of kuromoji, the user dictionary has to be in an arbitrary lightweight format unrelated to its internal dictionary format; providing it entries in the same format as its internal dictionary only works for the unstable version from git. I was previously feeding it the full format instead of the crappy one. I fixed that a while ago, so it won't crash kuromoji internally anymore.
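For reference, the lightweight format is one CSV line per entry; as far as I can tell it's surface form, segmentation, readings, then a part-of-speech tag. Made-up entries for the names from this thread:

```
沙夜,沙夜,サヨ,カスタム名詞
天ノ川,天ノ川,アマノガワ,カスタム名詞
もふもふ,もふもふ,モフモフ,カスタム名詞
```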