LaurensWeyn / Spark-Reader

A tool to assist non-native speakers in reading Japanese
GNU General Public License v3.0

Word splitter performance issues #9

Closed LaurensWeyn closed 7 years ago

LaurensWeyn commented 7 years ago

Perhaps it's best I write up my bigger planned changes publicly instead of keeping them hidden away inside a text file.

Anyhow, with the new deconjugator, things have gotten quite slow. Much of that is because the program assumes splitting is a fast operation. As this is no longer the case, some work may be needed to prevent splitting things from scratch every time.

The main changes I'm planning to make are these:

  1. Store how lines are split in the backlog, instead of just the original text.
  2. When the user places a manual split, all unchanged text before the split (or before the word intersected by the split) should be copied over from the last split.
  3. After reaching the split performed above, when the auto splitter reaches a split that again matches what was there before, the rest can again be copied from the last split operation.

This has the advantage of keeping furigana and selected definitions when adding splits, and keeping splits correct in the log, along with the performance changes.
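Changes 2 and 3 above could be sketched roughly like this. This is a hypothetical illustration, not Spark Reader code: `SplitCache`, `reuseBefore`, and the `[start, end)` span representation are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of change 2: when the user adds a manual split,
// reuse the previous segmentation for everything before the split point
// instead of re-splitting from scratch. A "word" here is just a
// [start, end) span into the original line.
public class SplitCache {
    // Copy over every cached word that ends at or before the manual split,
    // and return the position from which the auto splitter must rerun
    // (the end of the last reusable word).
    public static int reuseBefore(List<int[]> cachedWords, int splitPos,
                                  List<int[]> out) {
        int resumeFrom = 0;
        for (int[] w : cachedWords) {
            if (w[1] <= splitPos) {   // word ends before the split: keep it
                out.add(w);
                resumeFrom = w[1];
            } else {
                break;                // word intersects the split: resplit here
            }
        }
        return resumeFrom;
    }
}
```

Change 3 would then be the mirror image: once the auto splitter, resuming from `resumeFrom`, produces a boundary that matches the cached segmentation again, the remaining cached words can be copied over the same way.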

In addition, keeping the UI responsive during splitting could be a good idea, even if it's just a loading icon. Ideally, splitting would happen on a separate thread, words would appear in the UI as they are split, and the user could interact with them while splitting is still going, as if the VN were scrolling the text. That would be pretty hard to pull off though, and I have nowhere near enough free time to get that all working.
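The incremental-display idea could look something like the following sketch. All names (`AsyncSplitter`, `splitAsync`) are invented for illustration, and splitting is faked by breaking on spaces; in a Swing app the callback would additionally need to be marshalled onto the EDT with `SwingUtilities.invokeLater`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Illustrative sketch (not Spark Reader code): run the splitter off the UI
// thread and hand each word to a callback as soon as it is found, so the
// interface can display words incrementally instead of freezing.
public class AsyncSplitter {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    // Pretend "splitting" is just breaking on spaces; the real splitter
    // would do dictionary lookups and deconjugation per word.
    public Future<?> splitAsync(String line, Consumer<String> onWord) {
        return pool.submit(() -> {
            for (String word : line.split(" ")) {
                onWord.accept(word); // in Swing: SwingUtilities.invokeLater
            }
        });
    }

    public void shutdown() { pool.shutdown(); }
}
```

Even without the incremental display, moving the split onto a worker thread like this would at least let the UI keep painting a loading indicator.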

I'm holding back on these changes due to the work on Kuromoji currently being done, so I'm just sharing future plans for now.

wareya commented 7 years ago

Changes 2 and 3 will basically not work with kuromoji since the context near the end/beginning of the string affects how kuromoji segments text.

IMO, the "real" solution would be to make the deconjugator run forwards instead of backwards, but it's a bit hard to implement and test that to see if it's a real solution. There's also the problem that the deconjugator passes around so much memory right now.

Kuromoji doesn't seem to have a noticeable performance impact, mostly because it makes up for the segmentation pass by increasing the size of the chunks that the word splitter iterates over. But I have an overpowered CPU, so I can't really tell easily unless the performance problem is super bad.

EDIT: Actually for some reason kuromoji makes parsing /faster/.

LaurensWeyn commented 7 years ago

Yeah, I would think that kuromoji might even improve performance, though I haven't tested it.

An alternative to 2 and 3 would be to keep the words matched last time in a HashSet. For every candidate match, before putting it through the slow deconjugator and dictionary searches, we could check whether we already had that word last time.
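As a minimal sketch of that HashSet idea (names invented, not Spark Reader code):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical memo of which candidate strings matched during the previous
// split, so the expensive deconjugator/dictionary path can be skipped for
// anything already seen last time.
public class WordMemo {
    private Set<String> lastRun = new HashSet<>();
    private Set<String> thisRun = new HashSet<>();

    // Record the candidate for this run and report whether it already
    // matched in the previous run (in which case the slow path is skipped).
    public boolean seenLastTime(String candidate) {
        thisRun.add(candidate);
        return lastRun.contains(candidate);
    }

    // Call after a full split: this run's matches become the next run's cache.
    public void finishRun() {
        lastRun = thisRun;
        thisRun = new HashSet<>();
    }
}
```

Note this caches only membership, not the lookup results themselves; a `HashMap` from candidate to its dictionary entries would be the next step up.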

I'm not sure how you could make the deconjugator run forwards, since it can't really tell where the dictionary word ends and the conjugations start until it reverses them, though I could be wrong. It could certainly be less of a brute force. HashSets can't do wildcard searches like our brains can, but perhaps if it became a binary search tree of sorts, that could be possible.
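One concrete shape the "tree of sorts" could take is a character trie over dictionary entries: a forward pass can then walk the text one character at a time and discover every dictionary word starting at the current position, instead of guessing an end point and deconjugating backwards. A hedged sketch, with invented names and a toy API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical character trie over dictionary headwords. longestPrefix()
// is the simplest query; a real forward splitter would likely branch on
// *every* word found at the current position, not just the longest.
public class DictTrie {
    private final Map<Character, DictTrie> children = new HashMap<>();
    private boolean isWord = false;

    public void insert(String word) {
        DictTrie node = this;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new DictTrie());
        node.isWord = true;
    }

    // Length of the longest dictionary word that is a prefix of text,
    // or 0 if no dictionary word starts here.
    public int longestPrefix(String text) {
        DictTrie node = this;
        int best = 0;
        for (int i = 0; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) break;
            if (node.isWord) best = i + 1;
        }
        return best;
    }
}
```

This doesn't solve conjugations by itself, but it gives the forward pass the "where could a word end?" information that the backwards deconjugator currently has to brute-force.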

The performance right now on my Intel Atom laptop, on the master branch at least, is not too great, especially for larger blocks of text.

wareya commented 7 years ago

Try out the kuromoji branch, I want to know how it performs on your end. If it works well I'll make a priority of fixing up how I integrated having two build artifacts and make a pull request.

LaurensWeyn commented 7 years ago

It seems the slowdown I've been experiencing is mainly due to some issue with Epwing dictionary lookups during splitting: without Epwing dictionaries loaded, text appears near-instantly, as it did before. I've been assuming it was the deconjugator, since things have been slow ever since that change, but with university work piling up I never actually went back and thoroughly checked what was going on.

Kuromoji certainly isn't causing any performance issues as far as I can tell, though. I may do some measurements comparing System.nanoTime() before and after splitting text when I'm done looking into the Epwing issue, but I suspect it's faster than without, as it should result in fewer brute-force attempts.
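The measurement described above is just a nanoTime delta around the split call; a minimal sketch (the `Runnable` stands in for the real splitting call):

```java
// Time an operation with System.nanoTime(), which is monotonic and suited
// to elapsed-time measurement (unlike System.currentTimeMillis()).
public class SplitTimer {
    public static long timeMillis(Runnable splitCall) {
        long start = System.nanoTime();
        splitCall.run();              // e.g. the call that splits the line
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

For a fair comparison between the kuromoji and non-kuromoji paths, the same text would need to be split several times first to let the JIT warm up.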

wareya commented 7 years ago

Try with something pathological, too. Like 721 characters of やっぱり覚えてやがったな. I actually get a slowdown like that, even without an epwing dictionary.

LaurensWeyn commented 7 years ago

It seems that at some point, hasEpwingDef has ended up all over the text splitter. It was originally only supposed to match on the first user segment (so if a word wasn't in EDICT, you'd middle click the start of the word and it would find it in Epwing if available) since having it match on every possible segment would be far too slow. Yet, it seems to be matching on every possible segment now.
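The intended behaviour described above amounts to a guard like the following. This is a hedged sketch, not the actual hasEpwingDef call sites; `firstSection` mirrors the check mentioned later in the thread, and the other names are illustrative.

```java
// Hypothetical guard around the slow Epwing lookup: only the segment at
// the start of the user-chosen region may fall back to Epwing, and only
// when EDICT has no entry for it. Interior segments skip Epwing entirely.
public class EpwingGuard {
    public static boolean shouldTryEpwing(boolean firstSection, boolean inEdict) {
        return firstSection && !inEdict;
    }
}
```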

I still haven't found the time to go through all the changes in the text splitter you've made with the new deconjugator and now kuromoji. I should really get in there with the debugger and write up Javadocs for all those functions when I figure them out.

Also, I'm not sure how you managed to set up this basic/heavy thing. It seems they're projects within the project? And somehow IntelliJ downloaded kuromoji from Maven even though there's no Maven pom file. Not sure how that works, but it's pretty cool.

wareya commented 7 years ago

Ah, so that's how the epwing thing is supposed to work. The master branch still works that way, right? I removed the "start of phrase" check from the kuromoji branch since I just refactored the word splitter (I had to make a pretty major change to work around something), I can make it avoid using epwing more often if necessary.

Yeah, the basic/heavy stuff is a mess. I'm going to make it so that the normal spark reader artifact works basically the same way it used to, and the "heavy" module links it in directly. I'll make sure to comment more of my code while I'm at it.

IntelliJ is actually pretty good at dealing with Maven projects, yeah. I'm impressed.

LaurensWeyn commented 7 years ago

Well, after merging the recursive deconjugator, the master branch gained the bug. Before that merge, it should only attempt epwing lookup on the very first word and then any additional word that starts with a manual split.

That feature was really only there for the sake of completeness, for those who like their Epwing dictionaries (I don't see why; they don't even tell you basic things like which words are nouns or verbs, so words can't be deconjugated), and most words in Epwing show up in EDICT anyhow. As much as I dislike it, I still feel some functionality like that needs to be there for true Epwing integration.

wareya commented 7 years ago

I don't doubt that I might've broken it, I don't have an epwing dictionary set up in spark reader. Sure enough, my "initial segment" code doesn't have the firstSection check.