Darazaki / Spedread

GTK speed reading software: Read like a speedrunner!
GNU General Public License v3.0

Handling of compound words #5

Open gary-host-laptop opened 2 years ago

gary-host-laptop commented 2 years ago

Right now Spedread handles compound words poorly in my opinion. Phrasal verbs, for example, are split apart, which can cause the reader to misinterpret a word or get confused: "come out" is displayed as "come" and then "out". Another super common case is contractions with 's and 're, which are almost ubiquitous in English and cause a very annoying loss of time. This doesn't only affect English either. For example, the Japanese word for "sisters", 姉妹, is a compound of two kanji read しまい (shimai) using the onyomi readings, but Spedread would show it as 姉 ("older sister", with the kunyomi reading あね, ane) and then 妹 ("younger sister", with the kunyomi reading いもうと, imouto), which can cause issues.

I think this will probably take a lot of work to improve. It seems like it would need either a lot of code and some algorithm to interpret the words in context, or maybe it could be worked out by integrating dictionaries into the application, which it would parse to know how to separate the words. That could also make Spedread slow when inputting text, since it needs to analyze it first, but in my opinion that loading time is worth it compared to losing time while trying to improve reading speed.

gary-host-laptop commented 2 years ago

Sorry for creating so many issues in so little time. I know a lot of them will take a while if they ever get implemented; I just wanted to point out how things could be improved.

Darazaki commented 2 years ago

Don't worry about creating a lot of issues, I'm very grateful for that! :)

Darazaki commented 2 years ago

Right now this is a bit of a pain to implement because Spedread will need an understanding of compound words in various languages, not just some basic universal algorithm.

To add to this issue: numbers like "22.5" and "100,000" are getting split into several words, just like acronyms such as "U.S.A.". Those should be easier to fix for now, but having an understanding of compound words in at least some languages would be great.

gary-host-laptop commented 2 years ago

Uh, yeah, I didn't think about numbers, but it makes sense. What about having a dictionary with only compound words? Maybe that could be easier since it wouldn't be such a huge dictionary, and the rest of the words don't need one since they are normal words. Even though one for each language would be needed, I think it could be built over time, maybe through user input or something? But yeah, it's pretty hard to solve. :/

Darazaki commented 2 years ago

The dictionary is a good idea. I'll probably have a file where each line is a compound word, and that'll get translated into a Vala sorted array or maybe a binary search tree at compile time.

So something like this:

you're
come out
姉妹

Will be turned into something like this:

MyDictionary build_compound_word_dictionary () {
    var dict = new MyDictionary();
    dict.insert("you're");
    dict.insert("come out");
    dict.insert("姉妹");
    return dict;
}

Maybe the order in which words get inserted into the dictionary will be changed at compile time to save some time at runtime.

I'll need to experiment with this
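
To give a rough idea, a lookup over that generated, sorted list could end up looking something like this (everything below is just a sketch, and COMPOUND_WORDS / is_compound_word are made-up names, not actual Spedread code):

// Sketch only: a compile-time generated, sorted list of compound words plus a
// binary-search lookup over it.
const string[] COMPOUND_WORDS = {
    "come out",
    "you're",
    "姉妹"
};

bool is_compound_word (string candidate) {
    int lo = 0;
    int hi = COMPOUND_WORDS.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp (COMPOUND_WORDS[mid], candidate);
        if (cmp == 0) {
            return true;
        } else if (cmp < 0) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return false;
}

A binary search tree or radix tree would just replace the array; the idea of generating the structure from the word list at compile time stays the same.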

gary-host-laptop commented 2 years ago

Good idea, but it would be best practice to create individual dictionaries for each language and have some setting to choose from; otherwise the list would get enormous and it wouldn't be optimal. If you want to do it this way, let me know and I could try to find a list of compound words/acronyms to create the dictionary. When it comes to numbers, though, I think it would be better to program it so that "XX.XX" is recognized as a single "word" regardless of how many digits it contains, because there are too many possibilities to list. I don't know anything about Vala so I don't know how it could be achieved, but I'm sure there's a way to do it.

Darazaki commented 2 years ago

Yeah, having individual dictionaries is probably the way to go, especially considering the memory usage and lookup time (the dictionary will probably be a radix tree). Thanks for the suggestion! :)
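
Just to sketch the per-language part (the file layout and names here are invented, nothing is decided yet), each language could get its own plain-text word list that gets picked based on the selected language:

// Sketch only: one word-per-line dictionary file per language, e.g.
// "dictionaries/en.txt" or "dictionaries/ja.txt" (paths are hypothetical).
string[] load_compound_words (string language_code) throws FileError {
    string contents;
    FileUtils.get_contents (@"dictionaries/$language_code.txt", out contents);
    return contents.strip ().split ("\n");
}

Whether these lists end up compiled into the binary or loaded at runtime like this is still open; the point is just that each language keeps its own list.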

As for numbers, I'll probably implement a rule where any sequence of digits joined by ".", "," and/or non-newline whitespace counts as a single word.

Acronyms I'm not sure about. I'd like to have a rule like "single characters separated by . are a single word", but I don't know if there are languages that would use something other than the exact character ".".
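
For reference, those two rules could be prototyped with something like this (the regexes and function names are only illustrative, not what will actually ship):

// Sketch only: possible token rules for numbers and acronyms.
// is_number_token: digits optionally grouped by '.', ',' or non-newline
//                  whitespace, e.g. "22.5", "100,000", "1 000 000".
// is_acronym_token: single letters each followed by '.', e.g. "U.S.A.".
bool is_number_token (string token) {
    try {
        var number_re = new Regex ("^[0-9]+(?:[.,\\t ][0-9]+)*$");
        return number_re.match (token);
    } catch (RegexError e) {
        return false;
    }
}

bool is_acronym_token (string token) {
    try {
        var acronym_re = new Regex ("^(?:\\p{L}\\.)+$");
        return acronym_re.match (token);
    } catch (RegexError e) {
        return false;
    }
}

In practice the check would probably happen while scanning the input rather than over already-split tokens, but the patterns express the two rules above.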

Also, if you can manage to find a list of compound words, that'd be very helpful! If you want to submit a PR you can just put the dictionary/ies wherever; I'll sort through them later on :)

Right now I have a lot of exams coming up, so it'll take some time, but I promise I'll do all of this ASAP!