AnySoftKeyboard / LanguagePack

[deprecated]
Apache License 2.0

extending dictionaries #352

Closed knezi closed 4 years ago

knezi commented 4 years ago

The Czech dictionary is barely usable because it is too small. Words are often present, but not in all their forms (for example, see https://en.wikipedia.org/wiki/Czech_declension), which makes word completion and autocorrection very hard to use.

The ASK Czech dictionary contains approximately 200K words, whereas the aspell dictionary generates almost 5M. I realise that aspell may be overgenerating, that is, producing words that do not actually exist (although I did not find any when briefly skimming through the words).

All in all, it seems we may be able to extend the dictionaries for many languages. This could improve usability a lot, especially for inflected languages.

The questions are:

Is 5 million entries too many? (aspell stores only the stem and then generates all forms)

Aspell is missing the frequencies of words, is that a problem?

Are there non-existent words in aspell? If so, is that such a big deal?

I can provide the data files and scripts for generating them if need be.
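To illustrate the point above about storing only the stem and generating all forms, here is a minimal Python sketch of stem-plus-suffix expansion. The rule table `SUFFIX_RULES`, its flag letters, and the helper names are hypothetical and do not reflect aspell's actual affix file format; they only show why a compact stem list can expand into a much larger word list.

```python
# Hypothetical suffix rules, loosely modelled on how affix-based
# spell-checker dictionaries describe inflection. This is NOT aspell's
# real affix format; the flags and suffixes are invented for illustration.
SUFFIX_RULES = {
    "A": ["", "s"],          # illustrative class: bare form + plural
    "B": ["", "ed", "ing"],  # illustrative class: verb endings
}


def expand(stem, flags):
    """Return the sorted surface forms a stem yields under its flags."""
    forms = set()
    for flag in flags:
        # Unknown flags contribute nothing instead of raising.
        for suffix in SUFFIX_RULES.get(flag, []):
            forms.add(stem + suffix)
    return sorted(forms)


def expand_dictionary(entries):
    """Expand (stem, flags) pairs into one de-duplicated, sorted word list."""
    words = set()
    for stem, flags in entries:
        words.update(expand(stem, flags))
    return sorted(words)


if __name__ == "__main__":
    stems = [("cat", "A"), ("walk", "B")]
    for word in expand_dictionary(stems):
        print(word)
```

With rich declension, each stem would carry many more suffixes than in this toy table, which is roughly how a stem list could grow into millions of generated forms.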

menny commented 4 years ago

At the moment, AnySoftKeyboard cannot serve so many words. I can't recall the limit, but it was set way back when phones were limited.

menny commented 4 years ago

We'll need to revise that code to support larger sets.

menny commented 4 years ago

Word frequency is very important for word suggestion. Do you think you can come up with a way to generate or guess the frequency of a word?
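One possible way to guess frequencies, sketched here under assumptions, is to count how often each dictionary word occurs in some plain-text corpus and scale the counts to a fixed range. The function name `word_frequencies`, the 1 to 255 target range, and the choice of corpus are all assumptions made for illustration, not something AnySoftKeyboard or aspell is confirmed to use.

```python
import re
from collections import Counter


def word_frequencies(corpus_text, wordlist, max_freq=255):
    """Estimate a frequency in 1..max_freq for each word in wordlist.

    Assumptions (hypothetical, for illustration only):
    - corpus_text is a plain-text sample of the language,
    - wordlist entries are already lowercase,
    - words missing from the corpus get the minimum frequency 1.
    """
    # Tokenise on runs of word characters; \w is Unicode-aware in Python 3.
    counts = Counter(re.findall(r"\w+", corpus_text.lower()))
    top = max((counts[w] for w in wordlist), default=0)

    result = {}
    for word in wordlist:
        count = counts[word]
        if top == 0 or count == 0:
            result[word] = 1
        else:
            # Linear scaling relative to the most frequent dictionary word.
            result[word] = max(1, round(count * max_freq / top))
    return result


if __name__ == "__main__":
    corpus = "pes pes pes pes les"
    print(word_frequencies(corpus, ["pes", "les", "slon"]))
```

Linear scaling is the simplest choice; a logarithmic scale might separate rare words better, but that is a design decision left open here.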

menny commented 4 years ago

Also, this repository will be closed soon. All source code is moving to https://github.com/AnySoftKeyboard/AnySoftKeyboard/ . Can you open this ticket there?

knezi commented 4 years ago

Closing in favour of https://github.com/AnySoftKeyboard/AnySoftKeyboard/issues/2005.