aarondandy / WeCantSpell.Hunspell

A port of Hunspell v1 for .NET and .NET Standard
https://www.nuget.org/packages/WeCantSpell.Hunspell/

Infix support for Kurdish language #39

Closed mhmd-azeez closed 11 months ago

mhmd-azeez commented 5 years ago

I am trying to make a spell checker for Kurdish. The problem is that Kurdish relies heavily on infixes (mostly because of clitic pronouns). I'd appreciate any guidance on the best approach for a language like that.

If Nuspell supports Infixes

That's great news, I'd rather create a Nuspell dictionary than a custom library on my own.

If Nuspell doesn't support Infixes and there aren't any reasonable ways to work around that limitation

I have noticed that Hunspell uses very little memory and is quite fast. So if I want to create a custom library for Kurdish, I want to know which algorithms Hunspell uses.

Here is a general idea of what I am trying to accomplish:

Consider this word: Bexshin (Forgiving). It can come in these forms:

Instead of a list of words, we can have a list of patterns like so: Bi{pronoun}bexsh{pronoun}

More Examples:

Eat (Dexo{pronoun})

Can be represented as:

Work (Kar{pronoun}dekrid)
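
To make the pattern idea concrete, here is a minimal sketch of expanding such a pattern into surface forms. It is only an illustration: the `PRONOUNS` list holds placeholder clitics rather than a vetted Kurdish inventory, the `expand` helper is hypothetical, and it assumes every `{pronoun}` slot can be filled independently.

```python
from itertools import product

# Placeholder clitics for illustration only; not a vetted list of Kurdish pronouns.
PRONOUNS = ["m", "t", "man", "tan"]

SLOT = "{pronoun}"


def expand(pattern, pronouns=PRONOUNS):
    """Return every form obtained by filling each {pronoun} slot.

    Assumption: slots are filled independently of one another.
    """
    parts = pattern.split(SLOT)      # fixed pieces around the slots
    slot_count = len(parts) - 1
    forms = []
    for combo in product(pronouns, repeat=slot_count):
        pieces = [parts[0]]
        for filler, fixed in zip(combo, parts[1:]):
            pieces.append(filler)
            pieces.append(fixed)
        forms.append("".join(pieces))
    return forms


if __name__ == "__main__":
    print(expand("Dexo{pronoun}"))
    print(len(expand("Bi{pronoun}bexsh{pronoun}")))
```

A pattern with two slots and four fillers expands to sixteen forms, so the number of forms grows quickly with the number of slots. That growth is why expanding only the closest patterns matters.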

So I need an algorithm that can very quickly tell me which patterns most closely match the input. Then I can expand only those patterns and return a list of suggestions ranked by Levenshtein distance to the input word.
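
A rough sketch of the ranking step, assuming the candidate forms have already been expanded into plain strings. The `levenshtein` and `suggest` names, the `max_distance` and `limit` parameters, and the sample words are all hypothetical. This is a brute-force illustration and not how Hunspell generates suggestions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(word, candidates, max_distance=2, limit=5):
    """Return up to `limit` candidates within `max_distance` edits, closest first."""
    scored = []
    for candidate in candidates:
        # Cheap pre-filter: the length difference is a lower bound on the distance.
        if abs(len(candidate) - len(word)) > max_distance:
            continue
        distance = levenshtein(word, candidate)
        if distance <= max_distance:
            scored.append((distance, candidate))
    scored.sort()  # by distance, then alphabetically
    return [candidate for _, candidate in scored[:limit]]


if __name__ == "__main__":
    forms = ["dexom", "dexot", "dexoman", "karmdekrid"]
    print(suggest("dexon", forms))
```

The length check skips the full distance computation for candidates that cannot possibly be close enough. A real implementation would likely need an index (for example over the fixed parts of each pattern) to avoid scanning every candidate.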

I know that I can read the source code, and I will. But it'd make my job much easier if you gave me a few leads on which algorithms can be useful based on your experience.

aarondandy commented 5 years ago

Interesting, I can't seem to find any Hunspell-compatible dictionary files for Kurdish out there. This project is just a port of the original Hunspell, which can be found at hunspell/hunspell. That said, if my assumptions are correct and you have a dotnet background, my port is going to be a lot easier to follow along with. I'll be honest, I don't completely understand how it all works, but let me see if I can dig up some clues for you.

So first up: I am totally ignorant of the language, but it appears to be written right to left, which may need some special treatment for complex affixes. Within Hunspell this seems to be referred to as a "Complex Affix" language, and it will set a ton of string reversals in motion. Another thing to consider: be sure to encode any files you create as UTF-8. It just makes everything so much simpler!

Regarding the Levenshtein distance, I don't know if exactly that is implemented for suggestions, but there is a whole lot of code that runs as part of the suggestions that is at least very similar in how it operates. It's not pretty, but it all starts around here: https://github.com/aarondandy/WeCantSpell.Hunspell/blob/master/src/WeCantSpell.Hunspell/WordList.QuerySuggest.cs#L504 . If you have a test runner that includes test coverage, such as NCrunch or the new Visual Studio test runner, you can use it to find tests that cover interesting areas and step through them. The test coverage is pretty decent and can be a huge aid in understanding how it all works.

Hope that helps. Getting a new language into Hunspell would be pretty cool.

aarondandy commented 5 years ago

Another thought: again, I'm no linguist and have no idea what I am talking about, but the German language may have some similarities in the way it forms what Hunspell refers to as "Compound Words".

mhmd-azeez commented 5 years ago

@aarondandy thank you very much for your reply. Your port is definitely a huge help for me. The problem is that there is not much documentation about Hunspell or about creating dictionaries for it. I'll take a look at the German dictionary to see what I can learn from it.

aarondandy commented 11 months ago

I'm not sure if you have come across this yet, but maybe it will help: https://github.com/sinaahmadi/KurdishHunspell . This issue is pretty stale and I'm not going to be much help with it, so I'm going to close it. If you are still trying to solve this, the larger community of users in https://github.com/hunspell/hunspell/issues will be more helpful than me.