Predelnik / DSpellCheck

Notepad++ Spell-checking Plug-in
GNU General Public License v2.0

Make suggestions better for multiple languages spell-checking. #21

Open georgthegreat opened 11 years ago

georgthegreat commented 11 years ago

I have the french_1990 and standard English (Great Britain) dictionaries installed. I write the single word l'été: http://slovari.yandex.ru/l%27%C3%A9t%C3%A9/%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4/

Here is what I see. The English dictionary fails to validate it (that's OK). The French dictionary validates it (that's OK). The joined English-French dictionary fails to validate it (that's NOT OK).

[Screenshots: untitled-1, untitled-2, untitled-3]

Predelnik commented 11 years ago

Thank you very much, it was an important, though very funny, bug :D

Basically the first problem was that the name of this dictionary wasn't resolved correctly, mostly because the zip file containing it is named "fr_FR-1990_1-3-2" while the dictionary itself is named fr_FR-1990.

It also somehow turned out that the items in this list box were sorted automatically, though they shouldn't be, and that sorting was case-insensitive while mine was case-sensitive, so basically the correspondence between dictionaries and check boxes was wrong.

Makes me think I should probably make my sorting case-insensitive too, though...

georgthegreat commented 11 years ago

Are you going to close this? Or are you waiting for me to test it?

Works well (I've downloaded the new version from some issue above).

georgthegreat commented 11 years ago

Not fully fixed. If I turn on all three dictionaries (via the multiple languages option), some words don't get any alternatives: ômbre, räie, maybe some more.

If only French is turned on, everything works fine (they are underlined and alternatives like ombré and raie are available).

georgthegreat commented 11 years ago

Sorry, the word räie seems to work fine.

Predelnik commented 11 years ago

It was actually caused by a totally different matter: at the very beginning I wrote it so that Hunspell would have a 100% hit on the Russian/English dictionary combination. Hunspell is much better at language guessing than my way of doing it, so that code could be safely removed. You can check it out at the usual link: http://goo.gl/OYqRO

georgthegreat commented 11 years ago

Still isn't working for me.

The word "developpé", which is of French origin, gets only English alternatives, though accents aren't used in English words.

Why do you need this language definition at all?

Predelnik commented 11 years ago

Well, the current way of guessing the language is to choose the one which has the most suggestions. For this word we get 2 suggestions for English and 2 for French, and since English comes first, it is selected as the current one.

A good way to solve this might be: if multiple languages are selected, show another menu item where you can select the language for this word, so all suggestions and adding to the dictionary would be for this language. It's probably better to do so for the current session only, 'cause saving a lot of stuff like that is a pain; at least it will make it possible to add such words to a dictionary and forget about them for the time being.
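A minimal sketch of the tie described above, assuming the guess is simply "the language with the most suggestions, first one wins on a tie". The function name, the language codes and the suggestion lists are hypothetical and are not taken from the plug-in's source.

```python
def guess_language(suggestions_by_language):
    """Return the language with the most suggestions.

    On a tie the language listed first wins, because only a strictly
    larger count replaces the current best (dicts keep insertion order).
    """
    best_language = None
    best_count = -1
    for language, suggestions in suggestions_by_language.items():
        if len(suggestions) > best_count:
            best_language = language
            best_count = len(suggestions)
    return best_language


# Hypothetical suggestion lists: two per language, as in the case above.
suggestions = {
    "en_GB": ["developed", "developer"],
    "fr_FR": ["développé", "développe"],
}
print(guess_language(suggestions))  # English is listed first, so it wins the tie
```

With equal counts the order of the dictionaries decides the result, which is why a French word can end up with English-only alternatives.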

georgthegreat commented 11 years ago

Isn't it possible to simply join the suggestions in one list?

Predelnik commented 11 years ago

It's possible, but there is a problem when there are a lot of suggestions: in which order to feed them to the list. That of course has a solution: just put one from the first language, one from the second and so on (if they have any at all) until the maximum is reached.

But there's still the problem of determining which language's user dictionary I should put the word into. Maybe it could be solved by making the "Add to Dictionary..." item a submenu with the selected languages as items, probably showing how many suggestions there are from each language (in parentheses).

OK, that seems like a good idea, most likely I'll do it))
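The round-robin merge described above could look roughly like this. It is only a sketch: the function name and the example lists are hypothetical, and skipping duplicates is an added assumption that the thread does not mention.

```python
from itertools import zip_longest


def interleave_suggestions(lists, maximum):
    """Take one suggestion from each language in turn until `maximum` is reached."""
    merged = []
    if maximum <= 0:
        return merged
    seen = set()
    # zip_longest pads exhausted lists with None, so shorter lists are simply skipped.
    for one_round in zip_longest(*lists):
        for word in one_round:
            if word is None or word in seen:
                continue  # this language ran out, or the word is already listed
            merged.append(word)
            seen.add(word)
            if len(merged) == maximum:
                return merged
    return merged


# Hypothetical per-language lists for the misspelling "reunis".
english = ["reunion", "reunions"]
french = ["réunis", "réuni", "réunit"]
print(interleave_suggestions([english, french], 4))
```

This keeps at least one entry from every language that has suggestions, but it says nothing about which entry is actually closest to the misspelled word.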

georgthegreat commented 11 years ago

Doesn't Hunspell have any kind of difference measure between words?

Are you sure that a non-unified user dictionary is required? LibreOffice/OpenOffice don't have such a feature, do they?

Predelnik commented 11 years ago

What do you mean by difference? If you mean something like a distance function between words, well, it's definitely not public; I could try to look for it though.

I don't know if it's 100% required, but it actually seems logical, since there are users who switch between languages to check the text rather than use multiple languages.

georgthegreat commented 11 years ago

Yep, the distance function is what I was talking about.

georgthegreat commented 11 years ago

Here is one more example of bad usability: when both English and French dictionaries are turned on, the word reunis suggests the English reunion but not the French réunis, which should be much closer to the original.

This might also be caused by wrong UTF-8 handling (réunis is something like r'eunis in UTF-8).
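To illustrate why encoding could matter for such a comparison, here is a small sketch showing that réunis has different lengths depending on whether it is counted in code points or UTF-8 bytes, and on whether é is stored precomposed or as e plus a combining accent. Whether any of this actually affects the plug-in is not established in this thread.

```python
import unicodedata

word = "r\u00e9unis"  # "réunis" with a precomposed é (U+00E9)

composed = unicodedata.normalize("NFC", word)    # é as one code point
decomposed = unicodedata.normalize("NFD", word)  # e followed by U+0301 (combining acute)

print(len(composed), len(composed.encode("utf-8")))      # code points, UTF-8 bytes
print(len(decomposed), len(decomposed.encode("utf-8")))  # code points, UTF-8 bytes
```

A distance computed per byte, or on the decomposed form, would count the accent as an extra inserted symbol rather than as one substituted letter.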

Predelnik commented 11 years ago

Btw, if it wouldn't bother you, you can check this preview of the next major version: http://goo.gl/OYqRO I used the Damerau–Levenshtein distance between words (case-insensitive); it isn't perfect but seems to be quite OK, though maybe I'll change some things later. All your example problems from this thread seem to be resolved, at least)

Different dictionaries for different languages are preserved for now, but the default is now separate dictionaries in single-dictionary mode and one big dictionary in multiple-dictionary mode (I haven't tested it thoroughly yet, though).

Also, not checking the word currently being typed, like in Firefox, was added in this version (as an option, but turned on by default).

georgthegreat commented 11 years ago

No problem. I'll look at it, but not right now. I think I'll post an answer in a couple of days.

georgthegreat commented 11 years ago

Seems that this update isn't working at all.

I entered the French word entree (correct is entrée). The list suggests:

  1. Entree
  2. en-tree
  3. en tree
  4. entere
  5. entre

Predelnik commented 11 years ago

It's working, but sadly all these words have an equal distance from entree, which is 1.
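The tie can be reproduced with a plain unit-cost Damerau–Levenshtein distance. The sketch below uses the restricted (optimal string alignment) variant and compares characters case-sensitively; both choices are assumptions, since the thread does not say which variant the plug-in uses, and lowercasing first would give Entree a distance of 0.

```python
import unicodedata


def osa_distance(a, b):
    """Restricted Damerau–Levenshtein distance with unit costs."""
    # Compare precomposed characters so that é counts as one letter.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


candidates = ["Entree", "en-tree", "en tree", "entere", "entre", "entrée"]
for word in candidates:
    print(word, osa_distance("entree", word))
```

Every candidate, including entrée, is a single edit away, so unit costs alone cannot rank the accented form above the others.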

georgthegreat commented 11 years ago

Hm... Then this metric (Damerau–Levenshtein) doesn't fit, does it? As far as I can see, replacing, inserting or deleting a single letter all have the same weight. That seems to be incorrect. Is it your implementation or some library function? Is it possible to edit the weights?

Predelnik commented 11 years ago

Most likely it's possible, but I need to look deeper; for now I just copy-pasted some algorithm for my needs))

Predelnik commented 11 years ago

Well, sadly, even if I change the cost of operations to make substitution the cheapest, there are 3 more words with the same distance as entrée: entrer, entres, entrez. Since I sort them alphabetically, entrée ends up last among them, while Hunspell manages to successfully place it first.

Well, I think it would be better to have correct weights for each letter (like making the exchange of similar or keyboard-adjacent letters the cheapest operation), but I'm not sure this is very easy to do.

Actually, I've had some ideas about slightly modifying the Hunspell source to let me merge its lists of suggestions; maybe I'll try that too.

Not sure how to test it all though; I only have some tests of common misspellings from the Aspell site, but they are not 100% reliable))
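One possible reading of "correct weights for each letter", sketched under stated assumptions: substituting a letter for its accented (or differently cased) form costs 0.5, any other edit costs 1.0. The cost values and function names are hypothetical, transpositions and keyboard adjacency are left out for brevity, and this is not how Hunspell or the plug-in ranks suggestions.

```python
import unicodedata


def base_letter(ch):
    """Strip diacritics and case: 'é' -> 'e', 'E' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def substitution_cost(x, y):
    if x == y:
        return 0.0
    if base_letter(x) == base_letter(y):
        return 0.5  # assumed cost: same letter, different accent or case
    return 1.0


def weighted_distance(a, b):
    """Levenshtein distance with an accent-aware substitution cost."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    previous = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        current = [float(i)]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1.0,                         # deletion
                current[j - 1] + 1.0,                      # insertion
                previous[j - 1] + substitution_cost(x, y), # substitution
            ))
        previous = current
    return previous[-1]


candidates = ["entrer", "entres", "entrez", "entrée"]
# Sort by distance first; alphabetical order only breaks remaining ties.
ranked = sorted(candidates, key=lambda w: (weighted_distance("entree", w), w))
print(ranked)
```

Here entrée is the only candidate with a distance below 1.0, so it moves to the front, while the other three still tie and fall back to alphabetical order.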

georgthegreat commented 11 years ago

I think there is no "correct" method; any algorithm would have exceptions.

Predelnik commented 11 years ago

Yeah, you're right, but with good statistics about common misspellings all this stuff could be optimized further and further to nearly perfect)) Well, at least it all deserves a little more attention from my side; thanks for an example where it all goes wrong)