barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License
694 stars 101 forks source link

Higher Levenshtein Distance than 2 #72

Open fwollatz opened 4 years ago

fwollatz commented 4 years ago

It is not possible to correct Words, that have a higher Levenshtein-Distance than 2. (At least in German).

A parameter to change this would be much appreciated.

barrust commented 4 years ago

Currently I limit the Levenshtein-Distance to 2 for all languages, including German. The reason is that for longer words (say 10 characters or longer) getting past 2 creates a very long list of possibles to check and the system slows down. It is feasible to allow larger Levenshtein-Distances and I would appreciate any help or pull requests that make that possible.

maayanorner commented 4 years ago

I think it's possibly necessary to shift from creating all candidates to O(n^2) loop on the vocabulary with some efficient variant (Mbleven, for example) of edit distance. Then the complexity will be (~) a function that depends more on the vocabulary size rather than the edit distance.

Though, from experience - I have to say it does not scale well (but better than creating all candidates, at least for long words).