spell.correction is taking way too long for each word

barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/

MIT License

So, I'm using the spell.correction(word) method to correct a batch of text files in Spanish.

After running some tests, I noticed that some texts take more than 60 seconds to correct, and that each misspelled word takes 5 to 15 seconds to correct with the spell.correction(word) method.

As you can imagine, for a large batch of texts this preprocessing bottleneck takes several hours instead of minutes or seconds.

I haven't inspected the code of that method, but I imagine this has something to do with ranking the Levenshtein distance between the misspelled word and the rest of the dictionary in order to choose the nearest neighbour.
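To make that guess concrete, here is a rough sketch of how the search space could grow if candidates are generated Norvig-style, by enumerating every string one or two edits away from the word. This is an assumption about the approach, not the library's actual code; `edits1`, `edits2`, and `SPANISH_ALPHABET` are names made up for this illustration.

```python
import string

# Assumed alphabet for Spanish (illustrative, not taken from the library).
SPANISH_ALPHABET = string.ascii_lowercase + "áéíóúüñ"


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one edit away from `word`.

    An edit is a deletion, an adjacent transposition, a replacement,
    or an insertion of one character from `alphabet`.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits2(word, alphabet=string.ascii_lowercase):
    """Return every string reachable by applying `edits1` twice."""
    return {e2 for e1 in edits1(word, alphabet) for e2 in edits1(e1, alphabet)}


if __name__ == "__main__":
    # Compare how many candidates distance 1 and distance 2 produce.
    for w in ("estan", "presora"):
        print(w, len(edits1(w, SPANISH_ALPHABET)), len(edits2(w, SPANISH_ALPHABET)))
```

If the method works along these lines, the distance-2 set is orders of magnitude larger than the distance-1 set, and it grows with both word length and alphabet size, which would fit the multi-second delays on misspelled words.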

Maybe there could be a way to use approximate KNN, or to put a time limit on the correction logic.
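As a stopgap on the caller's side, something like the wrapper below might help. It is only a sketch: `make_budgeted_corrector` is a hypothetical helper, not part of pyspellchecker. It caches results so each distinct word is corrected once, and it stops calling the corrector once a total time budget has been used up. It cannot interrupt a single slow call that is already running.

```python
import time


def make_budgeted_corrector(correct, budget_seconds):
    """Wrap a `correct(word)` callable with a cache and a total time budget.

    `correct` is any function mapping a word to its correction
    (for example a spell checker's correction method).
    """
    cache = {}
    spent = 0.0  # seconds spent inside `correct` so far

    def corrector(word):
        nonlocal spent
        # Words already seen cost nothing, so always serve them from the cache.
        if word in cache:
            return cache[word]
        # Budget exhausted: leave the word unchanged instead of searching.
        if spent >= budget_seconds:
            return word
        start = time.perf_counter()
        result = correct(word)
        spent += time.perf_counter() - start
        # Some correctors return None when nothing is found; keep the word then.
        cache[word] = result if result is not None else word
        return cache[word]

    return corrector


if __name__ == "__main__":
    # Stand-in corrector for demonstration only; swap in the real one.
    demo = make_budgeted_corrector(lambda w: w.lower(), budget_seconds=5.0)
    print([demo(w) for w in ["Estan", "Estan", "Pueblo"]])
```

With the checker from this issue, the wrapper would presumably be built as make_budgeted_corrector(spell.correction, budget_seconds=...). Constructing the checker with distance=1 instead of distance=2 may also shrink the search, at the cost of missing some corrections.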

You can replicate this issue using spell.correction(word) for each word of this text:

"no estar de acuerdo con la forma de militarizar la Araucanía, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido."

    from spellchecker import SpellChecker

    spell = SpellChecker(language='es', distance=2)  # loads default word frequency list
    sentence = 'no estar de acuerdo con la forma de militarizar la Araucanía, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido.'
    tokens = spell.split_words(sentence)
    for word in tokens:
        print(word, spell.correction(word))

spell.correction is taking way too long for each word #45