barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

spell.correction is taking way too long time for each word #45

Closed Vichoko closed 5 years ago

Vichoko commented 5 years ago

So, I'm using the spell.correction(word) method to correct a batch of text files in Spanish.

After running some tests, I noticed that some texts take more than 60 seconds to correct, and that each misspelled word takes 5 to 15 seconds to correct through the spell.correction(word) method.

As you can imagine, for a batch of many texts this bottleneck in the preprocessing takes several hours instead of minutes or seconds.

I haven't inspected the code of that method, but I imagine this has something to do with ranking the Levenshtein distance between the misspelled word and the rest of the dictionary in order to choose the nearest neighbour.

Maybe there could be a way to use an approximate KNN or to put a time threshold on the correction logic.

You can replicate this issue by calling spell.correction(word) for each word of this text:

"no estar de acuerdo con la forma de militarizar la Araucanía, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido."

barrust commented 5 years ago

I am sorry that you are having issues with how long it is taking to complete the corrections.

There are a few possible options that I would recommend:

  1. Set the distance to 1 instead of the default 2.
  2. You may want to de-duplicate the words that are being checked.

from spellchecker import SpellChecker

spell = SpellChecker(language='es', distance=2)  # loads default word frequency list

sentence = 'no estar de acuerdo con la forma de militarizar la Araucanía, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido.'
tokens = spell.split_words(sentence)

for word in tokens:
    print(word, spell.correction(word))

With this script, the run took about 5 to 10 seconds to complete. Setting the distance to 1 made it effectively instantaneous.
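
For the second suggestion (de-duplicating), here is a minimal sketch of one way to do it, not something built into the library. The helper name `correct_tokens` is hypothetical. It takes any correction function, so `spell.correction` can be passed in directly, and it calls that function at most once per distinct word. It also assumes the correction function may return None when it finds no candidate, in which case the original word is kept.

```python
from typing import Callable, Dict, Iterable, List, Optional


def correct_tokens(
    tokens: Iterable[str],
    correct: Callable[[str], Optional[str]],
) -> List[str]:
    """Correct every token, calling `correct` at most once per distinct word.

    `correct_tokens` is a hypothetical helper, not part of pyspellchecker.
    """
    cache: Dict[str, str] = {}  # word -> corrected word, filled lazily
    result: List[str] = []
    for word in tokens:
        if word not in cache:
            suggestion = correct(word)
            # Assumption: a None result means "no candidate found",
            # so fall back to the word as written.
            cache[word] = suggestion if suggestion is not None else word
        result.append(cache[word])
    return result


# Usage with the checker from the script above (requires pyspellchecker):
#
#   from spellchecker import SpellChecker
#   spell = SpellChecker(language='es', distance=1)
#   tokens = spell.split_words(sentence)
#   corrected = correct_tokens(tokens, spell.correction)
```

In the sample sentence, words such as "la", "de", "con" and "no" repeat, so each of them is looked up only once. Reusing the same cache across a whole batch of files should help more, since common words recur in every text.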