Closed: Vichoko closed this issue 5 years ago.
I am sorry that you are having issues with how long it is taking to complete the corrections.
There are a few possible options that I would recommend:
1. Set the distance parameter to 1 instead of the default 2:
from spellchecker import SpellChecker

spell = SpellChecker(language='es', distance=2)  # loads the default Spanish word frequency list
sentence = 'no estar de acuerdo con la forma de militarizar la Araucanía, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido.'
tokens = spell.split_words(sentence)
for word in tokens:
    print(word, spell.correction(word))
Using this method, it took about 5-10 seconds to complete. Setting the distance to one reduced that time to essentially instantaneous.
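For reference, here is the same snippet with the distance dropped to 1; only the constructor argument changes, everything else is identical to the example above:

from spellchecker import SpellChecker

# Same example, but with an edit distance of 1; this trades some recall on
# heavily misspelled words (two or more edits away) for a large speed-up.
spell = SpellChecker(language='es', distance=1)
sentence = 'no estar de acuerdo con la forma de militarizar la Araucanía, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido.'
tokens = spell.split_words(sentence)
for word in tokens:
    print(word, spell.correction(word))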
So, I'm using the spell.correction(word) method to correct a batch of text files in Spanish.
After running some tests, I noticed that some texts take more than 60 seconds to correct, and that individual misspelled words take 5 to 15 seconds each to go through spell.correction(word).
As you can imagine, for a batch of many texts this bottleneck makes the preprocessing take several hours instead of minutes or seconds.
I haven't inspected the code of that method, but I imagine it has something to do with ranking the Levenshtein distance of the misspelled word against the rest of the dictionary in order to choose the nearest neighbour.
Maybe there could be a way to use an approximate KNN search or to put a time limit on the correction logic.
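In the meantime, here is a rough sketch of a possible workaround for the batch case (just an idea, assuming the library's unknown() helper, which returns the tokens it cannot find in the dictionary): call correction() only on the words that spell.unknown() flags as misspelled, and cache the results so a misspelling repeated across files is only corrected once. Since most tokens in ordinary text are already in the dictionary, this should cut the number of expensive correction() calls considerably.

from spellchecker import SpellChecker

spell = SpellChecker(language='es', distance=1)
cache = {}  # misspelled token -> correction, shared across the whole batch

def correct_text(text):
    # Correct one text, touching only the words the dictionary does not know.
    tokens = spell.split_words(text)
    misspelled = spell.unknown(tokens)  # set of tokens not found in the dictionary
    corrected = []
    for word in tokens:
        key = word.lower()  # unknown() returns lowercased tokens by default
        if key in misspelled:
            if key not in cache:  # pay the correction cost only once per misspelling
                cache[key] = spell.correction(key) or word  # keep the original if nothing is found
            corrected.append(cache[key])
        else:
            corrected.append(word)
    return ' '.join(corrected)  # note: split_words drops punctuation, so this is a plain token stream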
You can reproduce this issue by calling spell.correction(word) on each word of this text:
"no estar de acuerdo con la forma de militarizar la Araucanía, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido."