filyp / autocorrect

Spelling corrector in python
GNU Lesser General Public License v3.0

Fine tuning and improving #17

Closed UdiKarpasBeyond closed 1 year ago

UdiKarpasBeyond commented 3 years ago

Hi, first of all, this looks great. Thanks a lot. I compared 3 different packages (yours, pyspellchecker, textblob) and yours performs best.

How can I improve performance? Is there a way to finetune this to a specific data set?

filyp commented 3 years ago

Ah, great! ^^ I was planning to do this comparison myself, so I'm glad you already did it. Out of curiosity, could you post some results from that comparison here?

As for the performance: if you mean speed, I'm not sure, since it's already pretty optimized. If you mean correction accuracy, I was thinking about adding a language model, so that it would decide how to correct based on context. It would be a great improvement, but pretty heavy, and it would take lots of work. As for finetuning, you can follow the instructions for adding new languages https://github.com/fsondej/autocorrect#adding-new-languages, but instead of running count_words on Wikipedia, run it on a text file with your data. So if it's in English, you would do:

from autocorrect.word_count import count_words
count_words('your_data.txt', 'en')

and then tar the output file:

tar -zcvf autocorrect/data/en.tar.gz word_count.json

This will replace the default English dictionary with yours. For best accuracy you should also experiment with different threshold values (Speller(threshold=x)) and see which value works best.

Note that it's not really finetuning but retraining on your data from scratch, so you need a lot of data.
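The threshold experiment mentioned above could be sketched roughly like this. It is only a sketch under assumptions: `pick_threshold`, `make_speller` and `labelled_pairs` are hypothetical names, and you would need your own list of (misspelled, expected) pairs taken from your data.

```python
def pick_threshold(make_speller, labelled_pairs, thresholds):
    """Try each threshold and return (best_threshold, accuracy_by_threshold).

    make_speller   -- hypothetical factory: takes a threshold, returns a
                      callable that corrects one string, for example
                      lambda t: Speller(threshold=t) with autocorrect
    labelled_pairs -- list of (misspelled, expected) strings from your data
    thresholds     -- iterable of threshold values to try
    """
    scores = {}
    for t in thresholds:
        spell = make_speller(t)
        # Count how many inputs are corrected to exactly the expected text.
        hits = sum(spell(typo) == expected for typo, expected in labelled_pairs)
        scores[t] = hits / len(labelled_pairs)
    # On ties, max() keeps the first threshold that reached the top score.
    best = max(scores, key=scores.get)
    return best, scores
```

With autocorrect installed, something like pick_threshold(lambda t: Speller(threshold=t), pairs, [0, 10, 100]) should work, where the candidate values are just placeholders to adjust for your data size.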

himanshudhingra commented 1 year ago

Hi @filyp, thanks for making it possible to replace the dictionary with one built from my own text file. Yours is also the only package I found that corrects words within whole sentences; every other one works on one word at a time.

However, I am getting an error when running tar on the output file. Screenshot attached. Can you please help?

Also, I see that it works fine up to changes = 2. How can I increase this? E.g. spell('NissSSan') returns 'Nissan' but spell('NissSSSan') returns 'NissSSSan'.

Help will be really appreciated.

[screenshot of the tar error]

filyp commented 1 year ago

More than 2 changes isn't supported, because that would be computationally expensive and the corrections would often be ambiguous.
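To give a feel for why a third change gets expensive, here is a minimal sketch of Norvig-style candidate generation. It is an assumption that this resembles what the library does internally; `edits1` and `edits2` are illustrative names, and the alphabet is restricted to lowercase ASCII.

```python
import string

LETTERS = string.ascii_lowercase


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


if __name__ == "__main__":
    word = "nissssan"  # lowercase form of 'NissSSan' from the question
    print(len(edits1(word)), "candidates within 1 edit")
    print(len(edits2(word)), "candidates within 2 edits")
    print("nissan" in edits2(word))  # two deletions are enough here
```

For a word of length n, edits1 builds 54n + 25 strings before deduplication, and every extra level of edits multiplies the work by roughly that factor again. That also matches the example: 'NissSSSan' needs three deletions to become 'Nissan', so it is out of reach with two changes.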

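tar is a shell command and needs to be run in a bash shell, not in the Python interpreter (google around to see how to use bash).

There is a trick, though, for running shell commands from inside IPython or a Jupyter notebook (it does not work in the plain Python interpreter): prefix the command with !, so: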

!tar -zcvf ...
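If no shell is available at all, the same archive can probably be built from plain Python with the standard library tarfile module. This is a sketch, not something the project documents: `pack_word_counts` is a made-up helper name, and the default paths simply mirror the tar command earlier in the thread.

```python
import os
import tarfile


def pack_word_counts(json_path="word_count.json",
                     archive_path="autocorrect/data/en.tar.gz"):
    """Write json_path into a gzip-compressed tar archive at archive_path.

    Roughly equivalent to: tar -zcvf <archive_path> <json_path>
    The directory that will hold the archive must already exist.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the file under its bare name, without any directory prefix.
        tar.add(json_path, arcname=os.path.basename(json_path))
```

Call pack_word_counts() from the directory that contains word_count.json, or pass explicit paths for both arguments.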