barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

How to add new language #26

Closed MukhtarShaima closed 5 years ago

MukhtarShaima commented 5 years ago

Could you please give me clear instructions or steps so that I can add the Urdu language? I am not able to download the Urdu file from the link you mentioned.

barrust commented 5 years ago

Sure, the steps for generating a new language are fairly straightforward: 1) Download the set of words that should be added to the dictionary.

If your data is in dictionary form, you can load it like so:

from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary(file_to_dictionary)
spell.export(location_for_export)

If you only have text files containing words, you can load those files directly and have spellchecker build the word frequency for you:

from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_text_file(path_to_text_file)
spell.export(location_for_export)
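
To make "build the word frequency" concrete, here is a minimal standard-library sketch of the same idea: count how often each word occurs in a text file and save the counts as a JSON object of word-to-count pairs. The helper names `count_words` and `write_frequency_json` are hypothetical, and the `\w+` tokenizer is an assumption that may not match the tokenizer spellchecker uses internally.

```python
import json
import re
from collections import Counter


def count_words(text):
    """Return a {word: count} mapping for the lowercased tokens in `text`."""
    # Assumption: a plain \w+ tokenizer. It is Unicode-aware in Python 3,
    # but it may split words that contain combining marks.
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))


def write_frequency_json(text_path, json_path, encoding="utf-8"):
    """Count the words in a text file and save the counts as JSON."""
    with open(text_path, encoding=encoding) as handle:
        counts = count_words(handle.read())
    with open(json_path, "w", encoding=encoding) as handle:
        # ensure_ascii=False keeps non-Latin words readable in the file.
        json.dump(counts, handle, ensure_ascii=False)
    return counts
```

A file written this way is in the dictionary form described above, so it could presumably be loaded with `load_dictionary` instead of going through `load_text_file`.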

Once you have exported the dictionary (really a word frequency list), you can then load that dictionary when you wish to use spellchecker:

from spellchecker import SpellChecker
spell = SpellChecker(language=None, local_dictionary=location_from_export)

MukhtarShaima commented 5 years ago

Thanks for the clear instructions; I have successfully loaded my text file. Now the problem is that it does not give me correct answers, e.g.:

for word in misspelled:
    # Get the one most likely answer
    print(spell.correction(word))

It should return the correct or most likely word, but sometimes it gives me a wrong word for the misspelled one, or it returns the whole misspelled string unchanged. Thank you.

barrust commented 5 years ago

That is likely due to one of a few possible issues.

1) You may not have frequency information, i.e., every word's count is set to 1 (or to the same value), in which case the candidates cannot be ranked meaningfully. Try something like:

# return those that are within the specified distance
print(spell.candidates(word))

2) If the edit distance between the word you are trying to correct and the intended word is greater than 2, then it will not work and the word will be returned as is.
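
To check whether the second point applies to a particular pair of words, the distance can be computed by hand. The sketch below is a standard-library implementation of the optimal string alignment distance (insert, delete, replace, and adjacent transposition each cost 1). The function name `edit_distance` is hypothetical, and this measure is only an approximation of how the library generates its candidates, so treat the result as a rough guide.

```python
def edit_distance(source, target):
    """Optimal string alignment distance between two strings."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement or match
            )
            # Adjacent transposition, e.g. "teh" -> "the"
            if (
                i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]
```

For example, `edit_distance("teh", "the")` is 1, while `edit_distance("kitten", "sitting")` is 3, so by the limit described above the second pair would be too far apart to correct.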

Honestly, I have never tried this with languages that use non-Latin characters, so I am unsure how it will perform.

barrust commented 5 years ago

@MukhtarShaima Let me know if you are still having issues, otherwise, I am going to close this one!

Thanks!

ryuzakinho commented 5 years ago

Hi,

From my understanding, we can load JSON formatted dictionaries or text documents that will be used for building the frequency list.

I would like to directly use the word frequency lists available here (Word Frequency): https://github.com/hermitdave/FrequencyWords/tree/master/content/2018/fi

These are txt files containing frequencies. Is there a way to directly load such files or do I need to convert them to JSON first?

Thanks for your help!
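
For reference, the conversion route can be sketched with the standard library alone. This assumes each line of the txt file holds a word and its count separated by whitespace, which is how the linked frequency lists appear to be laid out. The function name `convert_frequency_file` is hypothetical, and lines that do not fit the assumed layout are skipped.

```python
import json


def convert_frequency_file(txt_path, json_path, encoding="utf-8"):
    """Convert 'word count' lines into a JSON object of {word: count}."""
    counts = {}
    with open(txt_path, encoding=encoding) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank lines and unexpected layouts
            word, raw_count = parts
            try:
                count = int(raw_count)
            except ValueError:
                continue  # second field is not a whole number
            # Lowercase the word and merge counts of duplicates.
            key = word.lower()
            counts[key] = counts.get(key, 0) + count
    with open(json_path, "w", encoding=encoding) as handle:
        # ensure_ascii=False keeps non-ASCII words readable in the file.
        json.dump(counts, handle, ensure_ascii=False)
    return counts
```

The resulting JSON file is in the dictionary form mentioned earlier in the thread, so it should be loadable with `spell.word_frequency.load_dictionary(...)`.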