barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Update dictionaries #75

Closed cmaureir closed 3 years ago

cmaureir commented 3 years ago

Hello! First, thanks for this project; it was really cool to discover it :tada:

I was playing around with the spell-checking functionality when I noticed I was getting unknown words that were quite "known to me", so I checked the dictionaries, and those words were not there.

Checking the README, I noticed that you base the word list on FrequencyWords, so I looked there for a couple of the words that were not found, and they were there.

In particular, I'm interested in the Spanish dictionaries, so I downloaded es_full.txt and created a .gz file, but when I saw the sizes... I didn't want to submit a PR with a new version of the es.json.gz file :fearful:

472K    es.json.gz
6.9M    es_full.json.gz
300K    es50k.json.gz

I noticed the current es.json.gz includes more words than the 50k version but fewer than es_full, so I would like to know what criterion is used to create the dictionaries. That way I could submit a PR with an updated file that is not as large as the 6.9M full version.

Thanks for your time :smile:
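The rule used to produce the current es.json.gz is not stated in this thread. As a minimal sketch of one possible way to get a file between the 50k and full versions, the snippet below keeps only words above a minimum count. It assumes FrequencyWords files hold one `word count` pair per line, separated by whitespace; the function names and the `min_count` threshold are hypothetical, not part of pyspellchecker.

```python
def parse_frequency_lines(lines):
    """Parse 'word count' lines into a {word: count} dict, skipping bad rows."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or malformed row
        try:
            count = int(parts[1])
        except ValueError:
            continue  # second column is not a number
        counts[parts[0]] = counts.get(parts[0], 0) + count
    return counts


def trim_by_count(counts, min_count):
    """Keep only words seen at least min_count times (hypothetical cutoff)."""
    return {word: n for word, n in counts.items() if n >= min_count}


# Made-up rows in the assumed FrequencyWords layout.
rows = ["de 1000", "que 800", "murciélago 12", "xqz 1", "", "broken line here"]
counts = parse_frequency_lines(rows)
trimmed = trim_by_count(counts, min_count=10)
```

Raising or lowering `min_count` trades dictionary coverage against file size, so the right value would have to be chosen by checking the resulting .gz size.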

barrust commented 3 years ago

Sorry this has taken me so long to get to; in essence, the dictionary is a JSON object with the word as the key and the count, or frequency, as the value.
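For illustration, a file in that shape can be written and read back with the standard library alone. This is a sketch under the assumption that the .json.gz files are plain gzip-compressed UTF-8 JSON; the helper names and the sample words are made up and are not pyspellchecker's own API.

```python
import gzip
import json
import os
import tempfile


def save_dictionary(word_counts, path):
    """Write a {word: count} mapping as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(word_counts, handle, ensure_ascii=False)


def load_dictionary(path):
    """Read a gzip-compressed JSON file back into a {word: count} dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round trip with a tiny, made-up Spanish sample.
sample = {"hola": 1200, "adiós": 340, "gracias": 980}
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
save_dictionary(sample, sample_path)
restored = load_dictionary(sample_path)
```

Passing `ensure_ascii=False` keeps accented words readable in the JSON instead of escaping them, which matters for languages such as Spanish.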

As for how I built them originally, I do not remember! I did find that compressing with gzip outside of Python made the files smaller than compressing them within Python.

One thing to look out for is that some of the words are likely erroneous and contain unusable characters.

For example, it could be the same word with different capitalization, incorrect ('s) endings as reported in #56, or other issues such as abbreviations.
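A rough sketch of that kind of cleanup might look like the following. The rules here are assumptions chosen for illustration: lowercase everything and merge the counts, fold a trailing `'s` into the base word, and drop anything that is not purely alphabetic. They are not the rules the project uses, and folding `'s` may be wrong for some languages.

```python
from collections import Counter


def clean_counts(counts):
    """Merge case variants, strip a trailing 's, and drop non-alphabetic words."""
    cleaned = Counter()
    for word, n in counts.items():
        word = word.lower()
        if word.endswith("'s"):
            word = word[:-2]
        if word.isalpha():  # str.isalpha accepts accented letters such as "ñ"
            cleaned[word] += n
    return dict(cleaned)


# Made-up raw counts showing each case the cleanup handles.
raw = {"Hola": 5, "hola": 20, "casa's": 2, "casa": 10, "abc123": 4, "niño": 7, "": 1}
cleaned = clean_counts(raw)
```

Note that `str.isalpha` also rejects hyphenated words and abbreviations with periods, which may or may not be desirable depending on the language.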

As always, any Pull Requests would be appreciated!

barrust commented 3 years ago

I am currently working on a script to rebuild the dictionaries so that it can be part of the repository. I am starting with the English dictionary and will then move on to Spanish. I am hoping that this will resolve this issue along with #65 and #56. Part of this is going back to the original source (http://opus.nlpl.eu/OpenSubtitles2018.php) to get the subtitles directly and clean them up.