barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Add dictionary for Russian language #91

Closed sviperm closed 3 years ago

sviperm commented 3 years ago

I used a similar dataset to create a word-frequency dictionary for the Russian language and added the archives to the scripts and resources folders. I tested this code and everything works fine.

from spellchecker import SpellChecker

spell = SpellChecker(language='ru')

# find those words that may be misspelled
misspelled = spell.unknown('отфильтруй по убыванию отсортируй по городу'.split())

for word in misspelled:
    print(word)

    # Get a list of `likely` options
    print(spell.candidates(word))

    # Get the one `most likely` answer
    print(spell.correction(word))

Unfortunately, you haven't written rules and examples for PRs, so I expect you'll handle the other stuff (writing tests, bumping versions, etc.). Anyway, feel free to ask any questions!

codecov-io commented 3 years ago

Codecov Report

Merging #91 (412edea) into master (0452991) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #91   +/-   ##
=======================================
  Coverage   99.21%   99.21%           
=======================================
  Files           4        4           
  Lines         255      255           
=======================================
  Hits          253      253           
  Misses          2        2           

Impacted Files                  Coverage Δ
spellchecker/spellchecker.py    99.08% <ø> (ø)

barrust commented 3 years ago

Thanks! This is amazing. You are correct, I haven't gotten around to documenting what to put into a PR or testing. I will do version updates, etc. Is there anything in particular about the dictionary that should be tested?

sviperm commented 3 years ago

Hello there, @barrust! Sorry for the late answer! Your code is great, and every function has an informative docstring. I wanted to change some code, but decided to ask your opinion first.

  1. Every clean_<language> function has these lines of code:

    # remove flagged misspellings
    with load_file(filepath_exclude) as fobj:
        for line in fobj:
            line = line.strip()
            if line in word_frequency:
                word_frequency.pop(line)
    
    # Add known missing words back in (ugh)
    with load_file(filepath_include) as fobj:
        for line in fobj:
            line = line.strip()
            if line in word_frequency:
                print("{} is already found in the dictionary! Skipping!".format(line))
            else:
                word_frequency[line] = MINIMUM_FREQUENCY

    I think this should be moved out into separate functions (DRY principle).

  2. I would like to change the command-line arguments in the _parse_args function. -p and -P use the same letter and differ only in case, which is easy to confuse. I suggest changing:

    • "-p", "--path" -> "-f", "--file_path"
    • "-P", "--parse_input" -> "-p", "--parse_input"
  3. Auto-create exclude.txt and include.txt if the files don't exist.
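
The three suggestions above could be sketched roughly as follows. This is only an illustration, not the project's actual script: the helper names (ensure_file, remove_excluded_words, add_included_words, build_parser) are hypothetical, plain open() stands in for the script's load_file, the MINIMUM_FREQUENCY value is a placeholder, and the argument help texts and the store_true action are assumptions.

```python
import argparse
import os

MINIMUM_FREQUENCY = 15  # placeholder value, assumed for illustration only


def ensure_file(path):
    """Suggestion 3: create an empty file at `path` if it does not exist."""
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8"):
            pass


def remove_excluded_words(word_frequency, filepath_exclude):
    """Suggestion 1: shared helper that removes flagged misspellings."""
    ensure_file(filepath_exclude)
    with open(filepath_exclude, encoding="utf-8") as fobj:
        for line in fobj:
            # pop with a default does nothing when the word is absent
            word_frequency.pop(line.strip(), None)
    return word_frequency


def add_included_words(word_frequency, filepath_include,
                       minimum=MINIMUM_FREQUENCY):
    """Suggestion 1: shared helper that adds known missing words back in."""
    ensure_file(filepath_include)
    with open(filepath_include, encoding="utf-8") as fobj:
        for line in fobj:
            word = line.strip()
            if not word:
                continue  # ignore blank lines
            if word in word_frequency:
                print("{} is already found in the dictionary! "
                      "Skipping!".format(word))
            else:
                word_frequency[word] = minimum
    return word_frequency


def build_parser():
    """Suggestion 2: option letters that no longer differ only by case."""
    parser = argparse.ArgumentParser(description="Dictionary builder (sketch)")
    parser.add_argument("-f", "--file_path",
                        help="path to the input file (assumed meaning)")
    parser.add_argument("-p", "--parse_input", action="store_true",
                        help="parse the input file (assumed meaning)")
    return parser
```

With helpers like these, each clean_<language> function would only need two calls, for example remove_excluded_words(word_frequency, filepath_exclude) followed by add_included_words(word_frequency, filepath_include).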

As for the dataset: I thought a test similar to test_spanish_dict for Cyrillic letters would be useful, but this is standard UTF-8 encoding, so I'll leave that up to you.
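
One possible shape for such a check, as a stdlib-only sketch: it assumes the dictionary is stored as gzipped UTF-8 JSON mapping words to counts, and the function names, file name, and sample frequencies are made up for illustration. It does not exercise SpellChecker itself, only the encoding round trip for Cyrillic keys.

```python
import gzip
import json
import os
import tempfile


def save_dictionary(word_frequency, path):
    """Write a word -> count mapping as gzipped UTF-8 JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as fobj:
        json.dump(word_frequency, fobj, ensure_ascii=False)


def load_dictionary(path):
    """Read a word -> count mapping back from gzipped UTF-8 JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as fobj:
        return json.load(fobj)


def check_cyrillic_round_trip():
    """Return True if Cyrillic keys survive a save/load cycle unchanged."""
    words = {"город": 120, "убыванию": 45, "ёлка": 7}  # made-up sample data
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ru_sample.json.gz")
        save_dictionary(words, path)
        loaded = load_dictionary(path)
    return loaded == words
```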

It would also be great to have information in the README about initializing different languages, like the code below.

SpellChecker(language='ru')
SpellChecker(language='en')
# etc

Thank you for merging the Russian dictionary. Maybe this week I'll prepare an even bigger one. This PR is fine as it is, so I suggest moving the discussion to the Discussions tab you've created, or to the Issues tab for the README and so on. :)

barrust commented 3 years ago

Thanks for the feedback! I will update the README.md with language information. As for the other ideas, go ahead and put an issue in for the args and auto create those files so that it can be tracked. Thanks!