barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Russian preprocess and code editing #96

Closed sviperm closed 3 years ago

sviperm commented 3 years ago

Hello! This PR contains a lot of code changes:
0) #95, #94
1) Add a preprocessing script for the Russian language, based on the 8 GB lenta.ru corpus
2) Add Cleaner and Pipeline classes to reduce repetitive code in build_dictionary.py.

These changes are still a work in progress; a lot remains to be done, such as docstrings, tests, etc.

I would like to hear your opinion on whether I should continue working on this. The code becomes more complex, but at the same time the logic is split across files. I prefer the new approach.

Logic behind Cleaner and Pipelines:

# we have a word frequency dictionary
word_freq = {...}
# instantiate the cleaner
cleaner = Cleaner(language='ru')
# `clean` mutates the original word_freq dictionary in place and returns the invalid words (aka misfits)
invalid_words = cleaner.clean(word_freq)

Inside the Cleaner instance, we choose one of our pipelines to preprocess the specific language.

Another question: can I use type hints in the final code? For example:
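
To make the idea above concrete, here is a minimal sketch of how a Cleaner could pick a per-language pipeline. It is not the code from this PR: the step functions `drop_short_words` and `drop_non_cyrillic`, the `PIPELINES` mapping, and the `Pipeline` constructor are hypothetical names chosen for illustration; only `Cleaner(language='ru')` and `clean(word_freq)` follow the usage shown above.

```python
# Hypothetical sketch: the step functions and the PIPELINES mapping are
# illustrative assumptions, not the implementation from the PR.

def drop_short_words(word_frequency):
    """Remove one-letter words in place and return them as misfits."""
    misfits = [word for word in word_frequency if len(word) < 2]
    for word in misfits:
        del word_frequency[word]
    return misfits


def drop_non_cyrillic(word_frequency):
    """Remove words with characters outside the lowercase Russian alphabet."""
    allowed = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")
    misfits = [word for word in word_frequency if not set(word) <= allowed]
    for word in misfits:
        del word_frequency[word]
    return misfits


class Pipeline:
    """Runs each step in order and collects the misfits they return."""

    def __init__(self, steps):
        self.pipeline = list(steps)

    def __call__(self, word_frequency):
        misfits = []
        for pipe in self.pipeline:
            misfits.extend(pipe(word_frequency))
        return misfits


class Cleaner:
    """Chooses the pipeline registered for the requested language."""

    PIPELINES = {
        "ru": Pipeline([drop_short_words, drop_non_cyrillic]),
    }

    def __init__(self, language):
        if language not in self.PIPELINES:
            raise ValueError("no pipeline for language: {}".format(language))
        self.pipeline = self.PIPELINES[language]

    def clean(self, word_frequency):
        # Mutates word_frequency in place and returns the removed words.
        return self.pipeline(word_frequency)


word_freq = {"привет": 10, "я": 7, "hello": 3, "мир": 5}
cleaner = Cleaner(language="ru")
invalid_words = cleaner.clean(word_freq)
```

In this sketch, adding a language would mean registering one more entry in `PIPELINES`, with each step remaining a plain function that takes the dictionary and returns the words it removed.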

def __call__(self, word_frequency: dict) -> list:
    misfits = []
    for pipe in self.pipeline:
        misfit = pipe(word_frequency)
        if misfit:
            misfits += misfit
    return misfits

I'm asking because I saw that the CI checks test compatibility with old Python versions. Feel free to criticize the code and ideas!

sviperm commented 3 years ago

Also, sorry for combining three issues in one PR. I realised it was a bad idea only after I had already pushed the commits to GitHub. I have little experience with open source PRs :)

barrust commented 3 years ago

So, I like the idea as a whole. My biggest issue is that it raises the barrier to entry for adding a new dictionary language. This is more engineering than I want to impose on people who want to add a new language, which is why I favored a simple script over an object-oriented approach.

As for type hinting, it is something I want to add, but I am not yet ready to stop supporting older versions of Python 3. I believe it becomes available once the next version of Python 3 is decommissioned, but I could be misremembering.
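
One possible middle ground, offered only as a sketch, is the comment-based form of type hints defined in PEP 484: the types live in a `# type:` comment, so the function signature itself uses no annotation syntax. The `Pipeline` constructor and the `drop_empty` step below are hypothetical; only the body of `__call__` mirrors the method quoted earlier in this thread.

```python
class Pipeline(object):
    def __init__(self, steps):
        # type: (list) -> None
        # Hypothetical constructor: the thread does not show how steps are stored.
        self.pipeline = steps

    def __call__(self, word_frequency):
        # type: (dict) -> list
        misfits = []
        for pipe in self.pipeline:
            misfit = pipe(word_frequency)
            if misfit:
                misfits += misfit
        return misfits


def drop_empty(word_frequency):
    # type: (dict) -> list
    """Hypothetical step: remove the empty-string key if present."""
    if "" in word_frequency:
        del word_frequency[""]
        return [""]
    return []


counts = {"": 1, "слово": 4}
removed = Pipeline([drop_empty])(counts)
```

Type checkers that follow PEP 484 can read these comments, while the interpreter treats them as ordinary comments.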

sviperm commented 3 years ago

I agree with you on all points. It was easy for me to just copy-paste the English cleaner and adapt it for Russian. Everybody has a different level of Python programming. I'll stash the changes for a better time.

I'll create another PR for #95 and #94. I'll also add generation of a file of misfit words, for analysis purposes (a new bool arg in arg_pars).

Also, the updated version of the Russian dictionary will be moved to a separate, clean PR.
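
As a rough sketch of what such a boolean argument could look like with the standard `argparse` module: the flag name `--save-misfits`, the helper `write_misfits`, and the output file name are assumptions made for illustration, since the thread only mentions "a new bool arg".

```python
import argparse
import os
import tempfile


def build_parser():
    """Build a parser with a hypothetical boolean flag for saving misfits."""
    parser = argparse.ArgumentParser(description="Build a word frequency dictionary")
    parser.add_argument(
        "--save-misfits",
        action="store_true",  # False unless the flag is given
        help="write the removed (misfit) words to a file for analysis",
    )
    return parser


def write_misfits(misfits, path):
    """Write one misfit word per line, sorted, as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in sorted(misfits):
            handle.write(word + "\n")


# Example run with the flag passed explicitly instead of reading sys.argv.
args = build_parser().parse_args(["--save-misfits"])
out_path = os.path.join(tempfile.mkdtemp(), "misfits.txt")
if args.save_misfits:
    write_misfits(["я", "hello"], out_path)
```

Keeping the file writing in a separate function means the cleaning step only has to return the list of misfits, and the script decides whether to save it.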