barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

English spellchecking #84

Closed mrodin52 closed 3 years ago

mrodin52 commented 3 years ago

Hello Team! I am new to the Project and I have a question.

I use Python 3.7 and ran into a problem with this test program:

from spellchecker import SpellChecker
spell = SpellChecker()                         
split_words = spell.split_words
spell_unknown = spell.unknown

words = split_words("That's how t and s don't fit.")
print(words)
misspelled = spell_unknown(words)
print(misspelled)

With pyspellchecker ver 0.5.4 the printout is:

['that', 's', 'how', 't', 'and', 's', 'don', 't', 'fit']
set()

So free-standing 't' and 's' are not marked as errors, and neither are the contractions.

If I change the phrase to:

words = split_words("That is how that's and don't do not fit.")

and use pyspellchecker ver 0.5.6 the printout is:

['that', 'is', 'how', 'that', 's', 'and', 'don', 't', 'do', 'not', 'fit']
{'t', 's'}

So contractions are marked as mistakes again.

(I read barrust's comment from Oct 22, 2019.)

Please assist.

barrust commented 3 years ago

So the issue is in the split_words() function. It uses a simple regex to pull out runs of contiguous letters, so that's becomes two words: that and s. Try splitting on whitespace instead of using the utility function.

Note that the contraction itself isn't marked as a mistake; the problem is that it is turned into more than one word. So don't becomes don and t; don is a real word in English but t is not. don't should be checked as is. The issue is that split_words() isn't preserving contractions.
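
For illustration, here is a small standard-library sketch of the difference described above. The `\w+` pattern is an assumption standing in for whatever split_words() does internally (the library's actual regex may differ), but it reproduces the token list reported in the question:

```python
import re

text = "That's how t and s don't fit."

# Assumption: a word-character-run pattern, similar in spirit to
# split_words(); the library's real regex may not be exactly this.
letter_runs = re.findall(r"\w+", text.lower())

# A plain whitespace split keeps contractions intact,
# but it also keeps trailing punctuation attached to words.
whitespace_tokens = text.lower().split()

print(letter_runs)
print(whitespace_tokens)
```

The first list breaks each contraction at the apostrophe; the second keeps "that's" and "don't" whole but leaves "fit." with its period.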

mrodin52 commented 3 years ago

I am afraid that is not a solution, since the text contains punctuation (see the last word in my example), and "fit." ends up in misspelled.

By the way, what is the difference between ver 0.5.4 and ver 0.5.6 that produced different spelling results?

barrust commented 3 years ago

You can see the information in the Change log as to the differences. The biggest are new dictionaries that attempt to fix these exact issues, a fix for python 3.9, and removing python 2.7 support.

As for how to parse your string, that isn't really this library's goal. The goal is to be simple to use, pure Python, and free of dependencies.

I used the NLTK WhitespaceTokenizer to build the dictionaries (non-Spanish). It is up to you to figure out how you would like to parse your text to make it testable. If there is a good method that can be used to update the simplistic split_words() function, then a PR would be greatly appreciated.

For your instance, perhaps something like this would work:

from spellchecker import SpellChecker
spell = SpellChecker()

words = "That is how that's and don't do not fit.".split()
misspelled = spell.unknown(words)  
# NOTE: this is based on a simple split. Up to the user to figure out what is best!
# This example is only dealing with trailing punctuation, not leading. 
for w in misspelled:
    if w.endswith((".", "?", ",", '"', "'", "!", "]", ")")) and w[:-1] in spell:
        # the word is not misspelled, it was punctuation!
        # likely, you would want to make sure there is
        # not more punctuation in a row, etc. But this is a
        # possible solution for your exact problem.
        print("({}) is not misspelled!".format(w))

mrodin52 commented 3 years ago

Understood. Thank you very much.

barrust commented 3 years ago

Perhaps something like this would work?

From StackOverflow:

import re

# raw string so that \w reaches the regex engine unchanged
rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

If this makes sense, I can update the basic split_words() function to do something like this.
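
As a rough sketch of how that pattern could be applied to the sentence from the original question: the helper name `tokenize_keep_contractions` is hypothetical and not part of pyspellchecker, and lowercasing the input is an assumption made to match the lowercase output shown earlier in the thread.

```python
import re

# Hypothetical helper, not part of pyspellchecker.
# A word is either: a word character, any mix of word characters and
# apostrophes, then a word character; or a single word character.
_WORD_RE = re.compile(r"\w[\w']*\w|\w")


def tokenize_keep_contractions(text):
    """Return lowercased words, keeping apostrophes inside words."""
    return _WORD_RE.findall(text.lower())


tokens = tokenize_keep_contractions("That is how that's and don't do not fit.")
print(tokens)
```

The resulting list keeps "that's" and "don't" whole and drops the trailing period, so it could then be passed to spell.unknown() in place of the output of split_words().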