barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Does not pick :) or :') or :( as correct words even though they are in my custom text #67

Closed NomanSaleem4 closed 4 years ago

NomanSaleem4 commented 4 years ago

It does not pick :) or :') as correct words even though they are in my custom text. What could be the issue here? Thanks

barrust commented 4 years ago

Strange, the following code snippet shows that it is working. If this doesn't work for you, can you provide a minimal code sample that shows the issue?

from spellchecker import SpellChecker

emojis = [':)', ":')", ':(']
spell = SpellChecker(language=None)
spell.word_frequency.load_words(emojis)

print(spell.word_frequency.dictionary)

print(spell.known(emojis))
print(spell.unknown(emojis))

for itm in emojis:
    print("{} is in the dictionary: {}".format(itm, itm in spell))

NomanSaleem4 commented 4 years ago

Below is the piece of code:

def initialize_dictionary(data_string):
    """ This function will create an object of SpellChecker class and
        initialize/create a dictionary w.r.t. the data_string """
    spell = SpellChecker(language=None, case_sensitive=True, distance=2)
    spell.word_frequency.load_text(data_string)
    for word in data_string.split():
        if spell.word_probability(word) == 0.0:
            # here I am manually adding those words which are in the data_string
            # but not in the word frequency dictionary created by spell checker
            spell.word_frequency.add(word)
    return spell

data_string = """ listen tell me how is growth oil different from beard oil what about growth and beard oil hello just wanted to ask that is your growth oil same as beard oil? do yk what is difference between growth and beard oil i make muffins 😂 👏🏻 😡 😢 😜 😔 😒 🙌🏻 xd :') :'( :( :) """
spell = initialize_dictionary(data_string)
spell.candidates(":)")  # it will not give you :) in return, rather the closest match, probably "i"

barrust commented 4 years ago

I am confused about what your issue is. When I run the following code, I get the expected result:


from spellchecker import SpellChecker

def initialize_dictionary(data_string):
    """ This function will create an object of SpellChecker class and
        initialize/create a dictionary w.r.t. the data_string"""
    spell = SpellChecker(language=None, case_sensitive=True, distance=2)
    spell.word_frequency.load_text(data_string)
    for word in data_string.split():
        # print(word)
        if spell.word_probability(word) == 0.0:
            # here I am manually adding those words which are in the data_string
            # but not in the word frequency dictionary created by spell checker
            spell.word_frequency.add(word)
    return spell

data_string = """ listen tell me how is growth oil different from beard oil
what about growth and beard oil
hello just wanted to ask that is your growth oil same as beard oil?
do yk what is difference between growth and beard oil
 i make muffins
😂
👏🏻
😡
😢
😜
😔
😒
🙌🏻
xd
:')
:'(
:(
:) """

spell = initialize_dictionary(data_string)
print(spell.candidates(":)"))    # Returns {':)'} which is what is expected!

I have tried it in Python 3 and Python 2.7, and both work as expected. So what is the issue?

NomanSaleem4 commented 4 years ago

This is my code that resolves the issue. I added a check on the word probability: if it is 0.0, the spell checker did not pick up the word even though it is present in the text, so my check manually adds it to the word frequency dictionary. The point is, why doesn't the spell checker add :) and :( to the word frequency dictionary in the first place?

NomanSaleem4 commented 4 years ago

That is, while it is creating the dictionary.

barrust commented 4 years ago

Ah, ok. That makes sense. The confusion is that the default tokenizer (the function that parses a string of text into words) keeps only runs of word characters, so tokens made of punctuation, such as :), are dropped.

return re.findall(r"\w+", text.lower())
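
To make that concrete, here is a small standalone sketch that applies the same pattern as the line quoted above to part of the sample text. The function name `default_style_tokenize` is only an illustrative stand-in, not the library's own function name.

```python
import re

def default_style_tokenize(text):
    # Same pattern as the line quoted above: lowercase the text and
    # keep only runs of word characters, so punctuation never survives.
    return re.findall(r"\w+", text.lower())

sample = "i make muffins xd :') :'( :( :)"
tokens = default_style_tokenize(sample)

print(tokens)          # ['i', 'make', 'muffins', 'xd']
print(":)" in tokens)  # False
```

Since the emoticons never come out of the tokenizer, they never reach the word frequency dictionary, which is consistent with `word_probability` returning 0.0 for them.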

You can pass your own tokenizer, or split the text yourself with str.split() and load the resulting list of words directly into the dictionary:

spell.word_frequency.load_words(data_string.split())

Hope this helps!
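
As a rough illustration of the "own tokenizer" route, the sketch below uses a whitespace-only tokenizer and counts the tokens with a plain `Counter`, just to show that the emoticons are preserved. The helper name `whitespace_tokenize` is hypothetical, and the `Counter` only stands in for the word frequency dictionary; it is not how the library stores its counts.

```python
from collections import Counter

def whitespace_tokenize(text):
    # Hypothetical helper: split on whitespace only, so tokens such as
    # ":)" are kept intact instead of being stripped out.
    return text.split()

sample = "hello :) xd :( :)"
counts = Counter(whitespace_tokenize(sample))

print(counts[":)"])  # 2
print(counts[":("])  # 1
```

A function like this is the kind of thing that could be handed to `load_text`, assuming the installed version accepts a tokenizer argument; calling `load_words(data_string.split())` as shown above should have the same effect without a custom tokenizer.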

NomanSaleem4 commented 4 years ago

OK thanks
