barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Wrong Corrections, Ignores High Frequency Words #68

Closed Fatima-Gh closed 3 years ago

Fatima-Gh commented 4 years ago

Hi,

I built a custom Arabic dictionary following FrequencyWords and the instructions in the docs. Because my corpus is quite small, there is a lot of variation in the word frequencies. The issue is that the spell checker does not choose the candidate with the highest frequency; instead it picks words with a much lower frequency. I'm not sure why, but it keeps choosing the same corrections even after I increased the frequency of the important words to widen the gap. Additionally, these wrong corrections have the same edit distance as the high-frequency words.
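A minimal sketch of how the candidate frequencies can be inspected, assuming a local dictionary file ar.json (the file name and the sample word are placeholders, not from the original report):

from spellchecker import SpellChecker

# Load only the custom dictionary; language=None skips the default English one.
spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('./ar.json')  # placeholder file name

word = "exampel"  # placeholder misspelling
# correction() should return the candidate with the highest word_probability();
# newer versions may return None from candidates() when there are none, hence the guard.
for candidate in spell.candidates(word) or []:
    print(candidate, spell.word_probability(candidate))
print("chosen correction:", spell.correction(word))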

barrust commented 4 years ago

@Fatima-Gh I am sorry you are having this issue. Can you provide a code sample that shows the issue that you are seeing?

MartenBE commented 3 years ago

Dear @barrust,

I have the same issue with the Dutch language:

#!/usr/bin/env python3

from spellchecker import SpellChecker

MAIL = """

Yo professor,

Wanner ist dade examn?

Mvg,
"""

print("-" * 80)
print(MAIL)
print("-" * 80)

errors = []

### Spell check

spell = SpellChecker()
spell.word_frequency.load_dictionary('./nl.json')

words = spell.split_words(MAIL)
print(words)

for word in words:
    probability = spell.word_probability(word)
    print(f"\"{word}\" has probability {probability}")
    if probability < 0.75:
        correction = spell.correction(word)
        candidates = spell.candidates(word)
        errors.append(f"\"{word}\" is flagged as misspelled. Did you mean \"{correction}\"? Other possible candidates: {candidates}")

print()

if errors:
    print("Detected errors:")

    for error in errors:
        print("*", error)
else:
    print("No detected errors")
--------------------------------------------------------------------------------

Yo professor,

Wanner ist dade examn?

Mvg,

--------------------------------------------------------------------------------
['yo', 'professor', 'wanner', 'ist', 'dade', 'examn', 'mvg', 'martijn']
"yo" has probability 1.3775971557763172e-05
"professor" has probability 5.475906982548897e-05
"wanner" has probability 5.72840157630039e-07
"ist" has probability 4.782937238464404e-07
"dade" has probability 6.618250364851907e-07
"examn" has probability 0.0
"mvg" has probability 1.1123109856893962e-08

Detected errors:
* "yo" is flagged as misspelled. Did you mean "yo"? Other possible candidates: {'yo'}
* "professor" is flagged as misspelled. Did you mean "professor"? Other possible candidates: {'professor'}
* "wanner" is flagged as misspelled. Did you mean "wanner"? Other possible candidates: {'wanner'}
* "ist" is flagged as misspelled. Did you mean "ist"? Other possible candidates: {'ist'}
* "dade" is flagged as misspelled. Did you mean "dade"? Other possible candidates: {'dade'}
* "examn" is flagged as misspelled. Did you mean "examen"? Other possible candidates: {'exam', 'examen', 'exman', 'exams'}
* "mvg" is flagged as misspelled. Did you mean "mvg"? Other possible candidates: {'mvg'}

I used the following list https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/nl/nl_full.txt which I converted to JSON as nl.json.

Especially the word "wanner" I would have expected to be corrected to "wanneer", as "wanner" is virtually non-existent in Dutch and "wanneer" is very common. "professor" is also a well-known, frequently used word.

barrust commented 3 years ago

Interesting. Can you try something for me? The way you set up the SpellChecker object means that you are loading the default English dictionary along with the Dutch dictionary. Can you see if this solves your issue?

from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('./nl.json')

# continue with your code checking
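To sanity-check which dictionaries ended up loaded, a quick membership test can help. A minimal sketch (the probe words are assumptions; subtitle corpora can contain English loanwords, so pick something unlikely to appear in Dutch text):

from spellchecker import SpellChecker

spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('./nl.json')

# Membership checks work via the `in` operator; a common Dutch word should be
# present, while an English-only word should not be if only nl.json is loaded.
print("wanneer" in spell)          # expected: True
print("notwithstanding" in spell)  # expected: False (unless it occurs in the subtitle data)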

If that doesn't solve the issue, then I would like to see how you built the dictionary so that I can replicate the file.

Thanks!

barrust commented 3 years ago

Also, just because something has a low probability (in your case < 0.75) doesn't mean that it is misspelled. In other words, if the word is in the dictionary, it is considered correctly spelled, regardless of how rare it is. This library does not attempt to check whether the word makes sense in context or is grammatically correct.
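For example, flagging misspellings via a membership check rather than a probability threshold might look like this (a minimal sketch reusing the nl.json dictionary and the word list from the example above):

from spellchecker import SpellChecker

spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('./nl.json')

words = ['yo', 'professor', 'wanner', 'ist', 'dade', 'examn', 'mvg']

# unknown() returns only the words that are not present in the dictionary,
# so words that are merely rare (low probability) are not flagged.
for word in spell.unknown(words):
    print(f'"{word}" looks misspelled, suggestion: {spell.correction(word)}')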

MartenBE commented 3 years ago

I think the file I generated from https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/nl/nl_full.txt contains some strange entries. I speak Dutch natively and some of the words I have never heard of. I am beginning to think that Dutch subtitles are not the best quality :p . I should cross-reference the list against a dictionary word list, but such lists don't seem to be that easy to find ...
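Such a cross-reference could look like the sketch below (the word-list file wordlist.txt is hypothetical; any plain one-word-per-line dictionary list would do):

import json

# Hypothetical plain word list, one word per line, e.g. from a Dutch dictionary project.
with open("wordlist.txt") as f:
    valid_words = {line.strip().lower() for line in f if line.strip()}

with open("nl.json") as f:
    word_frequency = json.load(f)

# Keep only entries that also appear in the reference word list.
filtered = {word: freq for word, freq in word_frequency.items() if word in valid_words}

with open("nl_filtered.json", "w") as f:
    json.dump(filtered, f, ensure_ascii=False, indent="")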

I used the following script to generate the JSON file:

#!/usr/bin/env python3

INPUT_FILENAME = "nl_full.txt"
OUTPUT_FILENAME = "nl_full.json"

with open(INPUT_FILENAME) as input_file, open(OUTPUT_FILENAME, "w") as output:
    print("{", file=output)

    lines = input_file.readlines()
    for i, line in enumerate(lines):
        # each line of the frequency list is "<word> <frequency>"
        (word, frequency) = line.split()

        print(f"    \"{word}\": {frequency}", end="", file=output)
        # every entry except the last needs a trailing comma
        if i < (len(lines) - 1):
            print(",", file=output)
        else:
            print("", file=output)

    print("}", file=output, end="")

barrust commented 3 years ago

@Fatima-Gh and @MartenBE

I hope you are both doing well. I have recently updated the "supported" dictionaries and wanted to see if you were ever able to resolve your issues here.

There are a few things I noticed that I want to make sure are clarified. Below is the basic code to generate a local dictionary from the FrequencyWords repo:

import json

# INPUT is the FrequencyWords text file; OUTPUT is the JSON dictionary to create.
word_frequency = dict()

with open(INPUT, "r") as f:
    for line in f:
        parts = line.split()
        word_frequency[parts[0]] = int(parts[1].strip())

with open(OUTPUT, 'w') as f:
    json.dump(word_frequency, f, indent="", sort_keys=True, ensure_ascii=False)

To then load the new dictionary without also loading the default English dictionary, you would do the following:

from spellchecker import SpellChecker

spell = SpellChecker(language=None, local_dictionary=OUTPUT)

I have noticed many errors with the FrequencyWords dataset. There are things you could try to clean up the dictionary, but subtitles in any language (especially open source) are not always the best!
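One simple cleanup, purely as a sketch (the minimum-frequency cutoff and the letters-only filter are assumptions, not a recommendation from the library):

import json
import re

with open("nl.json") as f:
    word_frequency = json.load(f)

# Drop very rare entries (often typos in subtitle data) and tokens containing
# digits or punctuation; both thresholds are arbitrary choices.
cleaned = {
    word: freq
    for word, freq in word_frequency.items()
    if freq >= 5 and re.fullmatch(r"[^\W\d_]+", word)
}

with open("nl_cleaned.json", "w") as f:
    json.dump(cleaned, f, ensure_ascii=False, indent="")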

Please let me know if there is anything else that can be done to help with this issue. Otherwise, I am going to close it out. Thanks!

Fatima-Gh commented 3 years ago

@barrust

I will definitely try this solution by the weekend and get back to you.

barrust commented 3 years ago

@Fatima-Gh and @MartenBE any updates? Hope I am able to help! I am planning on closing this issue if no updates. If you are still seeing the issue, please let me know!

MartenBE commented 3 years ago

My apologies for my late reply. There were indeed errors in the data set, which I resolved by fetching my data elsewhere. So the problem is not within this software.

barrust commented 3 years ago

Sounds great! Thank you for letting me know! Good luck!