barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Pyspellchecker corrects tokens that are not supposed to be corrected, and does not correct well some incorrect ones #107

Closed ljpetkovic closed 8 months ago

ljpetkovic commented 3 years ago

Hello, I am using pyspellchecker (French dictionary) in order to correct this file.

However, after running the script below, I have noticed two types of problems (output file):

  1. certain tokens (initially correct) were modified incorrectly (e.g. collection is corrected to colletion)
  2. some tokens (initially incorrect) were not corrected properly (e.g. MÉMOÎRES should be corrected to MÉMOIRES):
    
    import re, glob
    from spellchecker import SpellChecker

    entry = "6228000_r.txt"
    output = "6228000_r_corr.txt"

    spell = SpellChecker(language='fr')
    text = open(entry).read()

    # do not correct the tokens containing the apostrophes (ex : l’empire, d’art, s’étend...)
    r1 = re.findall(r"([lL]’\w+|[dD]’\w+|[sS]’\w+|[qQ]u’\w+|[cC]’\w+|[nN]’\w+|[jJ]’\w+|[Ll]orfqu’\w+|eft)", text)

    # tokenise the text with the pyspellchecker tokeniser
    tokens = spell.split_words(text)

    spell.word_frequency.load_words(r1)
    spell.known(r1)  # the words l’empire, d’art, s’étend etc. are now in the dictionary of known words

    print(tokens)
    misspelled = spell.unknown(tokens)

    with open(output, "w") as f:
        for m in misspelled:
            corrected = spell.correction(m)
            text = text.replace(m, corrected)
        # f.write(c.replace('clafliques', 'classiques'))
        f.write(text)
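
One thing that may be worth ruling out (an assumption, not something confirmed in this thread) is that `text.replace(m, corrected)` replaces every occurrence of `m`, including occurrences inside longer words. The sketch below substitutes whole tokens only. `replace_whole_words` and `toy_corrections` are hypothetical names used for illustration; in the script above, `correct` would be `spell.correction`.

```python
import re
from typing import Callable, Iterable, Optional


def replace_whole_words(
    text: str,
    misspelled: Iterable[str],
    correct: Callable[[str], Optional[str]],
) -> str:
    """Replace each misspelled token only where it occurs as a whole word."""
    for word in misspelled:
        fixed = correct(word)
        if not fixed or fixed == word:
            continue  # no usable suggestion, leave the token alone
        # The lookarounds stop the match from firing inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)")
        # A function replacement avoids backslash escapes being interpreted.
        text = pattern.sub(lambda _match, rep=fixed: rep, text)
    return text


# Hypothetical correction table standing in for spell.correction.
toy_corrections = {"ct": "t"}
sample = "la collection ct"

print(sample.replace("ct", "t"))                                # la colletion t
print(replace_whole_words(sample, ["ct"], toy_corrections.get))  # la collection t
```

Note that this only addresses substring replacement. It does not handle case differences between the tokens and the original text, and it does not change which suggestions the dictionary produces.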


I cleaned up the original `.txt` file by replacing the single quote (`'`) with the apostrophe (`’`) in words such as `l’empire`.
I also tried removing some other special characters (e.g. `^`, `&`, `<`, `>`), but the errors persist, and I cannot pinpoint what causes them.
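
For reference, the apostrophe clean-up described above can be sketched roughly as follows. This is a minimal illustration using only the standard library; `normalise_apostrophes`, `elided_tokens`, and the `ELISION` pattern are hypothetical names, and the pattern is a compact variant of the regular expression in the script, not an exact copy of it.

```python
import re

# Map the ASCII single quote to the typographic apostrophe (U+2019).
APOSTROPHE_MAP = str.maketrans({"'": "’"})

# Elided forms such as l’empire, d’art, s’étend, qu’il (and the OCR form lorfqu’...).
ELISION = re.compile(r"\b(?:[ldscnj]|qu|lorfqu)’\w+", re.IGNORECASE)


def normalise_apostrophes(text: str) -> str:
    """Return text with every ASCII single quote replaced by U+2019."""
    return text.translate(APOSTROPHE_MAP)


def elided_tokens(text: str) -> list:
    """Collect tokens containing an elision so they can be treated as known words."""
    return ELISION.findall(text)


cleaned = normalise_apostrophes("L'empire s'étend, d'art")
print(cleaned)
print(elided_tokens(cleaned))
```

The list returned by `elided_tokens` could then be passed to `spell.word_frequency.load_words`, as the script already does with `r1`.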

Do you have any idea how to resolve this issue?

barrust commented 3 years ago

This is likely an issue with the dictionary itself, stemming from the data source used to build it. I am not a French speaker, so I cannot really validate the data in the dictionary. You can see the script used to build the dictionary here, and how it is done and how it could be improved is covered in this discussion.

Any help with updating the dictionary would be appreciated.