Hello, I am using pyspellchecker (French dictionary) in order to correct this file.

However, after running the script below, I noticed two types of problems in the result (output file):

- certain tokens that were initially correct were modified incorrectly (e.g. `collection` is corrected to `colletion`);
- some tokens that were initially incorrect were not corrected properly (e.g. `MÉMOÎRES` should be corrected to `MÉMOIRES`).

```
import re, glob
from spellchecker import SpellChecker

entry = "6228000_r.txt"
output = "6228000_r_corr.txt"

spell = SpellChecker(language='fr')
text = open(entry).read()

# do not correct the tokens containing the apostrophes (ex : l’empire, d’art, s’étend...)
r1 = re.findall(r"([lL]’\w+|[dD]’\w+|[sS]’\w+|[qQ]u’\w+|[cC]’\w+|[nN]’\w+|[jJ]’\w+|[Ll]orfqu’\w+|eft)", text)

# tokenise the text with the pyspellchecker tokeniser
tokens = spell.split_words(text)

spell.word_frequency.load_words(r1)
spell.known(r1)  # the words l’empire, d’art, s’étend etc. are now in the dictionary of known words

print(tokens)
misspelled = spell.unknown(tokens)

with open(output, "w") as f:
    for m in misspelled:
        corrected = spell.correction(m)
        text = text.replace(m, corrected)

    # f.write(c.replace('clafliques', 'classiques'))

    f.write(text)
```
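
One thing I have not ruled out is the `text.replace(m, corrected)` step itself. The sketch below uses only the standard library and made-up values (the token `lec` and its suggestion `le` are hypothetical, not taken from my actual run). It shows that `str.replace` works on substrings and is case-sensitive, which might explain both symptoms. The word-boundary regex is just one possible alternative, not something pyspellchecker does for you.

```python
import re

# Hypothetical values, not from the real run: suppose the checker
# flagged the fragment "lec" and suggested "le" for it.
text = "la collection des MÉMOÎRES"
m, corrected = "lec", "le"

# str.replace matches substrings, so a correct word gets altered too.
naive = text.replace(m, corrected)
print(naive)

# Restricting the match to whole words leaves "collection" untouched.
whole_word = re.compile(r"\b" + re.escape(m) + r"\b")
bounded = whole_word.sub(corrected, text)
print(bounded)

# str.replace is case-sensitive: a lowercased token never matches the
# uppercase original, so the misspelled word stays as it is.
unchanged = text.replace("mémoîres", "mémoires")
print(unchanged)
```

If the tokens returned for the unknown words are lowercased, the last case would apply to every capitalised word in the file.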



I cleaned up the original `.txt` file by replacing the single quote (`'`) with the apostrophe (`’`) in the words such as `l’empire`. 
I also tried to remove some other special characters (e.g. `^`, `&`, `<`, `>`), but the errors persist, and I cannot seem to locate exactly what causes them.
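
A minimal sketch of that cleanup step, assuming a regex-based approach. The helper name `clean_text` and the sample string are illustrative only and are not the exact commands used on the file.

```python
import re


def clean_text(raw: str) -> str:
    """Swap elision quotes for the typographic apostrophe and drop stray symbols."""
    # A straight quote between two word characters (l'empire) becomes ’.
    text = re.sub(r"(?<=\w)'(?=\w)", "’", raw)
    # Remove the special characters mentioned above: ^ & < >
    return re.sub(r"[\^&<>]", "", text)


sample = "l'empire d'art ^s'étend&"
print(clean_text(sample))
```

Quotes that are not surrounded by word characters on both sides are left alone, so ordinary quotation marks in the text are not converted.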

Do you have any idea how to resolve this issue?

barrust / pyspellchecker

Pyspellchecker corrects tokens that are not supposed to be corrected, and does not correct well some incorrect ones #107
