Closed: ljpetkovic closed this issue 11 months ago
This is likely an issue with the dictionary itself, which depends on the data source used to build it. I am not a French speaker, so I cannot really validate the data in the dictionary. You can see the script used to build the dictionary here, and how it is built and how it could be improved are covered in this discussion.
Any help with updating the dictionary would be appreciated.
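Until the bundled dictionary is corrected, one local workaround is to patch the loaded word list in memory. The following is a minimal sketch, assuming a pyspellchecker `SpellChecker` instance that exposes `known()`, `word_frequency.remove_words()` and `word_frequency.load_words()`; `patch_dictionary` is a hypothetical helper, and the example words are illustrative only, not verified contents of the French dictionary.

```python
def patch_dictionary(spell, bogus=(), missing=()):
    """Patch a SpellChecker-like object in memory (nothing is written to disk).

    bogus   -- entries that should not count as valid words
    missing -- valid words that the dictionary lacks
    Returns the (removed, added) word lists.
    """
    # Only remove entries that are actually present.
    removed = [w for w in bogus if spell.known([w])]
    if removed:
        spell.word_frequency.remove_words(removed)

    # Only add words that are not already known.
    added = [w for w in missing if not spell.known([w])]
    if added:
        spell.word_frequency.load_words(added)

    return removed, added


# Assumed wiring with pyspellchecker (not executed here):
#   from spellchecker import SpellChecker
#   spell = SpellChecker(language='fr')
#   patch_dictionary(spell, bogus=['colletion'], missing=['collection'])
```

The patch lasts only for the lifetime of the `spell` object, so it has to be re-applied on each run; a permanent fix still requires rebuilding the dictionary upstream.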
Hello, I am using pyspellchecker (French dictionary) in order to correct this file. However, after running the script below, I have noticed two types of problems (output file):

1. correctly spelled words are changed into incorrect ones (collection is corrected to colletion)
2. misspelled words are left uncorrected (MÉMOÎRES should be corrected to MÉMOIRES):

    import re
    from spellchecker import SpellChecker

    entry = "6228000_r.txt"
    output = "6228000_r_corr.txt"

    spell = SpellChecker(language='fr')
    with open(entry) as f:
        text = f.read()

    # do not correct the tokens containing apostrophes (e.g. l’empire, d’art, s’étend...)
    r1 = re.findall(r"([lL]’\w+|[dD]’\w+|[sS]’\w+|[qQ]u’\w+|[cC]’\w+|[nN]’\w+|[jJ]’\w+|[Ll]orfqu’\w+|eft)", text)

    # tokenise the text with the pyspellchecker tokeniser
    tokens = spell.split_words(text)

    spell.word_frequency.load_words(r1)
    spell.known(r1)  # the words l’empire, d’art, s’étend etc. are now in the dictionary of known words

    print(tokens)
    misspelled = spell.unknown(tokens)

    for m in misspelled:
        corrected = spell.correction(m)
        if corrected:  # correction() can return None when no candidate is found
            text = text.replace(m, corrected)

    with open(output, "w") as f:
        f.write(text.replace('clafliques', 'classiques'))
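Independently of the dictionary contents, the replacement step itself may contribute to both symptoms. `str.replace` on the whole text also rewrites substrings inside other words, and, as far as I can tell, `unknown()` returns lowercased words by default, so an all-caps token in the text is never matched. Below is a minimal sketch of token-wise replacement that keeps the original capitalisation and skips apostrophe tokens. The names `TOKEN`, `match_case` and `correct_text` are hypothetical, and the spell checker is passed in as two plain callables so the sketch runs on a toy word list.

```python
import re

# One token = a run of word characters, optionally joined by apostrophes,
# so that "l’empire" stays a single token.
TOKEN = re.compile(r"\w+(?:[’']\w+)*")


def match_case(source, target):
    """Give `target` the capitalisation pattern of `source`."""
    if source.isupper():
        return target.upper()
    if source[:1].isupper():
        return target[:1].upper() + target[1:]
    return target


def correct_text(text, is_known, correct):
    """Correct `text` one token at a time.

    is_known(word) -> bool        : True if the lowercased word is valid
    correct(word)  -> str or None : suggested correction, None if there is none
    """
    def fix(match):
        token = match.group(0)
        lowered = token.lower()
        # Leave elided forms (l’empire, d’art, ...) and known words untouched.
        if "’" in token or "'" in token or is_known(lowered):
            return token
        suggestion = correct(lowered)
        if not suggestion:
            return token
        return match_case(token, suggestion)

    return TOKEN.sub(fix, text)


# Toy stand-ins for the dictionary and the corrector (assumptions for the demo).
toy_words = {"collection", "mémoires", "de"}
toy_fixes = {"mémoîres": "mémoires"}

sample = "MÉMOÎRES de l’empire, collection"
result = correct_text(sample, toy_words.__contains__, toy_fixes.get)
print(result)

# Assumed wiring with pyspellchecker (not executed here):
#   spell = SpellChecker(language='fr')
#   fixed = correct_text(text, lambda w: bool(spell.known([w])), spell.correction)
```

Because each replacement is confined to the token that triggered it, a correction can no longer leak into neighbouring words. With whole-text `str.replace`, for instance, correcting a stray token `ct` to `t` would turn `collection` into `colletion`. Whether this is what happened here depends on the tokens found in the input file.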