Closed by David-Baron 1 year ago
Something like:
```python
class Speller:
    def __init__(
        self, lang="en", threshold=0, nlp_data=None, fast=False,
        only_replacements=False, ignore=None
    ):
        self.lang = lang
        self.threshold = threshold
        self.nlp_data = load_from_tar(lang) if nlp_data is None else nlp_data
        self.fast = fast
        self.only_replacements = only_replacements
        # avoid a mutable default argument; fall back to an empty list
        self.ignore = ignore if ignore is not None else []

        if threshold > 0:
            # print(f'Original number of words: {len(self.nlp_data)}')
            self.nlp_data = {k: v for k, v in self.nlp_data.items() if v >= threshold}
            # print(f'After applying threshold: {len(self.nlp_data)}')

    def existing(self, words):
        """{'the', 'teh'} => {'the'}"""
        return {word for word in words if word in self.nlp_data}

    def get_candidates(self, word):
        w = Word(word, self.lang, self.only_replacements)
        if self.fast:
            candidates = self.existing([word]) or self.existing(w.typos()) or [word]
        else:
            candidates = (
                self.existing([word])
                or self.existing(w.typos())
                or self.existing(w.double_typos())
                or [word]
            )
        return [(self.nlp_data.get(c, 0), c) for c in candidates]

    def autocorrect_word(self, word):
        """most likely correction for everything up to a double typo"""
        if word == "":
            return ""
        # leave ignored words untouched
        if word in self.ignore:
            return word
        candidates = self.get_candidates(word)
        # in case the word is capitalized
        if word[0].isupper():
            decapitalized = word[0].lower() + word[1:]
            candidates += self.get_candidates(decapitalized)
        best_word = max(candidates)[1]
        if word[0].isupper():
            best_word = best_word[0].upper() + best_word[1:]
        return best_word

    def autocorrect_sentence(self, sentence):
        return re.sub(
            word_regexes[self.lang],
            lambda match: self.autocorrect_word(match.group(0)),
            sentence,
        )

    __call__ = autocorrect_sentence
```
It's running, but no test is available (I don't know how to write one, and I have no time at the moment).
If a developer more qualified than me could write the unit test and open a PR, thank you.
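For what it's worth, a unit test for the proposed `ignore` parameter could look roughly like the sketch below. To keep it self-contained it uses a stripped-down stand-in for `Speller` (a toy frequency dict and a deliberately simplistic "drop one character" typo model), so the class name, data, and correction logic here are illustrative only, not the real autocorrect internals:

```python
# MiniSpeller: illustrative stand-in for the patched Speller above.
# nlp_data is a toy word -> frequency dict; the typo model only tries
# deleting one character, which is enough to exercise the ignore logic.
class MiniSpeller:
    def __init__(self, nlp_data, ignore=None):
        self.nlp_data = nlp_data
        # same fix as in the proposal: no mutable default argument
        self.ignore = list(ignore) if ignore is not None else []

    def autocorrect_word(self, word):
        if word == "" or word in self.ignore:
            return word  # ignored words pass through untouched
        if word in self.nlp_data:
            return word  # already a known word
        # toy typo model: candidates formed by dropping one character
        candidates = {word[:i] + word[i + 1:] for i in range(len(word))}
        existing = [(self.nlp_data[c], c) for c in candidates if c in self.nlp_data]
        return max(existing)[1] if existing else word


# Test: "thee" is normally corrected to "the"...
spell = MiniSpeller({"the": 1000})
assert spell.autocorrect_word("thee") == "the"

# ...but with ignore=["thee"] it is left alone.
spell_ignoring = MiniSpeller({"the": 1000}, ignore=["thee"])
assert spell_ignoring.autocorrect_word("thee") == "thee"
```

A real test for the library would do the same thing against `autocorrect.Speller` with a known dictionary word pair.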
Yeah, the code looks legit, but I also don't have the time to do the test and all.
Thanks for figuring this out. I'm leaving the issue open in case someone else has a similar use case.
Ah, I remembered there actually is a way to ignore words already (although a bit roundabout).
The `nlp_data` parameter lets you pass your own word-frequency dictionary. If you want to use the default dictionary but just ignore a few words, you can modify that dictionary: set some non-zero frequency for the words you wish to ignore. (I know, this is quite a hacky way to do it; your implementation is cleaner.)
@filyp Yes indeed, but I find that approach a little too cumbersome when the goal is to ignore only a few words. In addition, it modifies the base file, which is a problem when using autocorrect in several projects.
I didn't mean modifying files, but something like:

```python
spell = Speller()
spell.nlp_data.update(words_to_ignore_dict)
```
It's a possibility; the difference is that you have to subtract the ignored words from the nlp data, which I think means longer processing time. (not tested)
Hi, can you please help me add some words? The words below are not spell-checked correctly. I have also tried adding those words as per your code above, but it is not working. Can you add some words to your vocabulary, or suggest a tested method for adding these words?
metaverse kiyaverse metachamber metaroom
Thanks
@charlietiwari
I don't think this is related to this issue.
Besides, kiyaverse, metachamber, and metaroom are company-specific names, so it seems to me they will never be added to a dictionary. Metaverse, on the other hand, will certainly be added soon, as it is entering more and more into the common vocabulary of certain languages. Look in the readme https://github.com/filyp/autocorrect#custom-word-sets for how to correct your particular words.
Please create a new issue if the question does not match the current issue.
@David-Baron ah, no, you don't subtract but rather add them to this nlp_data. This way, they are treated as real words and not corrected. Re performance, modifying a dictionary is pretty efficient - it has complexity O(n) where n is the number of new entries (here, words). And you just do it once, during initialization of Speller. In later usage, there should be no increase in processing time, because dictionary lookup time doesn't depend on the number of items in the dictionary.
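To make that concrete: treating "ignore" as "pretend it's a known word" is just a one-time dictionary update, and membership lookups stay constant-time afterwards. A toy illustration with plain dicts (standing in for the real `Speller.nlp_data`; the names and frequencies are made up):

```python
# Toy frequency dictionary standing in for Speller.nlp_data.
nlp_data = {"the": 1000, "sry": 50}

def is_known(word, data):
    # Mirrors what Speller.existing() does: a membership test
    # against the frequency dict, which is O(1) per word.
    return word in data

assert not is_known("srai", nlp_data)   # unknown, so it would get "corrected"

# The workaround: add the words to ignore as if they were real words.
words_to_ignore_dict = {"srai": 1}      # any non-zero frequency will do
nlp_data.update(words_to_ignore_dict)   # one-time O(n) in the number of new words

assert is_known("srai", nlp_data)       # now treated as correct and left alone
```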
@filyp You are right, this works like a charm!
Hello. I can't find a function to ignore a word. In some cases we need one. An example in the English dictionary: srai -> sry, but I have to ignore it.