filyp / autocorrect

Spelling corrector in python
GNU Lesser General Public License v3.0
456 stars, 98 forks

Ignore word function #53

Closed David-Baron closed 1 year ago

David-Baron commented 1 year ago

Hello. I can't find a function to ignore a word, and in some cases we need one. For example, with the English dictionary, srai is corrected to sry, but I need it to be left unchanged.

David-Baron commented 1 year ago

Something like:

class Speller:
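    def __init__(
        self, lang="en", threshold=0, nlp_data=None, fast=False, only_replacements=False, ignore=None
    ):
        self.lang = lang
        self.threshold = threshold
        self.nlp_data = load_from_tar(lang) if nlp_data is None else nlp_data
        self.fast = fast
        self.only_replacements = only_replacements
        # Copy into a set: avoids the shared mutable default of `ignore=[]`
        # and makes the membership test in autocorrect_word a constant-time lookup.
        self.ignore = set(ignore) if ignore is not None else set()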

        if threshold > 0:
            # print(f'Original number of words: {len(self.nlp_data)}')
            self.nlp_data = {k: v for k, v in self.nlp_data.items() if v >= threshold}
            # print(f'After applying threshold: {len(self.nlp_data)}')

    def existing(self, words):
        """{'the', 'teh'} => {'the'}"""
        return {word for word in words if word in self.nlp_data}

    def get_candidates(self, word):
        w = Word(word, self.lang, self.only_replacements)
        if self.fast:
            candidates = self.existing([word]) or self.existing(w.typos()) or [word]
        else:
            candidates = (
                self.existing([word])
                or self.existing(w.typos())
                or self.existing(w.double_typos())
                or [word]
            )
        return [(self.nlp_data.get(c, 0), c) for c in candidates]

    def autocorrect_word(self, word):
        """most likely correction for everything up to a double typo"""
        if word == "":
            return ""

        # leave explicitly ignored words unchanged
        if word in self.ignore:
            return word

        candidates = self.get_candidates(word)

        # in case the word is capitalized
        if word[0].isupper():
            decapitalized = word[0].lower() + word[1:]
            candidates += self.get_candidates(decapitalized)

        best_word = max(candidates)[1]

        if word[0].isupper():
            best_word = best_word[0].upper() + best_word[1:]
        return best_word

    def autocorrect_sentence(self, sentence):
        return re.sub(
            word_regexes[self.lang],
            lambda match: self.autocorrect_word(match.group(0)),
            sentence,
        )

    __call__ = autocorrect_sentence

It runs, but there is no test available (I don't know how to write one and don't have the time at the moment).

If a more qualified developer could write the unit test and open a PR, that would be great. Thank you.
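
A unit test for the ignore list could look roughly like the sketch below. It is only an illustration: `ToySpeller` is a hypothetical stand-in that corrects words from a hard-coded table, so the test runs without the package or its word data. Against the real class, the same assertions would target the `ignore` parameter proposed above, which is a suggestion from this thread and not a released API.

```python
import unittest


class ToySpeller:
    """Hypothetical stand-in for Speller: corrects via a fixed lookup table."""

    def __init__(self, corrections, ignore=None):
        self.corrections = dict(corrections)
        # Copy into a set so instances never share a mutable default.
        self.ignore = set(ignore or ())

    def autocorrect_word(self, word):
        if word == "":
            return ""
        # Ignored words are returned untouched, as in the proposal above.
        if word in self.ignore:
            return word
        return self.corrections.get(word, word)


class IgnoreTest(unittest.TestCase):
    def setUp(self):
        # Toy data: the real corrections come from the frequency dictionary.
        self.corrections = {"srai": "sry", "teh": "the"}

    def test_ignored_word_is_left_alone(self):
        spell = ToySpeller(self.corrections, ignore=["srai"])
        self.assertEqual(spell.autocorrect_word("srai"), "srai")

    def test_other_words_are_still_corrected(self):
        spell = ToySpeller(self.corrections, ignore=["srai"])
        self.assertEqual(spell.autocorrect_word("teh"), "the")

    def test_nothing_is_ignored_by_default(self):
        spell = ToySpeller(self.corrections)
        self.assertEqual(spell.autocorrect_word("srai"), "sry")
```

Saved to a file, the tests can be run with `python -m unittest`. The three cases cover the behaviour that matters here: an ignored word passes through, other words are still corrected, and the default is to ignore nothing.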

filyp commented 1 year ago

Yeah, the code looks legit, but I also don't have the time to do the test and all.

Thanks for figuring this out. I'm leaving the issue open in case someone else has a similar use case.

filyp commented 1 year ago

Ah, I remembered there actually is a way to ignore words already (although it's a bit roundabout).

The nlp_data parameter lets you pass your own word frequency dictionary. If you want to use the default dictionary but just ignore a few words, you can modify that dictionary: assign some non-zero frequency to each word you wish to ignore. (I know, this is quite a hacky way to do it; your implementation is cleaner.)

David-Baron commented 1 year ago

@filyp Yes indeed, but I find that approach a bit tedious when the goal is to ignore only a few words. In addition, it modifies the base file, which is a problem when autocorrect is used in several projects.

filyp commented 1 year ago

I didn't mean modifying files but something like:

spell = Speller()
spell.nlp_data.update(words_to_ignore_dict)

David-Baron commented 1 year ago

That's a possibility. The difference is that you have to subtract the ignored words from the nlp array, which I think means a longer processing time (not tested).

charlietiwari commented 1 year ago

Hi, can you please help me add some words? The words below are not spell-checked correctly. I tried adding them with the code above, but it is not working. Could you add them to your vocabulary, or suggest a tested method for adding them?

metaverse kiyaverse metachamber metaroom

Thanks

David-Baron commented 1 year ago

@charlietiwari
I don't think it's about this issue.

In addition, kiyaverse, metachamber and metaroom are company-specific names, so it seems to me that they will never be added to a dictionary. Metaverse, on the other hand, will certainly be added soon, as it is entering the common vocabulary of certain languages more and more. See the readme at https://github.com/filyp/autocorrect#custom-word-sets for how to handle your particular words.

Please create a new issue if the question does not match the current issue.

filyp commented 1 year ago

@David-Baron ah, no, you don't subtract them but rather add them to this nlp_data. This way, they are treated as real words and not corrected. Re performance, updating a dictionary is pretty efficient: it has complexity O(n), where n is the number of new entries (here, words). And you do it just once, right after initializing Speller. In later usage there should be no increase in processing time, because dictionary lookup time on average doesn't depend on the number of items in the dictionary.
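
To make the mechanism concrete, here is a small self-contained sketch. It does not use the library: the `nlp_data` dict is a toy frequency table, and `existing` and `correct` are simplified stand-ins modelled on the `existing` and `get_candidates` methods quoted earlier in the thread, with the typo candidates passed in by hand. With the real package, the equivalent step is the `spell.nlp_data.update(...)` call shown above.

```python
# Toy frequency table standing in for Speller.nlp_data (the real one is far larger).
nlp_data = {"the": 1000, "sorry": 50}


def existing(words, vocabulary):
    """Keep only the words present in the vocabulary."""
    return {w for w in words if w in vocabulary}


def correct(word, vocabulary, typo_candidates):
    """Return the word itself if it is known, else the most frequent known candidate."""
    candidates = (
        existing([word], vocabulary)
        or existing(typo_candidates, vocabulary)
        or [word]
    )
    return max((vocabulary.get(c, 0), c) for c in candidates)[1]


# Before: "srai" is unknown, so a known candidate replaces it.
before = correct("srai", nlp_data, ["sorry"])  # "sorry"

# Add the words to ignore with any non-zero frequency: one O(n) update.
words_to_ignore = {"srai": 1, "metaroom": 1}
nlp_data.update(words_to_ignore)

# After: "srai" is found directly, so no candidates are even considered.
after = correct("srai", nlp_data, ["sorry"])  # "srai"
```

Because a known word short-circuits the candidate search, adding entries does not slow down later corrections; if anything, ignored words return sooner.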

David-Baron commented 1 year ago

@filyp You are right, this works like a charm!