explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Spell checker/corrector? #315

Closed xumx closed 6 years ago

xumx commented 8 years ago

Does spaCy use any text normalizer to resolve spelling errors? Are there any plans for it? Or do I need a separate step before passing the text string to spaCy?

honnibal commented 8 years ago

There's currently no text normalization or spelling correction in spaCy. We'd like to get this built, though.

kootenpv commented 7 years ago

What would be the recommended approach?

I'm thinking of first running nlp without the dependency parser, just tokenisation.

Then use some spell checker based on the vocab. Using ngram features would be great too, as would allowing a custom dictionary to be added (or some way to give more weight to our own dictionary).

To actually auto-correct, I guess we could use something like https://github.com/gfairchild/pyxDamerauLevenshtein , where the allowed distance grows with the length of the token.
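
The idea above (correct only unknown tokens, allow a larger edit distance for longer tokens, and prefer heavily weighted dictionary entries) could be sketched roughly as follows. This is a minimal pure-Python illustration, not the pyxDamerauLevenshtein API: the function names, the `max(1, len(token) // 4)` threshold, and the word-to-weight dictionary are all assumptions made for the example.

```python
def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
            # Adjacent transposition, e.g. "teh" -> "the"
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def max_distance(token):
    """Hypothetical threshold: the allowed distance grows with token length."""
    return max(1, len(token) // 4)


def correct(token, dictionary):
    """Return the best in-dictionary replacement, or the token unchanged.

    `dictionary` maps lowercase words to a weight (for example a frequency);
    a custom vocabulary can be favoured by giving its words a higher weight.
    """
    word = token.lower()
    if word in dictionary:
        return token
    limit = max_distance(word)
    best = None
    for candidate, weight in dictionary.items():
        dist = osa_distance(word, candidate)
        if dist > limit:
            continue
        # Prefer the smaller distance, then the larger weight
        key = (dist, -weight)
        if best is None or key < best[0]:
            best = (key, candidate)
    return best[1] if best else token
```

Scanning the whole dictionary for every unknown token is slow for a large vocabulary, so a real implementation would probably need an index of candidates; the sketch only illustrates the length-dependent threshold and the weighting.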

lucasjfriesen commented 7 years ago

I might be missing something entirely here, but I've been trying to understand how spaCy treats misspellings in its lemmatization/tokenization. As near as I can tell, the behavior right now is to take misspelled words and insert them into the list, bumping all following tokens down. This was pretty confusing when word.lemma was returning different values depending on whether or not my data contained misspellings.

For the work I'm doing, I don't want to correct the spellings, I just want to know that the misspellings are there and be able to extract them. From my end, a good first step might be to simply have misspellings/words not in the lemma lists be flagged as such in some way (optionally?). Am I totally out to lunch?

kootenpv commented 7 years ago

@lucasjfriesen For a simple temporary solution, I think you could just check whether the token is in the nlp.vocab.

lucasjfriesen commented 7 years ago

Good thought @kootenpv - Thanks! I'll see what I can work with that.

Edit: FWIW, to anyone else reading this: is_oov yields a bool answering "Is the word out-of-vocabulary?". Nice and easy.
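
For anyone who only wants to extract the suspicious tokens, a small helper along these lines might be enough. It is a sketch that relies only on the `text`, `is_alpha` and `is_oov` token attributes; the function name is made up for illustration, and the model name in the commented usage is an assumption.

```python
def out_of_vocabulary(tokens):
    """Return the text of alphabetic tokens flagged as out-of-vocabulary."""
    # is_alpha skips punctuation and numbers, which are not misspellings
    return [t.text for t in tokens if t.is_alpha and t.is_oov]


# Possible usage with spaCy (not executed here; the model name is an assumption):
#   import spacy
#   nlp = spacy.load("en_core_web_md")
#   print(out_of_vocabulary(nlp("This is spookie")))
```

What counts as out-of-vocabulary depends on the loaded model and the spaCy version, so it is worth checking the output on a few known-good words before relying on it as a misspelling flag.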

casraz commented 7 years ago

Any update on this? Is there going to be a context-aware spell checker for spaCy? Ideally, we'd like to provide our own context (training dataset).

thank you

ines commented 7 years ago

Quick update: This might be a nice use case for the new custom processing pipeline components and extension attributes introduced in v2.0!

pavillet commented 7 years ago

Adding to this: Hunspell is the most widely used spell checker and has Python bindings, which could be a good start: https://github.com/blatinier/pyhunspell

ines commented 6 years ago

@pavillet Thanks, this is a great suggestion! Just had a look at the API and felt inspired, so here's some untested, semi-pseudocode for a possible spaCy component:

Example using pyhunspell

import hunspell
from spacy.tokens import Token

class spaCyHunSpell(object):
    # Name under which the component appears in the pipeline
    name = 'spacy_hunspell'

    def __init__(self, dic_path, aff_path):
        # Load the Hunspell dictionary (.dic) and affix (.aff) files
        self.hobj = hunspell.HunSpell(dic_path, aff_path)
        # token._.hunspell_spell is filled in when the component runs
        Token.set_extension('hunspell_spell', default=None)
        # token._.hunspell_suggest is computed on access via the getter
        Token.set_extension('hunspell_suggest', getter=self.get_suggestion)

    def __call__(self, doc):
        # Record for every token whether Hunspell accepts its spelling
        for token in doc:
            token._.hunspell_spell = self.hobj.spell(token.text)
        return doc

    def get_suggestion(self, token):
        return self.hobj.suggest(token.text)

import spacy

nlp = spacy.load('en_core_web_sm')
# Named so that it does not shadow the imported hunspell module
hunspell_component = spaCyHunSpell('en_US.dic', 'en_US.aff')
nlp.add_pipe(hunspell_component)

doc = nlp(u"This is spookie")
assert [t._.hunspell_spell for t in doc] == [True, True, False]
suggestions = doc[2]._.hunspell_suggest
# ['spookier', 'spookiness', 'spook', 'cookie', 'bookie', 'Spokane', 'spoken']

Alternative ideas and inspiration

Relevant spaCy documentation

The relevant docs if anyone wants to take this on and build an extension package – would also be a great project for spaCy beginners!

tokestermw commented 6 years ago

Took a stab at it here: https://github.com/tokestermw/spacy_hunspell

Hardest part was installing hunspell since the pseudocode is correct :)

ines commented 6 years ago

@tokestermw Ah, this is really cool – can't wait to try it! Also, let me know if/when it's ready to be shared, so we can post it on Twitter and add it to the extensions on the resources page.

tokestermw commented 6 years ago

@ines I think it's mostly ready: https://github.com/tokestermw/spacy_hunspell/releases

Haven't thoroughly tested for various platforms and the installation may need some work but the plugin itself is straightforward.

I have a couple other ideas for plugins so will be working on that too.


ines commented 6 years ago

Just added it to the resources and shared it on Twitter 🎉 Will close this issue, since there's now a plugin and other ideas and suggestions further up in the thread.

Of course, this doesn't mean there can't be more than one spell checker for spaCy 😉 So if anyone was going to build their own, feel free to share it – it'd definitely be a great addition to our (still very small) collection of community plugins!

ufukhurriyetoglu commented 6 years ago

https://github.com/atpaino/deep-text-corrector may be helpful.

Best regards !

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.