There's currently no text normalization or spelling correction in spaCy. We'd like to get this built, though.
What would be the recommended approach?
I'm thinking of first doing an nlp parse without the dependency parser – just tokenisation.
Then, use a spell checker based on the vocab. Using n-gram features would be great too, as would allowing an additional custom dictionary (or some way to give more weight to our own dictionary).
To actually auto-correct, I guess something like https://github.com/gfairchild/pyxDamerauLevenshtein could be used, where the allowed edit distance grows with the length of the token.
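A rough, untested sketch of that last idea – a length-dependent edit-distance threshold on top of pyxDamerauLevenshtein – might look something like this (the vocabulary and the "one edit per three characters" heuristic are just placeholders):

from pyxdameraulevenshtein import damerau_levenshtein_distance

def max_allowed_distance(word):
    # placeholder heuristic: allow roughly one edit per three characters
    return max(1, len(word) // 3)

def suggest(word, vocabulary):
    # rank in-vocabulary candidates by Damerau-Levenshtein distance and
    # keep only those within the length-dependent threshold
    limit = max_allowed_distance(word)
    scored = sorted((damerau_levenshtein_distance(word, w), w) for w in vocabulary)
    return [w for dist, w in scored if dist <= limit]

print(suggest("spookie", ["spooky", "cookie", "spoke", "book"]))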
I might be missing something entirely here, but I've been trying to understand how spaCy treats misspellings in its lemmatization/tokenization. As near as I can tell, the behavior right now is to take misspelled words and insert them into the list, bumping all following tokens down. This was pretty confusing when word.lemma was returning different values depending on whether or not my data contained misspellings.
For the work I'm doing, I don't want to correct the spellings, I just want to know that the misspellings are there and be able to extract them. From my end, a good first step might be to simply have misspellings/words not in the lemma lists be flagged as such in some way (optionally?). Am I totally out to lunch?
@lucasjfriesen For a simple temporary solution, I think you could just check whether the token is in the nlp.vocab.
Good thought @kootenpv - Thanks! I'll see what I can work with that.
Edit: FWIW to anyone else reading this: is_oov yields a bool answering "Is the word out-of-vocabulary?". Nice and easy.
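For anyone else landing here, a minimal sketch of that check (the model name is just an example – what counts as in-vocabulary depends on the model you load, and models that ship word vectors give more meaningful results):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp(u"This sentance has a mispelling")
# collect the tokens the loaded vocab doesn't know about
misspelled = [t.text for t in doc if t.is_oov]
print(misspelled)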
Any update on this? Is there going to be a context-aware spell checker for spaCy? Ideally, we'd like to provide our own context (training dataset).
Thank you!
Quick update: This might be a nice use case for the new custom processing pipeline components and extension attributes introduced in v2.0!
Adding on to this: Hunspell is the most widely used spell checker, and it has a Python binding that could be a good starting point: https://github.com/blatinier/pyhunspell
@pavillet Thanks, this is a great suggestion! Just had a look at the pyhunspell API and felt inspired, so here's some untested, semi-pseudocode for a possible spaCy component:
import hunspell
from spacy.tokens import Token

class spaCyHunSpell(object):
    name = 'spacy_hunspell'

    def __init__(self, dic_path, aff_path):
        self.hobj = hunspell.HunSpell(dic_path, aff_path)
        # register the custom attributes on the Token
        Token.set_extension('hunspell_spell', default=None)
        Token.set_extension('hunspell_suggest', getter=self.get_suggestion)

    def __call__(self, doc):
        # set a bool on each token indicating whether Hunspell accepts it
        for token in doc:
            token._.hunspell_spell = self.hobj.spell(token.text)
        return doc

    def get_suggestion(self, token):
        # suggestions are computed lazily when the attribute is accessed
        return self.hobj.suggest(token.text)
import spacy
nlp = spacy.load('en_core_web_sm')
hunspell = spaCyHunSpell('en_US.dic', 'en_US.aff')
nlp.add_pipe(hunspell)
doc = nlp(u"This is spookie")
assert [t._.hunspell_spell for t in doc] == [True, True, False]
suggestions = doc[2]._.hunspell_suggest
# ['spookier', 'spookiness', 'spook', 'cookie', 'bookie', 'Spokane', 'spoken']
pirate/spellchecker: a spell checker extending Peter Norvig's, with multi-typo correction, Hamming distance weighting, and more. The relevant docs, in case anyone wants to take this on and build an extension package – it would also be a great project for spaCy beginners!
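In case it helps anyone picking this up, here's a rough, untested sketch of how a Norvig-style corrector could be wrapped as a v2.0 pipeline component – the extension name, the word-frequency source and the single-edit candidate generation are simplified assumptions, not a finished design:

from collections import Counter
from spacy.tokens import Token

class NorvigCorrector(object):
    name = 'norvig_corrector'

    def __init__(self, words):
        # word frequencies from any large plain-text corpus
        self.counts = Counter(words)
        Token.set_extension('norvig_suggest', getter=self.correct, force=True)

    def __call__(self, doc):
        # nothing to precompute – suggestions are generated lazily via the getter
        return doc

    def edits1(self, word):
        # all strings one edit (delete, transpose, replace, insert) away
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(self, token):
        word = token.lower_
        if word in self.counts:
            return word
        candidates = [w for w in self.edits1(word) if w in self.counts] or [word]
        return max(candidates, key=lambda w: self.counts[w])

# usage sketch (the word list source is just an example):
# words = open('big.txt').read().lower().split()
# nlp.add_pipe(NorvigCorrector(words))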
Took a stab at it here: https://github.com/tokestermw/spacy_hunspell
The hardest part was installing hunspell, since the pseudocode was correct :)
@tokestermw Ah, this is really cool – can't wait to try it! Also, let me know if/when it's ready to be shared, so we can post it on Twitter and add it to the extensions on the resources page.
@ines I think it's mostly ready: https://github.com/tokestermw/spacy_hunspell/releases
Haven't thoroughly tested it on various platforms, and the installation may need some work, but the plugin itself is straightforward.
I have a couple of other ideas for plugins, so I'll be working on those too.
👍
Just added it to the resources and shared it on Twitter 🎉 Will close this issue, since there's now a plugin and other ideas and suggestions further up in the thread.
Of course, this doesn't mean there can't be more than one spell checker for spaCy 😉 So if anyone was going to build their own, feel free to share it – it'd definitely be a great addition to our (still very small) collection of community plugins!
https://github.com/atpaino/deep-text-corrector may be helpful.
Best regards!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Does spaCy use any text normalizer to resolve spelling errors? Are there any plans for it? Or do I need a separate step before passing the text string to spaCy?