Closed — seanchrismurphy closed this issue 5 years ago
Update: I've realised that this is because spaCy's lemmatization is POS-dependent. When tagged with a space before it, spaCy seems to assume 'bullying' is a noun, and thus doesn't lemmatize it; but when it's tagged as a verb (which seems to be the default if it appears on its own), it does get lemmatized. I suppose I can see why this behavior is intentional, but I wonder if it would make sense to have an option to lemmatize each word as if it had appeared on its own, so you don't end up with a corpus where some occurrences are lemmatized and others aren't.
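To illustrate the POS-dependence being described, here is a minimal pure-Python sketch (not spaCy's actual implementation — the `lemmatize` function and its suffix rules are hypothetical): the same surface form gets a different lemma depending on the POS tag it was assigned.

```python
# Hypothetical sketch of a POS-dependent lemmatizer: the same surface
# form maps to different lemmas depending on the assigned POS tag.

def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for `word` given its POS tag.

    Nouns are returned unchanged (mirroring the reported behavior);
    verbs have their '-ing' inflectional suffix stripped by a toy rule.
    """
    if pos == "NOUN":
        return word
    if pos == "VERB" and word.endswith("ing"):
        return word[:-3]
    return word

print(lemmatize("bullying", "NOUN"))  # -> 'bullying' (left as-is)
print(lemmatize("bullying", "VERB"))  # -> 'bully'
```

This is why a corpus ends up mixed: whether 'bullying' survives unchanged depends entirely on the tag the context happened to produce, not on the word itself.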
I think the best workaround for this would be to just add individual lexical entries for those special hyphenated words.
Merging with #3052 (inaccurate prediction master thread)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Not 100% sure if this is a feature or a bug, but this behavior seems undesirable. When nlp is applied to text containing hyphens, the text attribute correctly contains the separate tokens (i.e. 'cyber-bullying' is broken into 'cyber', '-', and 'bullying'). However, the lemmas don't seem to match the individual tokens, almost as if the lemmas were computed from the pre-split tokens and then split themselves.
For instance:
Info about spaCy