explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Lemmatization inconsistent with hyphenated words #2449

Closed seanchrismurphy closed 5 years ago

seanchrismurphy commented 6 years ago

Not 100% sure if this is a feature or a bug, but this behavior seems undesirable. When nlp is applied to text containing hyphens, the token texts are correctly split (e.g. 'cyber-bullying' is broken into 'cyber', '-', and 'bullying'). However, the lemmas don't seem to match the individual tokens, almost as if the lemmas were computed on the pre-split text and then split afterwards.

For instance:

import spacy

# Assuming an English model, e.g. en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# This correctly lemmatizes to 'bully'
nlp('bullying')[0].lemma_

# This correctly says the text is 'bullying'
nlp('cyber-bullying')[2].text

# But this incorrectly says the accompanying lemma is still 'bullying'
nlp('cyber-bullying')[2].lemma_


seanchrismurphy commented 6 years ago

Update: I've realised that this is because spaCy's lemmatization is POS-dependent. When tagged with a space before it, spaCy seems to assume 'bullying' is a noun and thus doesn't lemmatize it, but when it's tagged as a verb (which seems to be the default when it appears on its own), it does get lemmatized. I suppose I can see why this behavior is intentional, but I wonder if it makes sense to have an option to lemmatize a word as if it had appeared on its own, so you don't end up with a corpus where some occurrences are lemmatized and others aren't.
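
For example, inspecting the predicted tags shows the difference (a minimal sketch; the en_core_web_sm model and the exact tag values are assumptions, and nlp.vocab.morphology.lemmatizer refers to the spaCy 2.x rule-based lemmatizer):

import spacy

nlp = spacy.load('en_core_web_sm')  # any English model should show the effect

# Standalone, 'bullying' is tagged as a verb and gets lemmatized.
token = nlp('bullying')[0]
print(token.pos_, token.lemma_)  # e.g. VERB bully

# After the hyphen, the same string is tagged as a noun,
# so the lemmatizer leaves it unchanged.
token = nlp('cyber-bullying')[2]
print(token.pos_, token.lemma_)  # e.g. NOUN bullying

# In spaCy 2.x the rule-based lemmatizer can be called directly
# with a forced POS, lemmatizing the word as if it stood alone.
print(nlp.vocab.morphology.lemmatizer('bullying', 'VERB'))  # ['bully']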

dataanalyst4lyfe commented 6 years ago

I think the best workaround for this would be to just add individual lexical entries for those special hyphenated words.
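
One way to do that (a minimal sketch, assuming spaCy 2.x, where tokenizer special cases may carry a LEMMA attribute; keeping the word as one token and the lemma 'cyber-bully' are illustrative choices):

import spacy
from spacy.attrs import ORTH, LEMMA

nlp = spacy.load('en_core_web_sm')  # model name is an assumption

# Keep the hyphenated word as a single token and pin its lemma,
# so it no longer depends on the tagger's POS guess.
nlp.tokenizer.add_special_case(
    'cyber-bullying',
    [{ORTH: 'cyber-bullying', LEMMA: 'cyber-bully'}],
)

print([(t.text, t.lemma_) for t in nlp('cyber-bullying')])
# [('cyber-bullying', 'cyber-bully')]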

honnibal commented 5 years ago

Merging with #3052 (inaccurate prediction master thread)

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.