explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

POS Tagging is Broken for Sliced Pipelines #13222

Closed: lordsoffallen closed this issue 9 months ago

lordsoffallen commented 9 months ago

Hey everyone,

I'm trying to lemmatize a text which I cleaned earlier. Runtime was a problem, so I decided to cut certain pipeline components since I only wanted lemmas. When I enabled only the lemmatizer I got some warnings, and I also wanted to filter on POS tags such as ['ADJ', 'NOUN', 'VERB', 'ADV']. To generate the .pos_ attribute, I enabled the components that the documentation says are needed for it, the tagger and parser. However, using only those doesn't really work here, as I am not getting the expected POS tags. With the full pipeline I get the expected results, but not when I enable only certain components. Is this behaviour expected? If so, why? How do I know which components I can exclude? I am a bit confused now.

Thanks in advance!

How to reproduce the behaviour

Here is the code sample that doesn't work:

import spacy

nlp = spacy.load("en_core_web_sm", enable=["lemmatizer", "tagger", "parser", "attribute_ruler"])

text = """
If you like the taste of Sweet Low get this If you don t don t Couldn t get through one cup of coffee 
I m gonna give Stevia Extract in the Raw a try It s made by the folks at Sugar in the Raw Here s 
what they claim Stevia Extract In The Raw gets its delicious natural sweetness from Rebiana an 
extract from the Stevia plant This extract is the sweetest part of the plant and has recently
been isolated to provide pure sweetening power without the licorice like aftertaste that many 
of our predecessors exhibited All you get is the sweet flavor without any calories 
We ll see Simply Stevia is simply nasty
"""

print([t.pos_ for t in nlp(text)])

The one that works:

nlp = spacy.load('en_core_web_sm')
print([t.pos_ for t in nlp(text)])
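One possible explanation for the difference between the two snippets, offered as an assumption rather than a confirmed diagnosis: in en_core_web_sm the tagger and parser are understood to listen to a shared tok2vec component, and the enable list above leaves it out. The sketch below keeps tok2vec enabled and drops the parser, on the further assumption that tagger plus attribute_ruler are enough for .pos_. The names ENABLED, KEEP_POS, load_trimmed_pipeline and content_lemmas are hypothetical helpers for illustration, not part of spaCy. The filter is shown on made-up stand-in tokens so it can be tried without a model installed.

```python
from types import SimpleNamespace

# Coarse POS tags to keep, taken from the question above.
KEEP_POS = frozenset({"ADJ", "NOUN", "VERB", "ADV"})

# Assumption: the tagger listens to the shared "tok2vec" component in
# en_core_web_sm, so it must stay enabled for the tags to be meaningful.
ENABLED = ["tok2vec", "tagger", "attribute_ruler", "lemmatizer"]


def load_trimmed_pipeline(model="en_core_web_sm"):
    """Load the model with only the components assumed necessary for POS and lemmas."""
    import spacy  # imported lazily so the filter below also works without spaCy

    return spacy.load(model, enable=ENABLED)


def content_lemmas(tokens, keep=KEEP_POS):
    """Return the lemmas of tokens whose coarse POS tag is in `keep`."""
    return [t.lemma_ for t in tokens if t.pos_ in keep]


# Stand-in tokens with made-up values, only to exercise the filter.
fake_doc = [
    SimpleNamespace(lemma_="the", pos_="DET"),
    SimpleNamespace(lemma_="sweet", pos_="ADJ"),
    SimpleNamespace(lemma_="flavor", pos_="NOUN"),
    SimpleNamespace(lemma_="be", pos_="AUX"),
]
print(content_lemmas(fake_doc))  # ['sweet', 'flavor']
```

With spaCy and the model installed, content_lemmas(load_trimmed_pipeline()(text)) would apply the same filter to a real Doc. Comparing its output against the full pipeline is a quick way to check whether the assumption about tok2vec holds for the installed model version.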


Your Environment

svlandeg commented 9 months ago

Hi! Let me convert this to a thread on the discussion forum and answer you there!