explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Help with lemmatization, different results #3644

Closed: userFT closed this issue 5 years ago

userFT commented 5 years ago

I'm currently using spaCy with Python. The model is en_core_web_sm (2.1.0).

The following code is run to retrieve a list of "cleansed" words from a query:

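    import spacy

    nlp = spacy.load("en_core_web_sm")
    query = "processing of tea leaves"  # example query from this report
    doc = nlp(query)

    list_words = []
    for token in doc:
        # skip whitespace tokens; lemma_ is the string form of the lemma
        if token.text != ' ':
            list_words.append(token.lemma_)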

However, I face a major issue when running this code. For example, when the query is "processing of tea leaves", the result stored in list_words can be either ['processing', 'tea', 'leaf'] or ['processing', 'tea', 'leave'].

It seems that the result is not consistent. I cannot change my input/query (adding another word for context is not possible), and I really need to get the same result every time. I think the loading of the model may be the issue.

Why does the result differ? Can I load the model the "same" way every time? Did I miss a parameter that would give the same result for an ambiguous query?

Thanks for your help

DuyguA commented 5 years ago

Can you check the POS tags for the input sentences? Are the sentences correctly tagged?
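For anyone wanting to run that check, here is a minimal sketch. The helper names `describe_tokens` and `inspect_query` are made up for illustration; `token.text`, `token.tag_` and `token.lemma_` are the spaCy token attributes already used in this thread. It assumes the en_core_web_sm model is installed.

```python
def describe_tokens(doc):
    """Return (text, fine-grained POS tag, lemma) for each non-whitespace token.

    `doc` can be a spaCy Doc or any iterable of objects exposing
    `text`, `tag_` and `lemma_`.
    """
    return [(t.text, t.tag_, t.lemma_) for t in doc if not t.text.isspace()]


def inspect_query(query, model_name="en_core_web_sm"):
    """Hypothetical convenience wrapper: load the model and print one row per token."""
    import spacy  # imported lazily so describe_tokens works without spaCy installed

    nlp = spacy.load(model_name)
    for text, tag, lemma in describe_tokens(nlp(query)):
        print(f"{text!r:15} tag={tag:5} lemma={lemma!r}")
```

Calling `inspect_query("processing of tea leaves")` in a few fresh Python processes and comparing the printed rows should show whether the tag, the lemma, or both vary between runs.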

userFT commented 5 years ago

Hi @DuyguA, thank you very much for your answer. In both cases, for "processing of tea leaves", I got the following POS tags (using token.tag_): ['NN', 'NN', 'NNS'].

['processing', 'tea', 'leaf'] => ['NN', 'NN', 'NNS']
['processing', 'tea', 'leave'] => ['NN', 'NN', 'NNS']

It seems that the sentence is correctly tagged. I'm fine with either of those two results; I just want to consistently "hit" the same one, either 'leaf' or 'leave' (not sure if I made myself clear).
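If only a handful of ambiguous words matter, one possible stopgap is to pin their lemmas in a post-processing step, independent of what the model returns. The sketch below is not a spaCy feature: `PINNED_LEMMAS` and `stable_lemmas` are hypothetical names, and the choice of 'leaf' for "leaves" is an arbitrary assumption.

```python
# Hypothetical override table: surface form (lowercased) -> lemma to always use.
PINNED_LEMMAS = {"leaves": "leaf"}


def stable_lemmas(pairs, pinned=PINNED_LEMMAS):
    """Map (text, lemma) pairs to lemmas, replacing any pinned word's lemma.

    `pairs` would typically be built as
    [(token.text, token.lemma_) for token in doc].
    """
    return [pinned.get(text.lower(), lemma) for text, lemma in pairs]
```

With this, both variants reported above collapse to the same output, because the entry for "leaves" overrides whichever lemma the lemmatizer produced.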

BramVanroy commented 5 years ago

This seems to be the same as https://github.com/explosion/spaCy/issues/3484 and is fixed in PR https://github.com/explosion/spaCy/pull/3646.

userFT commented 5 years ago

Looks like it's working for me. Thanks a lot! When will it be included in a release?

BramVanroy commented 5 years ago

If this completely solved your issue, please close this topic so that we can focus our attention on open issues.
