explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Punctuation interfering with Named entities extraction #2157

Closed avissens closed 6 years ago

avissens commented 6 years ago

I have some new issues with spaCy missing Person entities. Below are some examples where spaCy seems to go blind when the name is preceded by certain titles or punctuation. I reported a similar issue before and it was closed as fixed after a while, but I keep finding more inconsistencies. I commented #YES and #NO to indicate a different outcome with slight changes in punctuation or even names.

#text = u"In a recent tweet, PM Luciana Berger sought clarification..." #NO
#text = u"In a recent tweet, Labour MP James Mill sought clarification..." #YES
#text = u"In a recent tweet, Luciana Berger sought clarification..." #YES
#text = u'In a statement, the acting CEO of Cambridge Analytica, Dr Alexander Tayler, said' #NO
#text = u'In a statement, the acting CEO of Cambridge Analytica, Alexander Tayler, said' #YES
#text = u"The EU has recalled its ambassador to Russia, German Markus Ederer, for consultations" #NO
#text = u"The EU has recalled its ambassador to Russia German Markus Ederer for consultations" #YES
#text = u"The EU has recalled its ambassador to Russia, Markus Ederer, for consultations" #YES
#text = u"While she accepted her son needed some extra support, Ben's mum Beverly Gleeson, told the BBC" #NO
#text = u"While she accepted her son needed some extra support, Ben's mum Beverly Gleeson, told the BBC" #YES
#text = u"victims after the British. Will Kerr NCA director said" #Picks only "Kerr"
#text = u"Will Kerr, NCA director, said that" #Picks "Will Kerr"
#text = u"Victoria Atkins, Home Office minister" #YES
#text = u"Victoria Atkins Home Office minister." #NO - Removing commas - missing name
#text = u"some extra support, Ben's mum, Beverly Gleeson, told" #NO

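import spacy

nlp = spacy.load('en')  # assumes the spaCy 1.x English model is installed
all_tags = nlp(text)    # text: one of the examples above, uncommented

person_list = []
for ent in all_tags.ents:
    # keep only the entities labelled as persons
    if ent.label_ == "PERSON":
        person_list.append(ent.text)
print(person_list)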

I tried removing punctuation, but that affected other names (see the examples above), so it is not a workaround. I'm using version 1.8.2. Before updating to 2.0, I would really like to make sure these cases are fixed there, as my first attempt to migrate to 2.0 failed. Could you please advise?

macOS Sierra (10.12.3), Python 2.7, spaCy 1.8.2, Jupyter notebook
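
For anyone comparing many sentences across versions, a small harness along these lines may be easier than toggling comments by hand. This is only a sketch: `check_examples` and `fake_extract` are hypothetical names, and `fake_extract` is a stand-in that mimics two of the outcomes reported above so the snippet runs without a model. With spaCy loaded, the extractor would be a function that returns the text of every entity labelled PERSON.

```python
def check_examples(extract_persons, examples):
    """Run the extractor on each text and record whether the expected name was found."""
    report = []
    for text, expected in examples:
        found = extract_persons(text)
        report.append((text, expected, found, expected in found))
    return report


# Stand-in extractor (an assumption for illustration, NOT spaCy's behaviour).
# With a loaded pipeline it would be roughly:
#   lambda t: [ent.text for ent in nlp(t).ents if ent.label_ == "PERSON"]
def fake_extract(text):
    return [] if "PM " in text else ["Luciana Berger"]


examples = [
    (u"In a recent tweet, PM Luciana Berger sought clarification...", u"Luciana Berger"),
    (u"In a recent tweet, Luciana Berger sought clarification...", u"Luciana Berger"),
]

for text, expected, found, ok in check_examples(fake_extract, examples):
    print("YES" if ok else "NO", "-", text)
```

Keeping the expected name next to each sentence also makes it easy to rerun the same list after a model upgrade and see which cases changed.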

honnibal commented 6 years ago

The v2.0 NER is generally better. You can try the demo here: https://demos.explosion.ai

I don't really know what to tell you in general though --- context and vocabulary item both matter, so having the NER be sensitive to punctuation, titles etc is expected behaviour.

There's also no such thing as "fixed" here, at least not in the sense I think you're asking for. We might update a model and all the cases you're testing might go from green to red, even if our statistics show the model generally has improved. That's also within expected behaviour --- it's a statistical model. You might prefer creating a rule-based system if you want finer-grained control and more predictability. The Matcher class is useful for that.

We also have an annotation tool, https://prodi.gy , to help people fine-tune the models on their data. This isn't a silver-bullet either though: there's no way to guarantee how much or in what way you'll need to annotate data to improve the model's performance.
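
To illustrate the rule-based route mentioned above, here is a minimal sketch that uses plain regular expressions rather than spaCy's Matcher class, so it runs without any model. The title list, the pattern, and the function name `persons_after_titles` are all assumptions made for illustration; real data would need a longer list and more careful rules.

```python
import re

# Hypothetical title list for illustration; extend it for your own data.
TITLES = ["Dr", "Mr", "Mrs", "Ms", "MP", "PM"]

# A known title, an optional full stop, then two or more capitalised words.
NAME_AFTER_TITLE = re.compile(
    r"\b(?:%s)\.?\s+([A-Z][a-z]+(?:\s+[A-Z][A-Za-z]+)+)" % "|".join(TITLES)
)


def persons_after_titles(text):
    """Return capitalised word sequences that directly follow a known title."""
    return [m.group(1) for m in NAME_AFTER_TITLE.finditer(text)]


print(persons_after_titles(
    u"In a recent tweet, PM Luciana Berger sought clarification..."
))
```

A rule like this is predictable but crude: it would also return "Justice MacDonald" for "Mr Justice MacDonald", so it is probably best used to supplement the statistical entities rather than replace them.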

avissens commented 6 years ago

Thank you, yes, I'm trying to add more and more rules. But when it comes to the names themselves, I'm not sure I even understand the behaviour. Why does it pick James but not Luciana in the same context? I'd like to hear your thoughts.

a-


ines commented 6 years ago

"Why does it pick James but not Luciana in the same context? I'd like to hear your thoughts."

There's no clear answer for that – the model is statistical, so it's not saying "XY is a person". It's saying that based on the context, it's very likely that "XY" should have the label PERSON, based on what the model has seen so far.

How the model performs always depends on the data it was trained on. spaCy's English models were trained on a general-purpose corpus of news and web text, which gives pretty good accuracy overall for most regular text types. But it can still make mistakes when you feed it your own, very specific data that might be different from what the model has seen during training.

That's why it's important to always fine-tune the model on your own data. You can do this by adding rules or by updating the entity recognizer with more annotations and labelled examples.

avissens commented 6 years ago

Hi both, I am sorry for insisting on this, but after trying to tune the model for weeks I think I have identified one problem that persists in both versions 1.8 and 2.0. Basically, if a Person entity comes after a capitalized word, spaCy doesn't recognize it as such. At first I thought punctuation played the main role, but now I'm pretty sure it is the capitalization of the preceding word. I'm not sure how to create a rule for this. I'd really appreciate your suggestions.

text = u"In a recent tweet, labour mp Luciana Berger sought clarification..."#YES

text = u"In a recent tweet, Labour MP Luciana Berger sought clarification..."#NO

text = u'Former minister from Women and equalities Nicky Morgan said...' #YES

text = u'Former minister from Women and Equalities Nicky Morgan said...' #NO

text = u'the acting CEO of Cambridge Analytica, dr Alexander Tayler, said...' #YES

text = u'the acting CEO of Cambridge Analytica, Dr Alexander Tayler, said...' #NO

text = u'victims after the british. Will Kerr, NCA director, said...' #YES

text = u'victims after the British. Will Kerr, NCA director, said...' #NO

text = u'The judge, mr Justice MacDonald, said'#YES

text = u'The judge, Mr Justice MacDonald, said'#NO
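
One possible workaround, based only on the observation in the examples above, is to lowercase known titles before passing the text to the model. The sketch below is an assumption-laden illustration: the title list and the name `lowercase_titles` are hypothetical, and because the model is statistical there is no guarantee that this changes the outcome on other sentences.

```python
import re

# Hypothetical list of titles to normalise; this is not something spaCy provides.
TITLES = ("Dr", "Mr", "Mrs", "Ms", "MP")

# A known title followed by an optional full stop, whitespace and a capital letter.
TITLE_BEFORE_NAME = re.compile(r"\b(%s)(?=\.?\s+[A-Z])" % "|".join(TITLES))


def lowercase_titles(text):
    """Lowercase known titles that directly precede a capitalised word.

    The result has the same length as the input, so character offsets of
    entities found in it still point at the same span of the original text.
    """
    return TITLE_BEFORE_NAME.sub(lambda m: m.group(1).lower(), text)


print(lowercase_titles(u"The judge, Mr Justice MacDonald, said"))
```

Because only the case changes, an entity's start and end character offsets from the normalised text can be used to slice the original string and recover the name with its original spelling.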

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.