The v2.0 NER is generally better. You can try the demo here: https://demos.explosion.ai
I don't really know what to tell you in general though --- context and vocabulary item both matter, so having the NER be sensitive to punctuation, titles, etc. is expected behaviour.
There's also no such thing as "fixed" here, at least not in the sense I think you're asking for. We might update a model and all the cases you're testing might go from green to red, even if our statistics show the model generally has improved. That's also within expected behaviour --- it's a statistical model. You might prefer creating a rule-based system if you want finer-grained control and more predictability. The Matcher class is useful for that.
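For example, a minimal Matcher sketch (the pattern and text here are illustrative, and this uses the spaCy v3 API; in v2, matcher.add takes an on_match callback as its second argument):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Illustrative pattern: the title "Dr" (which the English tokenizer may keep
# as a single token "Dr."), optional punctuation, then a capitalized word --
# a deterministic rule the statistical NER can't break.
pattern = [{"LOWER": {"IN": ["dr", "dr."]}},
           {"IS_PUNCT": True, "OP": "?"},
           {"IS_TITLE": True}]
matcher.add("TITLED_NAME", [pattern])

doc = nlp("We spoke with Dr. Luciana about the results.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "Dr. Luciana"
```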
We also have an annotation tool, https://prodi.gy, to help people fine-tune the models on their data. This isn't a silver bullet either, though: there's no way to guarantee how much or in what way you'll need to annotate data to improve the model's performance.
Thank you, yes, I'm trying to add more and more rules. But when it comes to the names themselves, I'm not sure I even understand the behaviour. Why does it pick James but not Luciana in the same context? I'd like to hear your thoughts.
> Why does it pick James but not Luciana in the same context? I'd like to hear your thoughts.
There's no clear answer for that – the model is statistical, so it's not saying "XY is a person". It's saying that based on the context, it's very likely that "XY" should have the label PERSON, based on what the model has seen so far.
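A quick way to see this in practice (a sketch; the exact predictions depend on the model version you load):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Same context, two different names: the model may label one and miss the
# other, because the prediction depends on the word as well as the context.
for text in ["I met James yesterday.", "I met Luciana yesterday."]:
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])
```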
How the model performs always depends on the data it was trained on. spaCy's English models were trained on a general-purpose corpus of news and web text, which gives pretty good accuracy overall for most regular text types. But it can still make mistakes when you feed it your own, very specific data that might be different from what the model has seen during training.
That's why it's important to always fine-tune the model on your own data. You can do this by adding rules or by updating the entity recognizer with more annotations and labelled examples.
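As a rough sketch of the second option, assuming the spaCy v3 training API (v2's nlp.update took raw texts and annotation dicts instead of Example objects) and made-up annotations:

```python
import random
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

# Hypothetical annotations in (text, {"entities": [(start, end, label)]}) form.
# In practice you'd also include examples of other entity types, so the
# model doesn't "forget" them while being updated.
TRAIN_DATA = [
    ("We hired Luciana last week.", {"entities": [(9, 16, "PERSON")]}),
    ("James joined the team in March.", {"entities": [(0, 5, "PERSON")]}),
]

optimizer = nlp.resume_training()
for _ in range(10):
    random.shuffle(TRAIN_DATA)
    for text, annots in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annots)
        nlp.update([example], sgd=optimizer)
```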
Hi both, I'm sorry for insisting on this... But after trying to tune the model for weeks now, I think I have identified one problem that persists in both the 1.8 and 2.0 versions. Basically, if a Person entity comes after a word with a capital letter, spaCy doesn't recognize it as such. At first I thought that punctuation played the main role, but now I'm pretty sure it is the capitalization of the preceding word. I'm not sure how to create a rule for this. I'll really appreciate your suggestions.
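One possible rule-based workaround for cases like this is an entity ruler that runs before the statistical NER; a minimal sketch, assuming the spaCy v3 add_pipe API and an illustrative name list (in v2.1+ you'd construct EntityRuler(nlp) and pass the instance to nlp.add_pipe):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# An entity ruler placed before the NER sets entities deterministically,
# so names the model misses after a capitalized word are still caught;
# the statistical NER won't overwrite entities that are already set.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Luciana"},  # illustrative name list
    {"label": "PERSON", "pattern": "James"},
])

doc = nlp("The CEO Luciana approved the merger.")
print([(ent.text, ent.label_) for ent in doc.ents])
```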
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I have some new issues with spaCy missing Person entities. Below are some examples where spaCy seems to go blind when the name is preceded by certain titles or punctuation. I reported a similar issue before, and it was closed as fixed after a while. But I still find more and more inconsistencies. I added #YES and #NO comments to indicate different outcomes with slight changes in punctuation or even names.
I tried to remove punctuation, but it affected other names (example above), so there's no workaround... I'm using version 1.8.2, but before updating to 2.0 I'd really like to make sure these cases are fixed there, as my first attempt to migrate to 2.0 failed. Could you please advise?
macOS Sierra (10.12.3), Python 2.7, spaCy 1.8.2, Jupyter notebook