explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

📚 Inaccurate pre-trained model predictions master thread #3052

Open · ines opened this issue 5 years ago

ines commented 5 years ago

This thread is a master thread for collecting problems and reports related to incorrect and/or problematic predictions of the pre-trained models.

Why a master thread instead of separate issues?

GitHub now supports pinned issues, which lets us create master threads more easily without them getting buried.

Users often report issues that come down to incorrect predictions made by the pre-trained statistical models. Those are all good and valid, and can include very useful test cases. However, having a lot of open issues around minor incorrect predictions across various languages also makes it more difficult to keep track of the reports. Unlike bug reports, they're much more difficult to action on. Sometimes, mistakes a model makes can indicate deeper problems that occurred during training or when preprocessing the data. Sometimes they can give us ideas for how to use data augmentation to make the models less sensitive to very small variations like punctuation or capitalisation.

Other times, it's just something we have to accept. A model that's 90% accurate will make a mistake on every 10th prediction. A model that's 99% accurate will be wrong once every 100 predictions.

The main reason we distribute pre-trained models is that it makes it easier for users to build their own systems by fine-tuning pre-trained models on their data. Of course, we want them to be as good as possible, and we're always optimising for the best compromise of speed, size and accuracy. But we won't be able to ship pre-trained models that are always correct on all data ever.

For many languages, we're also limited by the resources available, especially when it comes to data for named entity recognition. We've already made substantial investments into licensing training corpora, and we'll continue doing so (including running our own annotation projects with Prodigy ✨) – but this will take some time.

Reporting incorrect predictions in this thread

If you've come across suspicious predictions in the pre-trained models (tagger, parser, entity recognizer) or you want to contribute test cases for a given language, feel free to submit them here. (Test cases should be "fair" and useful for measuring the model's general accuracy, so single words, significant typos and very ambiguous parses aren't usually that helpful.)

You can check out our new models test suite for spaCy v2.1.0 to see the tests we're currently running.

adrianeboyd commented 2 years ago

@mathcass: That's a known bug in the v3.2.0 models, it's fixed in the v3.3.0 models. (See #9853.)

ricardojosehlima commented 2 years ago

I have found a new (and maybe unwanted?) behavior in spaCy 3.3 for pt_core_news_lg.

Previously, in spaCy <3.3 (I am using 3.2):

"Uma menina gosta do menino." ("A girl likes the boy.")

Uma Uma DET det
menina menino NOUN nsubj
gosta gostar VERB ROOT
do do ADP case
menino menino NOUN obj
. . PUNCT punct

Now, for 3.3:

Uma, um, DET, det
menina, menina, NOUN, nsubj
gosta, gostar, VERB, ROOT
do, de o, ADP, case
menino, menino, NOUN, obj
., ., PUNCT, punct

And now there is a word without pos or dep ('de'). Plus, the determiner that is agglutinated with the preposition inherits its pos and dep, but doesn't have a lemma. Was it intended to be so?

As a note, models like Stanza and UDPipe do split contractions such as 'do', but they keep the classifications separate too: 'de' is ADP and case; 'o' is DET and det.

adrianeboyd commented 2 years ago

@ricardojosehlima: I'm not sure how you're viewing the annotation, but the lemma for "do" should be "de o" (with a space). A spaCy Doc can't represent a multi-word token like "de o" for the underlying text "do"; see an explanation of how we merge UD tokens in this blog post: https://explosion.ai/blog/ud-benchmarks-v3-2
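
For reference, a minimal way to inspect the per-token annotation (a sketch assuming pt_core_news_lg v3.3 is installed); the row for "do" should show the lemma "de o" with a space:

import spacy

nlp = spacy.load("pt_core_news_lg")
doc = nlp("Uma menina gosta do menino.")
for token in doc:
    # text, lemma, coarse POS tag and dependency label for each token
    print(token.text, token.lemma_, token.pos_, token.dep_, sep="\t")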

ricardojosehlima commented 2 years ago

Thanks for the quick reply!

Indeed, that was the case. When I changed the way I view the output, the 3.3 behavior matches 3.2.

How I am viewing it, though, is what leads to the difference:

frase_spacy = [(token.orth_, token.lemma_, token.pos_, token.dep_) for token in doc]

frase_spacy_str = ''.join(str(e[0] + '/' + e[1] + '/' + e[2] + '/' + e[3] + '/' +' ') for e in frase_spacy)

print(frase_spacy_str)

for 3.2: Uma/Uma/DET/det/ menina/menino/NOUN/nsubj/ gosta/gostar/VERB/ROOT/ do/do/ADP/case/ menino/menino/NOUN/obj/ ././PUNCT/punct/

for 3.3: Uma/um/DET/det/ menina/menina/NOUN/nsubj/ gosta/gostar/VERB/ROOT/ do/de o/ADP/case/ menino/menino/NOUN/obj/ ././PUNCT/punct/

adrianeboyd commented 2 years ago

That difference is expected. Many v3.3.0 models switched to the new trainable lemmatizer, see: https://spacy.io/usage/v3-3#pipeline-updates

maxTarlov commented 2 years ago

English NER inconsistent on money entities

The English NER model includes the "$" symbol in large quantities of money but not in small amounts of money.

import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

small_doc = nlp('I have $1')
print(small_doc.ents)  # (1,)
print([i[0].ent_type_ for i in small_doc.ents])  # ['MONEY']
large_doc = nlp('I have $1 million')
print(large_doc.ents)  # ('$1 million',)
print([i[0].ent_type_ for i in large_doc.ents])  # ['MONEY']

This occurs in spaCy 3.3 for the en_core_web_sm, en_core_web_md and en_core_web_lg models.

spacy.info()
{'location': '/usr/local/lib/python3.7/dist-packages/spacy',
 'pipelines': {'en_core_web_lg': '3.3.0',
  'en_core_web_md': '3.3.0',
  'en_core_web_sm': '3.3.0'},
 'platform': 'Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic',
 'python_version': '3.7.13',
 'spacy_version': '3.3.0'}

adrianeboyd commented 2 years ago

@maxTarlov: It looks like the training instances in OntoNotes are all normalized to have spaces between $ (or US$, NT$, etc.) and the number (as in "$ 1 million") and they're not consistent about whether the denomination is part of MONEY or not. So I suspect the predicted annotation is always going to be a bit weird/inconsistent for cases like this for these particular pipelines.
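
An informal way to see the effect of that spacing (a sketch assuming one of the v3.3.0 English pipelines is installed; the exact spans will vary by pipeline and version):

import spacy

nlp = spacy.load("en_core_web_sm")
for text in ("I have $1", "I have $ 1", "I have $1 million"):
    doc = nlp(text)
    # compare whether "$" ends up inside the MONEY span in each variant
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])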

nlovell1 commented 2 years ago

Why, in the most recent Spanish trf pretrained model, is the lemma of the pronoun "vos" given as "vo", whereas in the non-trf models it is correctly lemmatized as "vos"?

polm commented 2 years ago

@thinkingbox12 Can you provide example sentences?

The lemmatizer in the Spanish pipelines is rule-based, so I think it just depends on the POS tag and the surface text. I would guess that the model is giving "vos" a different POS tag for some reason.

nlovell1 commented 2 years ago

@polm, sure

import spacy
import pandas as pd
from IPython.display import display  # display() below assumes an IPython/Jupyter session; use print(df) otherwise

nlp = spacy.load('es_dep_news_trf')
text = "Te dije que vos no estabas hecho para la política. ¿Pablo pero vos escuchaste lo que dijo ese desgraciado? Gonzalo, vos no aguantas un chiste. Eh... ¿Cómo es que te llamas vos? ¿Oíste, vos no estás ni un poquito nervioso? ¿Quién no se va a acordar de un hombre como vos? ¿Pero vos qué pretendes, Pablo? Pedro, ¿sos vos? ¿Y vos que sabes de hacer leyes? ¿A vos que te pasa Irma? Ave maría con vos, Pablo."
doc = nlp(text)

lemmas = [(t.i, t.text, t.lemma_, t.pos_, t.morph) for t in doc if t.text == "vos"]
df = pd.DataFrame(lemmas, columns=["idx", "text", "lemma", "pos", "morph"]).head(25)
with pd.option_context("display.max_rows", 100, "display.width", None, "display.max_colwidth", None):
    display(df)

Text, for easier viewing:

Te dije que vos no estabas hecho para la política.
¿Pablo pero vos escuchaste lo que dijo ese desgraciado?
Gonzalo, vos no aguantas un chiste.
Eh... ¿Cómo es que te llamas vos?
¿Oíste, vos no estás ni un poquito nervioso?
¿Quién no se va a acordar de un hombre como vos?
¿Pero vos qué pretendes, Pablo?
Pedro, ¿sos vos?
¿Y vos que sabes de hacer leyes?
¿A vos que te pasa Irma?
Ave maría con vos, Pablo.

And the analysis with TRF:

[screenshot: token-by-token analysis with es_dep_news_trf]

Core news lg instead:

[screenshot: token-by-token analysis with es_core_news_lg]

polm commented 2 years ago

OK, the difference is that the two pipelines assign different parts of speech, resulting in different lemmas.

nlovell1 commented 2 years ago

@polm makes sense, but I guess I was mainly confused about why the lemma of "vos" is "vo" in the trf model. It's not a convention I've seen used in any authoritative source, unless there's another reason behind it. Thanks so much!

itssimon commented 1 year ago

Strange inconsistency in senter:

import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("Location(s): Left")
doc2 = nlp("Location(s): Right")
print(list(doc1.sents))
print(list(doc2.sents))

# Prints:
# [Location(s): Left]
# [Location(s):, Right]

Even though the texts are structurally identical, one gets divided up into two sentences and the other one doesn't.

richardpaulhudson commented 1 year ago

Unfortunately this is the sort of thing you have to live with when using machine-learning models. In this case, however, one could speculate as to why the model might conceivably have learned such behaviour: "Right" is an interjection that can begin a sentence or even form a sentence of its own (e.g. "Right, come on!"), while the same is not true of "Left".

rjadr commented 1 year ago

I'm running into a persistent inconsistency with the xx_ent_wiki_sm model: it is unable to identify trailing entities (i.e. entities at the end of the input text).

While the default English model works fine:

nlp = spacy.load("en_core_web_sm")
doc = nlp("Amsterdam and Rotterdam")
print([ent.text for ent in doc.ents if ent.label_ == "GPE"])

and correctly prints ['Amsterdam', 'Rotterdam'], the multi-language model misses the last entity:

nlp = spacy.load("xx_ent_wiki_sm")
doc = nlp("Amsterdam and Rotterdam")
print([ent.text for ent in doc.ents if ent.label_ == "LOC"])

This will output: ['Amsterdam']

If we add any random string to the sentence, e.g. "Amsterdam and Rotterdam hi", it does recognize the entity: ['Amsterdam', 'Rotterdam'].

adrianeboyd commented 1 year ago

@rjadr My best guess is that nearly all training sentences from Wikipedia have sentence-final punctuation, so the model has learned that it's unlikely for the final token in a text to be part of an entity. Looking at the training data, I can count ~900k document-final tokens and only ~2k are entities.
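
A rough way to probe this hypothesis (a sketch assuming xx_ent_wiki_sm is installed; whether a final period alone is enough to change the prediction would need to be checked):

import spacy

nlp = spacy.load("xx_ent_wiki_sm")
for text in ("Amsterdam and Rotterdam", "Amsterdam and Rotterdam.", "Amsterdam and Rotterdam hi"):
    doc = nlp(text)
    # see whether the trailing entity is recognized once the text no longer ends on it
    print(repr(text), [ent.text for ent in doc.ents if ent.label_ == "LOC"])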

adrianeboyd commented 1 year ago

@Woodchucks: We also noticed this, and it appears to be a problem related to the whitespace augmentation in the training settings for a tagger that's trained on its own rather than with a shared tok2vec. Polish is the only language in the provided trained pipelines with a completely independent tagger component.

To be honest the behavior is pretty bizarre and surprising. It doesn't show up (at least not enough to lead to much lower TAG scores) in evaluations of the dev data, which might be due to fewer unseen tokens in the dev data from the same corpus, and it's still possible there's an underlying bug. We haven't noticed this for other languages, so it seems like training a tagger with a shared tok2vec (with a morphologizer, lemmatizer, and/or parser) prevents the model from predicting that unseen tokens might be _SP, but in this case, the tagger on its own seems to lump whitespace tokens and unseen tokens into the same category.

The upcoming v3.5.0 trained pipelines for Polish should improve this by adding IS_SPACE as a feature so that the model has enough information to differentiate whitespace tokens from other tokens.
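
For anyone who wants to check their own pipeline version, a small probe along these lines may help (the sentence and the out-of-vocabulary token "xyzzyx" are made up for illustration):

import spacy

nlp = spacy.load("pl_core_news_sm")
# "xyzzyx" is a deliberately unseen token; in affected versions it may get a
# whitespace-like tag, while known tokens are tagged normally
doc = nlp("To jest xyzzyx.")
print([(t.text, t.tag_, t.pos_) for t in doc])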

Woodchucks commented 1 year ago

@adrianeboyd Thank you for the fast reply. I didn't notice your response, so I deleted my comment and published it again as issue #12002. Sorry for the inconvenience. Glad to hear that the new version will have the IS_SPACE feature implemented.

stefan-veezoo commented 1 year ago

Hi, I encountered an issue where, in German, the token "20-Plus" is wrongly tagged as SPACE, which could hint at a data issue:

https://demos.explosion.ai/displacy?text=Kunden%20mit%20dem%20Produkt%2020-Plus&model=de_core_news_sm&cpu=1&cph=1

adrianeboyd commented 1 year ago

This is related to the same underlying issue as #12002, where data augmentation involving whitespace seems to sometimes lead to unknown words being tagged as SPACE.

Maybe we should just add IS_SPACE to all the models now and consider updating SHAPE to normalize spaces in v4 so that we can drop IS_SPACE, since there's a slight speed hit.

probavee commented 1 year ago

Hello! Following the answer I got in this discussion, I'm reposting my issue in this master thread. I'm using the French transformer model fr_dep_news_trf. When processing the sentence "Je vais skier dans les Alpes de France cet hiver." the model accurately predicts that "Alpes" is a PROPN. But when I duplicate the sentence, as in "Je vais skier dans les Alpes de France cet hiver. Je vais skier dans les Alpes de France cet hiver.", it now tags "Alpes" as a NOUN.

Here are two examples with different versions of the model, run in a Linux environment with Python 3.10.

spacy-transformers == 1.2.0
spacy == 3.5.0
fr_dep_news_trf == 3.5.0
> doc = nlp("Je vais skier dans les Alpes de France cet hiver.")
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('Alpes', 'PROPN')]

> doc = nlp("Je vais skier dans les Alpes de France cet hiver. " *10)
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN')]

With another version there are far fewer wrong predictions, but still some at some point.

spacy-transformers == 1.1.9
spacy == 3.4.4
fr_dep_news_trf == 3.4.0
> doc = nlp("Je vais skier dans les Alpes de France cet hiver.")
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('Alpes', 'PROPN')]

> doc = nlp("Je vais skier dans les Alpes de France cet hiver. " *10)
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('alpe', 'NOUN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('alpe', 'NOUN')]

I'd like to know whether this is expected from the model or not. Is it just because I don't give it enough context, or is it something else? The word "France" in these sentences is always tagged correctly. It seems there is always a threshold of tokens beyond which the predictions go wrong.

Thank you for your help!

postnubilaphoebus commented 1 year ago

spaCy's English named entity recognition has issues with apostrophes. Using spaCy 3.5.0, please try the following code:

import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
doc = nlp("That had been Megan's plan when she got him dressed earlier.")
labels = [ent.label_ for ent in doc.ents]
entity_text = [ent.text for ent in doc.ents]
print(labels) 
print(entity_text)

This returns ['ORG'] for "Megan" instead of ['PERSON']. Similar issues occur with, for example, the word "Applebee's".

rmitsch commented 1 year ago

Thanks for reporting this, @postnubilaphoebus. The small model doesn't do that well with names that don't occur often enough in the training data. I recommend giving en_core_web_md a shot (it infers the correct entity label in your example).
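
For comparison, a minimal side-by-side check (a sketch assuming both pipelines are installed):

import spacy

text = "That had been Megan's plan when she got him dressed earlier."
for model_name in ("en_core_web_sm", "en_core_web_md"):
    nlp = spacy.load(model_name)
    doc = nlp(text)
    # the md pipeline is reported to label "Megan" as PERSON here
    print(model_name, [(ent.text, ent.label_) for ent in doc.ents])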

stestagg commented 1 year ago

Hi!

We've spotted some nsubj/dobj mix-ups when parsing sentences that start with "Make", using en_core_web_trf (3.5):

For example:

import spacy
print(f'Spacy={spacy.__version__}')
en = spacy.load('en_core_web_trf')
print(f'Lang={en.path.name}')
sent = en('Make the compression used between map reduce tasks configurable.')
' '.join([f'{t}({t.dep_})' for t in sent])

Outputs:

Spacy=3.5.0
Lang=en_core_web_trf-3.5.0

'Make(ROOT) the(det) compression(nsubj) used(acl) between(prep) map(nmod) reduce(compound) tasks(pobj) configurable(ccomp) .(punct)'

There should not be an nsubj in this sentence. This should be:

'Make(ROOT) the(det) compression(dobj) used(acl) between(prep) map(nmod) reduce(compound) tasks(pobj) configurable(ccomp) .(punct)'

Other examples include:

Make the output of the reduce side plan optimized by the correlation optimizer more reader-friendly.
Make ZooKeeper easier to test - support simulating a connection loss
Make compaction more robust when stats update fails
...

All of these put an nsubj where there should be a dobj.

Note: I tested 3.3.4 and 3.4.4, and they seemed to do the same thing.

adrianeboyd commented 1 year ago

Imperatives and questions are two very common things that most of our trained pipelines perform poorly on because they are rare in typical newspaper training data.

cbowdon commented 1 year ago

What is the NER training data for English please? I see some models (e.g. German) are trained on WikiNER but none of the referenced sources for English models (e.g. here) are related to NER.

Apologies if this is the wrong place to ask, I was drawn here from other related issues.

adrianeboyd commented 1 year ago

Hi @cbowdon, OntoNotes does contain NER annotation, see: https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

cbowdon commented 1 year ago

@adrianeboyd Thank you!

giova-p commented 1 year ago

Hi there!

I've come across an anomaly in the parsing component of the 'en_core_web_sm' model. Specifically, I've noticed that the verb 'need' is sometimes labeled as the root of the sentence, while in other cases, it's labeled as an 'aux'.

Even more strangely, when the same sentence is repeated twice or more, the behavior of the parsing component becomes erratic. Take this example: "the member states need not do something. the member states need not do something." In the first sentence, the subject is a "child" of the root verb 'do', while in the second sentence (which is identical!), the subject is the child of the 'aux'.

I've tried to replicate this behavior with other examples, but the anomaly is not always present. I'd appreciate any insights or suggestions on whether you think this could arise in other circumstances as well.

Thanks! atb g.

adrianeboyd commented 12 months ago

Hi @giova-p, yes, the predictions of the statistical models depend on a context window that can go beyond a single sentence, so you will see differences like this in practice.

A pipeline should output the same predictions for the exact same input text string every time, but if anything is modified in the text, even adding whitespace, you may see different predictions.
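
As an illustration, a sketch based on the example above (assuming en_core_web_sm; the exact labels will depend on the pipeline version):

import spacy

nlp = spacy.load("en_core_web_sm")
sent = "the member states need not do something."
for text in (sent, sent + " " + sent):
    doc = nlp(text)
    # compare the dependency label and head of "need" and "states" in each run
    print([(t.text, t.dep_, t.head.text) for t in doc if t.text in ("need", "states")])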

Arjuman23 commented 11 months ago

I have identified a discrepancy in the entities detected by the "en_ner_bc5cdr_md-0.5.1" model between results obtained from a Windows system and an Ubuntu system. According to the readme file of the "en_ner_bc5cdr_md-0.5.1" model, it is trained up to Spacy version 3.5.0. Interestingly, this alignment holds true for the Windows system. Whenever I adjust the Spacy version to a value above 3.5.0, the named entity recognition (NER) results are no longer produced. The model en_ner_bc5cdr_md-0.5.0 worked irrespective of the spacy version.

However, an interesting scenario emerged when I conducted the same experiment on an Ubuntu system. Here, the "en_ner_bc5cdr_md-0.5.1" model generated NER outputs regardless of the Spacy version I employed. I even tested it with versions like 3.6.1 and even lower than that.

This leads me to the question: Why is this discrepancy in behavior occurring between the Windows and Ubuntu systems? Is this a known issue? Am I missing something??

svlandeg commented 11 months ago

Hi @Arjuman23,

If I understand you correctly, both en_ner_bc5cdr_md-0.5.0 and en_ner_bc5cdr_md-0.5.1 work fine on Ubuntu & Windows within the spaCy ranges specified for these models, right?

From the release notes, I gather that the 0.5.0 models were trained with 3.2.3 and the 0.5.1 models with 3.4.x. Note that we don't actually train or maintain these models - AllenAI does.

In general, you can run python -m spacy validate to double check whether a model in your environment is compatible with the spaCy version. If it's not, I'm afraid we can't really make any guarantees about its behaviour.

https://github.com/allenai/scispacy/issues

Arjuman23 commented 11 months ago

Hi @svlandeg, thank you for your response. Much appreciated. You've got it right: both models work fine within the spaCy ranges specified in their readme files, but only on Windows. On Ubuntu, they also work on the latest spaCy versions (e.g. 3.6.1) without any hassle. I totally agree that AllenAI maintains them, but I didn't know how to report this to them, hence I came down to the roots :P If you can connect me with them, it would be helpful.

svlandeg commented 11 months ago

You could contact them through their issue tracker, but to be honest I'm not sure there's a bug to be solved here. The expected behaviour is that the models work within their range, and not outside of it. They might accidentally work on some systems outside of the "correct" spaCy range, for various reasons I'm not sure of. Again, you can ask them / report this to them, but I don't think there's anything to be fixed here (I agree it's weird behaviour, though).

Mindful commented 9 months ago

I'm not sure if this counts as a pre-trained model prediction given that the tokenizer is rule-based, but it looks like spaCy's English tokenizer splits the verb "wed". See below: https://demos.explosion.ai/displacy?text=The%20couple%20was%20wed%20yesterday.&model=en_core_web_sm&cpu=1&cph=1

If this isn't a mistake, I can imagine it might be a way to deal with the common typo of "we'd" as "wed", but it's a little inconvenient.

Edit: the same thing happens with the noun "cant". I'm not sure there's a good way to fix this; it seems like you would need POS or syntax information to judge whether something is likely to be a typo or not.
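
If the splitting is a problem for a specific application, one possible workaround (a sketch, not an official recommendation) is to override the tokenizer exception so that "wed" stays a single token, at the cost of no longer catching the "we'd" typo:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
print([t.text for t in nlp("The couple was wed yesterday.")])  # "wed" reportedly gets split here
# Overriding the special case keeps "wed" as one token
nlp.tokenizer.add_special_case("wed", [{ORTH: "wed"}])
print([t.text for t in nlp("The couple was wed yesterday.")])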

cyriaka90 commented 7 months ago

Hey, here are some inaccurate parses I encountered (all using spacy version 3.7.2):

glangford commented 6 months ago

The following Portuguese sentences, which all have a verb capitalized to start the sentence, result in an incorrect lemma for the verb (pt_core_news_lg, spaCy 3.7.2):

"Trabalharam com honra e dignidade e estiveram entre os melhores." "Fale só um bocadinho sobre o Festival." "Surge detrás das cortinas." "Encontrei as chaves." "Reserve voos baratos." (this one is from the earlier comment https://github.com/explosion/spaCy/issues/3052#issuecomment-1866537324)

In each case, the lemma of the first word is given as the word unchanged.

If the first word is lowercased, the correct lemmas are produced (trabalhar, falar, surgir, encontrar, reservar).
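
A minimal reproduction sketch for one of these (assuming pt_core_news_lg 3.7.x):

import spacy

nlp = spacy.load("pt_core_news_lg")
for text in ("Encontrei as chaves.", "encontrei as chaves."):
    doc = nlp(text)
    # reported behavior: the capitalized form keeps "Encontrei" unchanged as its lemma,
    # while the lowercased form is lemmatized to "encontrar"
    print(text, "->", doc[0].lemma_)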