explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

📚 Inaccurate pre-trained model predictions master thread #3052

Open ines opened 5 years ago

ines commented 5 years ago

This thread is a master thread for collecting problems and reports related to incorrect and/or problematic predictions of the pre-trained models.

Why a master thread instead of separate issues?

GitHub now supports pinned issues, which lets us create master threads more easily without them getting buried.

Users often report issues that come down to incorrect predictions made by the pre-trained statistical models. Those are all good and valid, and can include very useful test cases. However, having a lot of open issues around minor incorrect predictions across various languages also makes it more difficult to keep track of the reports. Unlike bug reports, they're much more difficult to act on. Sometimes, mistakes a model makes can indicate deeper problems that occurred during training or when preprocessing the data. Sometimes they can give us ideas for how to use data augmentation to make the models less sensitive to very small variations like punctuation or capitalisation.

Other times, it's just something we have to accept. A model that's 90% accurate will make a mistake on every 10th prediction. A model that's 99% accurate will be wrong once every 100 predictions.

The main reason we distribute pre-trained models is that it makes it easier for users to build their own systems by fine-tuning pre-trained models on their data. Of course, we want them to be as good as possible, and we're always optimising for the best compromise of speed, size and accuracy. But we won't be able to ship pre-trained models that are always correct on all data ever.

For many languages, we're also limited by the resources available, especially when it comes to data for named entity recognition. We've already made substantial investments into licensing training corpora, and we'll continue doing so (including running our own annotation projects with Prodigy ✨) – but this will take some time.

Reporting incorrect predictions in this thread

If you've come across suspicious predictions in the pre-trained models (tagger, parser, entity recognizer) or you want to contribute test cases for a given language, feel free to submit them here. (Test cases should be "fair" and useful for measuring the model's general accuracy, so single words, significant typos and very ambiguous parses aren't usually that helpful.)

You can check out our new models test suite for spaCy v2.1.0 to see the tests we're currently running.

mauryaland commented 5 years ago

Doing an annotation project for the pre-trained models with Prodigy looks like a really good idea! Do you have any idea when it could happen and who will be able to participate?

ines commented 5 years ago

From #3070: English models predict empty strings as tags (confirmed also in nightly).

>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("I like  London and Berlin")
>>> [(t.tag_, t.pos_) for t in doc]
[('PRP', 'PRON'), ('VBP', 'VERB'), ('', 'SPACE'), ('NNP', 'PROPN'), ('CC', 'CCONJ'), ('NNP', 'PROPN')]

From #2313: Similar problem in French (confirmed also in nightly).

>>> nlp = spacy.load("fr_core_news_sm")
>>> doc = nlp("Nous a-t-elle précisé ceci?")
>>> [x.pos_ for x in doc] # ['PRON', '', 'PART', 'PRON', 'VERB', 'PRON', 'PUNCT']
['PRON', '', 'ADV', 'PRON', 'VERB', 'PRON', 'PUNCT']
>>> doc = nlp("Nous a t-elle précisé ceci?")
>>> [x.pos_ for x in doc] # ['PRON', 'AUX', 'NOUN', 'VERB', 'PRON', 'PUNCT']
['PRON', 'AUX', 'VERB', 'VERB', 'PRON', 'PUNCT']

honnibal commented 5 years ago

@mauryaland I hope we can have annotations starting in January. The first data to be annotated will be English and German, with other annotation projects hopefully starting fairly quickly.

We'll probably be hiring annotators to do most of the work. We might do a little bit of "crowd sourcing" as a test, but we mostly believe annotation projects run better with fewer annotators. What we would benefit from is having one person per treebank overseeing the work, communicating with the annotators, and making language-specific annotation policy decisions.

mehmetilker commented 5 years ago

I am trying to upgrade from 2.0.x to 2.1, but I'm seeing different results for the small English model. It may not be meaningful to come to conclusions case by case, but I see decreased accuracy for some POS tags and dependency labels.

In general, verbs become nouns, and dependencies are lost or changed. Especially this one: (2.0) tight/VB [advmod] tight/RB vs. (2.1) tight/VB [acomp] tight/RB; for acomp (adjectival complement) to apply, "tight" should be an adjective, I guess.

Should I assume the new models (2.1.0a5) will change when 2.1 is released, or should we not expect changes?

POS Problems

VB > NN doubles/VB > NNS
From the restaurant, the Seventh's boundary doubles back east along the Pennsylvania Turnpike.

hooking/VB > NN
No hooking up with college kids.

Dependency changes

sit tight
tight/VB [advmod] tight/RB > advmod > (2.1) acomp (adjectival complement but tight is adverb)
As for Russia's sovereign debt, most investors are sitting tight, believing Washington will not bar investors from it, even if the U.S.

speed [compound] skating > no dep (2.1)
Yes, but all the Dutch medals are in speed skating only.

cross [acl] examined > no dep
Mr Goodwin is due to be cross examined on 8 June, the day of the general election.

test [dep] fly > dependency reversed
They hope to test-fly their craft at Clow International Airport.

cross [npadvmod] examines
(2.1) dependencies connected over the punctuation "-"
The witness was cross-examined by the defense.

amperinet commented 5 years ago

Hi guys !

Just a quick question regarding the missing tags issue mentioned above (from #2313: similar problem in French, confirmed also in nightly): does this come from the models? Are you working on this? In case it can help, I am adding examples with missing tags:

Thank you!

adrianeboyd commented 5 years ago

I'm not sure whether this belongs here or in its own issue, but I noticed that the tagger in spacy 2.1 en_core_web_md (2.1.0) seems to have some major problems.

I ran a quick evaluation on the PTB tags in UD_English-PUD with the following results (without normalizing punctuation tags, so the actual results would be a bit higher):

Model Tagging Acc.
------------------
sm       0.945
md       0.792
lg       0.952

The performance is similar for UD tags and for other corpora. With spacy 2.0, the results for all three models are similar.
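
For anyone who wants to reproduce this kind of check, here is a rough sketch of the evaluation (not the exact script used here; the CoNLL-U file name is a placeholder and gold tokenization is assumed):

import spacy
from spacy.tokens import Doc

def read_conllu(path):
    # Yield (words, xpos_tags) for each sentence of a CoNLL-U file.
    words, tags = [], []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if words:
                    yield words, tags
                words, tags = [], []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multi-word ranges like "3-4" and empty nodes
                    words.append(cols[1])
                    tags.append(cols[4])
    if words:
        yield words, tags

nlp = spacy.load("en_core_web_md")
tagger = nlp.get_pipe("tagger")
correct = total = 0
for words, gold_tags in read_conllu("en_pud-ud-test.conllu"):
    doc = tagger(Doc(nlp.vocab, words=words))  # tag the gold tokenization directly
    for token, gold in zip(doc, gold_tags):
        correct += token.tag_ == gold
        total += 1
print("PTB tag accuracy:", correct / total)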

I suspect these problems are what led to this hacky modification to a model test case, which now doesn't catch the error it's supposed to catch:

https://github.com/explosion/spacy-models/commit/b516a3bd066f8dc483e69a7aa99a26ea9566d687#diff-09cdc890bfe36b8c3ac094953ad251bd

Below are simplified confusion matrices for the more frequent non-punctuation tags for md vs lg, where you can see that something has gone wrong in the md model (sm looks similar to lg). I was hoping to see a clear pattern that explained the errors (like two consistently swapped tags), but it's so all over the place that my first guess would be that there was an offset error for some portion of the training data.

[confusion matrix images: spacy21_en_core_web_md and spacy21_en_core_web_lg]

alessio-greco commented 5 years ago

The model (at least "en_core_web_sm") makes incorrect predictions whenever capitalization is not used correctly. For example, compare the predictions for "j.k. rowling wishes snape happy birthday in the most magical way" and "J.K. Rowling Wishes Snape Happy Birthday In The Most Magical Way": https://puu.sh/DiNTh/d3b940ef65.png

First has "rowling" considered as a verb, despite wishes being a verb. Second tend too easily to assign NNP and assigning "ROOT" Best version would be "J.K. Rowling wishes Snape happy birthday in the most magical way", which still makes "Snape" ent_typeB GPE.

This kind of error is constant whenever a normally capitalized name (e.g. "United States") isn't capitalized, or a normally non-capitalized word is capitalized. This causes problems when applying the model to, for example, headlines (where the first letter of every word is capitalized).

adam-ra commented 5 years ago

In [6]: nlp('acetaminophen')[0].tag_, nlp('acetaminophen')[0].pos_
Out[6]: ('UH', 'INTJ')

dlemke01 commented 5 years ago

Hi, the lemma for "multiplies" should be "multiply", right?

(Pdb) tmp = nlp(u"A rabbit multiplies rapidly by having lots of sex.")
(Pdb) tmp
A rabbit multiplies rapidly by having lots of sex.
(Pdb) [token.lemma_ for token in tmp]
[u'a', u'rabbit', u'multiplie', u'rapidly', u'by', u'have', u'lot', u'of', u'sex', u'.']

interrogator commented 5 years ago

Sentence tokenisation issue? I thought the nonstandard lexis might be the cause, but normalising it still gives pretty unusual sentence tokenisation:

>>> import spacy                                                
>>> nlp = spacy.load('en')
>>> s = 'Me and you are gonna have a talk. \nSez who? \nSez me. \nHey! What did I say?'
>>> doc = nlp(s)
>>> for sent_index, sent in enumerate(doc.sents, start=1):
...    print(sent_index, sent.text) 

1 Me and you are gonna have a talk. 
2 Sez
3 who? 
Sez me. 
4 Hey!
5 What did I say?

>>> s = 'Me and you are gonna have a talk. \nSays who? \nSays me. \nHey! What did I say?'
>>> doc = nlp(s)
>>> for sent_index, sent in enumerate(doc.sents, start=1): 
...     print(sent_index, sent.text) 

1 Me and you are gonna have a talk. 

2 Says who? 
Says me. 

3 Hey!
4 What did I say?

ctrngk commented 5 years ago

Another incorrect lemma_

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("the Greys")
[token.lemma_ for token in doc]
# ['the', 'Greys']

"Greys" should be "Grey"

ctrngk commented 5 years ago

another lemma_ conflict

import spacy
nlp = spacy.load('en_core_web_lg')
[token.lemma_ for token in nlp("to be flattered by sth")]
# ['to', 'be', 'flatter', 'by', 'sth']  # correct
[token.lemma_ for token in nlp("to feel flattered that")]
# ['to', 'feel', 'flattered', 'that']  # error

The second "flattered" should be "flatter"

dataframing commented 5 years ago

I've noticed the tokenization and entity recognition around compact numbers (e.g., 10k, 20M) can be a bit of a mixed bag. Here's a tiny snippet:

from pprint import pprint

from spacy.matcher import Matcher
import en_core_web_sm
import spacy

print(f"Spacy version: {spacy.__version__}")
# Spacy version: 2.1.4

nlp = en_core_web_sm.load()
doc = nlp("Compact number formatting: 5k 5K 1m 1M")
pprint([(t.text, t.ent_type_) for t in doc])
# [('Compact', ''),
#  ('number', ''),
#  ('formatting', ''),
#  (':', ''),
#  ('5k', 'CARDINAL'),
#  ('5', 'CARDINAL'),
#  ('K', 'ORG'),
#  ('1', 'ORG'),
#  ('m', ''),
#  ('1', 'CARDINAL'),
#  ('M', '')]

Gnuelp commented 4 years ago

The small English model 2.1 assigns pos_ = VERB to "cat". The medium model works fine, and the small and medium 2.0 models work fine, too. I thought I'd report this despite it being a single-word inaccuracy, since the example is from https://course.spacy.io/, chapter1_03_rule-based-matching.md, section "Matching other token attributes":

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
matcher.add('loving', None, pattern)

# Process some text
doc = nlp("I loved dogs but now I love cats more.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

I -PRON- PRON nsubj loved
loved love VERB ROOT loved
dogs dog NOUN dobj loved
but but CCONJ cc loved
now now ADV advmod love
I -PRON- PRON nsubj love
love love VERB conj loved
cats cat VERB dobj love
more more ADV advmod love
. . PUNCT punct love

Gnuelp commented 4 years ago

Another issue where there is also a problem with the small English model 2.1, but not with the medium model, is related to https://github.com/explosion/spaCy/issues/3305.

import spacy

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
label = nlp.vocab.strings['GREETING']
print (label)

span_with_label = Span(doc, 0, 2, label=label)

# Add span to the doc.ents
doc.ents = [span_with_label]

12946562419758953770


ValueError                                Traceback (most recent call last)
in
     21 print (label)
     22
---> 23 span_with_label = Span(doc, 0, 2, label=label)
     24
     25 # Add span to the doc.ents

span.pyx in spacy.tokens.span.Span.__cinit__()

ValueError: [E084] Error assigning label ID 12946562419758953770 to span: not in StringStore.

adrianeboyd commented 4 years ago

test case: label ID not in StringStore

> (quoting @Gnuelp's comment above about the [E084] StringStore error with the small English model)

@Gnuelp Sorry for the late response here. That's a slightly tricky case where nlp.vocab.strings differentiates between just looking up a string vs. adding a new string to the StringStore. It will work with:

label = nlp.vocab.strings.add('GREETING')

As of spacy 2.1 this is simplified and you can just specify the label as a string in the Span:

span_with_label = Span(doc, 0, 2, label='GREETING')
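
Putting it together, here is a minimal, self-contained version of both working approaches (assuming spaCy v2.1+ and any pretrained English model, with the doc built manually as in the example above):

import spacy
from spacy.tokens import Doc, Span

nlp = spacy.load("en_core_web_sm")
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Item access (nlp.vocab.strings["GREETING"]) only computes the hash without storing
# the string, which is why Span() then fails with E084 when given that bare hash.
# Calling .add() stores the string and returns a hash that is safe to use as a label:
label = nlp.vocab.strings.add("GREETING")
span_with_label = Span(doc, 0, 2, label=label)

# As of spaCy v2.1, the simplest option is to pass the label as a string directly:
span_with_label = Span(doc, 0, 2, label="GREETING")

doc.ents = [span_with_label]
print([(ent.text, ent.label_) for ent in doc.ents])  # [('Hello world', 'GREETING')]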

hawkrobe commented 4 years ago

When I updated to 2.2, I noticed that the pre-trained POS tagger started automatically tagging all sentence-initial nouns as PROPN instead of NOUN as it did before. This is a good heuristic for some datasets, but it threw off my pipeline (my data contains mostly bare noun labels like "man with a newspaper" or "spy", where this behavior was unexpected.)

adrianeboyd commented 4 years ago

@hawkrobe Interesting observation! This is an unexpected side effect of trying to make the models less sensitive to capitalization overall, with the intention of improving performance on data that doesn't look like formally edited newspaper text. The main difference from the 2.1 models is that some training sentences are randomly lowercased. Since your data is not really the kind of data spacy's models are intended for, I don't think it makes sense to try to optimize the general-purpose models for this case (although it's still very useful to be made aware of these kinds of changes in behavior!).

One possible workaround is to use a framing sentence that you can insert your phrases into that looks more like newspaper text. Something like:

"The president saw the [bare NP] yesterday."

Then you are much more likely to get the correct analysis and you can extract the annotation that you need.
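
A rough sketch of that trick, assuming the frame's prefix and suffix tokenize the same on their own as they do in context (the frame text itself is just an example):

import spacy

nlp = spacy.load("en_core_web_sm")

PREFIX = "The president saw the "
SUFFIX = " yesterday."
N_PREFIX = len(nlp.tokenizer(PREFIX.strip()))  # tokens contributed by the frame prefix
N_SUFFIX = len(nlp.tokenizer(SUFFIX.strip()))  # tokens contributed by the frame suffix

def analyse_bare_np(phrase):
    # Parse the bare NP inside the newspaper-like frame, then return only its tokens.
    doc = nlp(PREFIX + phrase + SUFFIX)
    return [(t.text, t.tag_, t.pos_) for t in doc[N_PREFIX:len(doc) - N_SUFFIX]]

print(analyse_bare_np("man with a newspaper"))
print(analyse_bare_np("spy"))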

hawkrobe commented 4 years ago

@adrianeboyd aha, I appreciate the info about the changes that led to this side effect, and I love the idea of using the framing sentence to bring the bare NPs closer to in-sample text.

endersaka commented 4 years ago

Hi,

I have noticed some inconsistencies with the Lemmatizer and the Stop Words for the Italian model. I don't know if this is the best place to report it, but please forgive me: I am not very experienced with ML, DL, models and so on, and I am still learning. In particular, I am not sure whether lemmatization is performed by the pretrained model or by something else. Currently I am following the tutorial "Classify text using spaCy" to test and understand what spaCy is capable of.

I can summarize my issue with some code.

import spacy
from spacy.lang.it.stop_words import STOP_WORDS

nlp = spacy.load('it_core_news_sm')
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd, before="parser")

stop_words = STOP_WORDS

doc = nlp('Ciao marco, ieri ero sul lavoro per questo non ti risposi!')

# Create list of word tokens
token_list = []
for token in doc:
    token_list.append(token.text)
print(token_list)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)

filtered_words = []
for word in doc:
    if word.is_stop is False:
        filtered_words.append(word)
print('filtered_words:', filtered_words)

# finding lemma for each word
for word in doc:
    print(word.text, word.lemma_)

This is the result.

['Ciao', 'marco', ',', 'ieri', 'ero', 'sul', 'lavoro', 'per', 'questo', 'non', 'ti', 'risposi', '!']
['Ciao marco, ieri ero sul lavoro per questo non ti risposi!']
filtered_words: [Ciao, marco, ,, risposi, !]
Ciao Ciao
marco marcare
, ,
ieri ieri
ero essere
sul sul
lavoro lavorare
per per
questo questo
non non
ti ti
risposi risposare
! !

First of all a translation of the text. The text means: "Hello marco, yesterday I was at work for this I didn't answer you!".

Then I can say that tokenization is correct.

Subdivision into sentences is not very important to me, but I think it is correct, although I am not sure whether "sentence" is meant as a synonym of "period" here: if yes, it's correct; if the sentencizer is also supposed to split the parts of a "period", it is wrong.

The true problems come with the Stop Words and the Lemmatizer. I am not sure why the stop words include "lavoro" ("work") in particular and, possibly, "ieri" ("yesterday"). Can any topic detector extract any valuable meaning from only [Ciao, marco, ,, risposi, !]? That is, "Hello", the name "Marco" and the verb "to answer"...

Finally, lemmatization probably gives the worst results.

"Marco" is a name; "marcare" is a verb and means "to mark". And "risposi" means "I answered to you" and doesn't stem to "risposare" but "rispondere". "risposare" means "to marry again".

If I can fix such errors somehow by myself, let me know. I would really like to do it if I can. Otherwise, I hope this can be of some help.

Any explanation or help is welcome; I am quite ignorant about this kind of software at the moment.

Thank you

adrianeboyd commented 4 years ago

@endersaka: This is the correct place, thanks!

For languages where we don't have a better rule-based lemmatizer, spacy uses lemma lookup tables, which can be better than not having a lemmatizer, but they aren't great. There are some errors in the tables and the results are often not so great because it can't provide multiple lemmas for words like "risposi" or disambiguate based on POS tags. It looks like "risposi" is ambiguous (if you don't have any other context), so the simple lookup table is never going to handle this word correctly in all cases. It looks like "Marco" returns "Marco" and "marco" returns "marcare".

The one advantage of simple lookup tables is that they are easy to understand and modify. The table is here: https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/it_lemma_lookup.json

The stop words, which look like they've been unmodified since 2016, are here: https://github.com/explosion/spaCy/blob/master/spacy/lang/it/stop_words.py

Stop words are often pretty task-specific, so it's hard to provide a perfect list for everyone. Here's how to modify the stop words: https://stackoverflow.com/a/51627002/461847
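
As a concrete illustration, here is a small sketch of both kinds of runtime tweaks (assuming spaCy v2.2+ with spacy-lookups-data installed; whether these particular tweaks make sense depends on your task):

import spacy

nlp = spacy.load("it_core_news_sm")

# Remove "lavoro" from the stop words at runtime (see the Stack Overflow answer above):
nlp.Defaults.stop_words.discard("lavoro")
nlp.vocab["lavoro"].is_stop = False

# Patch the lookup lemmatizer table in place:
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
lemma_lookup["risposi"] = "rispondere"

doc = nlp("Ciao marco, ieri ero sul lavoro per questo non ti risposi!")
print([(t.text, t.lemma_, t.is_stop) for t in doc])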

We're always happy to get pull requests or other updates from people with more expertise in a particular language! If you potentially have a better source for Italian stop words, let us know...

graftim commented 4 years ago

I've opened an issue before I saw that this thread existed. You can find it here: #4611

endersaka commented 4 years ago

@adrianeboyd Sincere thanks for the explanation. Currently I am taking a look at the documentation on the spaCy website, and more precisely at the Adding Languages page, to understand the architecture.

About "risposi", yes! I confirm. It's a very special case. Actually the verb token of the character sequence "non ti risposi" can have two different meanings in two different tenses, depending on the context. In my example case the verb means "to answer" coniugated in simple past tense (that tense, in Italian, translates to "passato remoto", literally "remote past"); "ti", of course, is the pronoun "you" the object of the sentence, receiving the action. "non" is negation. In the case produced by the lemmatizer spaCy takes in to account the other meaning in which the verb is reflective with the help of the pronoun "ti" (also called pronominal verb). In English such reflective form (literally translated) would mean (to be as precise as possible) "you didn't marry yourself once again" :-D

Apart from this funny and interesting matter, I will read the spaCy documentation carefully and try to understand how the Lemmatizer works, so I can possibly propose some modifications in the future.

alessio-greco commented 4 years ago

@endersaka spaCy models are not caseless. The same happens with the English model: basically, it won't analyze a phrase with the wrong casing well. Either train another model with a caseless dataset or try a truecaser if your dataset has wrong casing.

adrianeboyd commented 4 years ago

@killia15 The problem is that there are not any occurrences of the word "tu" in the training data that we're using for French, which comes from an older release of this corpus: https://github.com/UniversalDependencies/UD_French-Sequoia/ . The sources are (according to their README): Europarl, Est Republicain newspaper, French Wikipedia and European Medicine Agency.

This is a relatively common problem in corpora that are based on formally edited texts like newspapers or encyclopedia-style texts like wikipedia. I see "vous" but not "tu". This is clearly not great!

Hopefully in the future we can train models with more/better data that won't have this problem. The newest version of the UD GSD corpora, which was released last week, has dropped the non-commercial restriction, so that can potentially provide more data for French (and a few other languages). If you're interested in training a model for French with UD_French-GSD to use now, I can provide a sketch of how to convert and train, which is pretty easy with spacy's CLI commands.

killia15 commented 4 years ago

@adrianeboyd That would certainly explain it! Though great timing with the UD GSD corpora. I would be very interested in training the model. Let me know how I can help. My goal for spacy is to use it for a project where we’re automatically analyzing French texts (news articles, blog posts, poems, passages from books etc) to predict which vocabulary and grammar structures a student will and won’t know so we can make recommendations to their instructor on what they should be working on. Our goal is to publish a paper on it so we’re certainly invested in the success of Spacy’s French model!

adrianeboyd commented 4 years ago

@killia15:

The current spacy release doesn't handle subtokens in a particularly good way (you only see the subtoken strings like de les rather than des), but you can convert the data and train a model like this:

spacy convert -n 10 -m train.conllu .
spacy convert -n 10 -m dev.conllu .
spacy train fr output_dir train.json dev.json -p tagger,parser

After converting the data, you'll have tags that look like this:

NOUN__Gender=Masc|Number=Sing

After converting and before training, make sure the current lang/fr/tag_map.py has the tags you need. The current tag map just maps to the UD tags like this, so if you don't need the morphological features, you'll just need to check that none are missing (you may have some new combinations of morphological features):

"NOUN__Gender=Masc|Number=Sing": {POS: NOUN},

If you'd like better access to the morphological features (not just as a clunky token.tag_ string), you can expand the mapping to include the features:

"NOUN__Gender=Masc|Number=Sing": {POS: NOUN, 'Gender': 'Masc', 'Number': 'Sing'},

Spacy supports the UD morphological features, so you should be able to do this automatically from a list of the tags in the converted training data. (In the future there should be a statistical morphological tagger, but for now the morphological features are just mapped from the part-of-speech tags.)
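
If it helps, here is a rough helper (not part of spaCy itself) that turns the combined tag strings from the converted data into entries in the format shown above, so you can paste any missing ones into lang/fr/tag_map.py:

def tag_map_entry(combined_tag):
    # "NOUN__Gender=Masc|Number=Sing" -> '"NOUN__...": {POS: NOUN, 'Gender': 'Masc', ...},'
    pos, _, feats = combined_tag.partition("__")
    fields = [f"POS: {pos}"]
    if feats:
        fields += [f"'{k}': '{v}'" for k, v in (f.split("=", 1) for f in feats.split("|"))]
    return '"%s": {%s},' % (combined_tag, ", ".join(fields))

print(tag_map_entry("NOUN__Gender=Masc|Number=Sing"))
# "NOUN__Gender=Masc|Number=Sing": {POS: NOUN, 'Gender': 'Masc', 'Number': 'Sing'},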

BramVanroy commented 4 years ago

As reported here, I encounter some very suspicious parses when using the GPU. Using the CPU version works as expected. Below are some examples: first the source sentence, followed by its tokenisation (which looks alright), then the POS tags (which look off), and finally the DEPs (which seem random/uninitialized):

s = "The decrease in 2008 primarily relates to the decrease in cash and cash equivalents 1.\n"
['The', 'decrease', 'in', '2008', 'primarily', 'relates', 'to', 'the', 'decrease', 'in', 'cash', 'and', 'cash', 'equivalents', '1', '.', '\n']
['VERB', 'PRON', 'PROPN', 'NOUN', 'VERB', 'ADV', 'VERB', 'NUM', 'PRON', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'VERB', 'VERB', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'ROOT', '']

s = "The Company's current liabilities of €32.6 million primarily relate to deferred income from collaborative arrangements and trade payables.\n"
['The Company', "'s", 'current', 'liabilities', 'of', '&', 'euro;32.6', 'million', 'primarily', 'relate', 'to', 'deferred', 'income', 'from', 'collaborative', 'arrangements', 'and', 'trade', 'payables', '.', '\n']
['NOUN', 'VERB', 'AUX', 'NOUN', 'NOUN', 'PROPN', 'PROPN', 'PROPN', 'VERB', 'VERB', 'ADV', 'VERB', 'VERB', 'NOUN', 'NOUN', 'PROPN', 'NOUN', 'PROPN', 'VERB', 'NUM', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'punct', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'ROOT', '']

s = 'The increase in deferred income is related to new deals with partners.\n'
['The', 'increase', 'in', 'deferred', 'income', 'is', 'related', 'to', 'new', 'deals', 'with', 'partners', '.', '\n']
['NOUN', 'PROPN', 'PROPN', 'VERB', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'ADV', 'VERB', 'NOUN', 'VERB', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'punct', 'dep', 'dep', 'ROOT', '']

Example repo with data here. Note that the issue does not seem to occur on Linux but only on Windows and only when using the GPU.

adrianeboyd commented 4 years ago

@BramVanroy I think this is probably a cupy issue. Their disclaimer:

We recommend the following Linux distributions.

Ubuntu 16.04 / 18.04 LTS (64-bit)
CentOS 7 (64-bit)

We are automatically testing CuPy on all the recommended environments above. We cannot guarantee that CuPy works on other environments including Windows and macOS, even if CuPy may seem to be running correctly.

BramVanroy commented 4 years ago

Figured so. Perhaps a (HUGE) disclaimer would be welcome in the docs, then, discouraging people from using the GPU on Windows. If you agree, I can create a pull request.

adrianeboyd commented 4 years ago

I think a warning in the docs would be good. This hasn't come up before, so either it's rare for people to be using a GPU in windows or something changed in thinc/cupy to cause this. Can you see if it does work with slightly older versions of thinc and/or cupy? (We have no way to test this ourselves.)

BramVanroy commented 4 years ago

I think that indeed not a lot of people are using Windows (do you have any stats on this from PyPI, perhaps? Would be interesting to see!), but also I don't think it is very well known that GPU support is available, simply because the CPU performance is so incredibly good. In my projects I use the CPU version and parallelize it, and I never felt like I was missing performance, so I never went looking for GPU support. I only recently bumped into it, tried it out on my home PC, and found out it didn't work.

When I find the time, I can dig into this deeper and try older versions. I think I tried down to v2.0 (which also didn't work) but I'll have to check. (It might be useful to re-open the linked topic so I can keep it updated rather than flooding this topic.)

adrianeboyd commented 4 years ago

I think a new issue focused on windows + GPU would be useful. I didn't mean older versions of spacy, just older versions of cupy and maybe thinc (within the compatibility ranges, of course).

BramVanroy commented 4 years ago

Even though I would definitely like to see full-blown GPU support for Windows, I'm not sure whether this is something that spaCy can fix if the problem lies in cupy. But if requested, I can make a new issue, sure.

adrianeboyd commented 4 years ago

We can't necessarily fix it, but if we (well, you) can figure out that a particular version of cupy works better, we can provide that information in the warning. A new issue could also help people with the same problem when they search in the tracker, since you're right that it's getting pretty off-topic here. (Maybe we can just move all these comments to a new issue?)

umarniz commented 4 years ago

I am having problems validating the accuracy of the nl_core_news_sm when I am trying to run it on the Lassy Small Test Dataset.

I see the model is trained on the same dataset (I am assuming it's just the training data), but the tagging accuracy mentioned in the GitHub release is 90.97%, and when I run the same model on the same dataset, I get an accuracy of 75.12%.

Github: https://github.com/umarniz/spacy-validate-nl-model

Can you confirm if this is the same dataset the Spacy model is tested on?

Secondly, I was unable to find the code that is used to calculate the accuracy numbers attached to the GitHub release. Is there a place where the code used to calculate those numbers is available?

I can imagine that sharing the datasets they are run on might not be legal, but the code could be useful for people who obtain the datasets themselves and want to validate :)

adrianeboyd commented 4 years ago

@umarniz That looks like a mistake in the docs, sorry for the confusion! The NER model is trained on NER annotation done on top of UD_Dutch-LassySmall, but the tagger and parser are trained on UD_Dutch v2.0 (since renamed to UD_Dutch-Alpino).

Spacy calculates the accuracy using the internal scorer. You can run an evaluation on a corpus in spacy's training format using the command-line spacy evaluate model corpus.json, which runs nlp.evaluate() underneath.

You can convert a CoNLL-U corpus to spacy's training format with combined tag+morph tags using the command: spacy convert -n 10 -m file.conllu .

Be aware that the reported tagger evaluation is for token.tag, not token.pos, and is for the UD dev subcorpus rather than the test subcorpus. (It's somewhat confusingly labelled POS on the website, but see the note in the mouseover.)
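
Putting the two commands together, the workflow looks roughly like this (file and model names are placeholders):

spacy convert -n 10 -m nl-ud-dev.conllu .
spacy evaluate nl_core_news_sm nl-ud-dev.json

The convert command writes nl-ud-dev.json next to the input file, and evaluate then runs nlp.evaluate() on it and prints the scorer output.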

ewaldatsensentia commented 4 years ago

More examples of hyphenated words being tagged incorrectly (like in #4974): pre-primary, co-pay, ex-wife, ex-mayor, de-ice

MartinoMensio commented 4 years ago

Verb "to be" is being marked as AUX instead of VERB when it is actually the main verb.

If I use displacy on this sentence "I have been in Wuhan" why do I see the POS "AUX" on been? Isn't it a verb? https://explosion.ai/demos/displacy?text=I%20have%20been%20in%20Wuhan&model=en_core_web_lg&cpu=1&cph=1

It happens on all the pretrained English models

LiberalArtist commented 4 years ago

I hope this is the right place to report some confusing behavior where spaCy 2.2.3 and en_core_web_md 2.2.5 on Python 3.7 seem to produce a different lemma and part-of-speech tag when a noun is capitalized at the beginning of a sentence. I've minimized an example with the word "time", but I have seen what appears to be the same issue with the words "psychoanalysis" and "interpretation", at least. This program:

import en_core_web_md

srcs = ["An historian employs most of these words at one time or another.",
        "Our first task is to understand our own times.",
        "Time is therefore that mediating order.", # problem!
        "Times are changing."]

nlp = en_core_web_md.load()

rslts = [[{"lemma": t.lemma_, "tag": t.tag_}
            for t in doc if "time" == t.norm_[0:4]][0]
         for doc in nlp.pipe(srcs)]

if __name__ == "__main__":
    import sys
    import json
    json.dump(rslts, sys.stdout, indent=1, sort_keys=True)
    sys.stdout.write("\n")

produces the following output:

[
 {
  "lemma": "time",
  "tag": "NN"
 },
 {
  "lemma": "time",
  "tag": "NNS"
 },
 {
  "lemma": "Time",
  "tag": "NNP"
 },
 {
  "lemma": "time",
  "tag": "NNS"
 }
]

I expected the use of "time" in the third sentence ("Time is therefore that mediating order.") to be lematized as "time" and tagged as "NN", consistent with the other examples.

adrianeboyd commented 4 years ago

@LiberalArtist: The v2.2 models use some new data augmentation to try to make them less case-sensitive, which leads to less certainty about NN vs. NNP distinctions, and for "Time" in particular, the training data includes lots of cases of "Time" as NNP, from Time magazine or Time Warner.

adrianeboyd commented 4 years ago

@MartinoMensio: The POS tags are mapped from the fine-grained PTB tag set, which doesn't distinguish auxiliary verbs from main verbs. All verbs get mapped to VERB except for some exceptions below, where everything that might be an AUX gets mapped to AUX:

https://github.com/explosion/spaCy/blob/99a543367dc35b12aad00c4cd845ddd1f4870056/spacy/lang/en/morph_rules.py#L388-L487

This mapping is kind of crude, and we're working on statistical models for morphological tagging and POS tags to replace this.
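
A quick way to see this mapping in action (the exact output depends on the model version, but with the v2.x English models "been" keeps the fine-grained tag VBN while its coarse-grained pos_ comes out as AUX):

import spacy

nlp = spacy.load("en_core_web_lg")  # the pretrained English models behave the same here
doc = nlp("I have been in Wuhan")
print([(t.text, t.tag_, t.pos_) for t in doc])
# e.g. [('I', 'PRP', 'PRON'), ('have', 'VBP', 'AUX'), ('been', 'VBN', 'AUX'),
#       ('in', 'IN', 'ADP'), ('Wuhan', 'NNP', 'PROPN')]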

lingdoc commented 4 years ago

I'm not sure if this is the place to ask, but I'm wondering, given the state of the art for POS tagging as reported by: https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) - is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? I understand their reported accuracy might be on a completely different set of benchmarks, but have they been evaluated on spaCy's benchmarks? Are there licensing restrictions that keep them from being integrated as POS taggers? A parallel might be the vast number of pretrained language/embedding models implemented by HuggingFace's or TensorFlow Hub's repos, many of which are developed by people not directly associated with the repos themselves.

BramVanroy commented 4 years ago

> I'm not sure if this is the place to ask, but I'm wondering [...] is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? [...]

You are right that this is probably not the place to discuss this. You may be interested in the wrappers spacy-stanfordnlp, spacy-udpipe, spacy-transformers, and probably more. Benepar has a plugin for spaCy, too. spaCy has many benefits over SOTA models imo. It's not just a model, it's a whole framework with customization options, the ability to train your own model, and so on. Next to that, the benefit for me is spaCy's speed, which can easily be parallelized on CPU using multiprocessing, and the fact that it is open-source and actively developed. Many if not most models or implementations that achieve (near) SOTA have been developed, benchmarked, and forgotten. That means you can't get any help, there's no community, no bug fixes. Active development is my key concern with all of these SOTA models and frameworks.

lingdoc commented 4 years ago

Very cool! I completely agree that active development is a key concern, along with speed, right up there with accuracy of POS tags. I simply wasn't aware of the spacy-stanfordnlp wrapper or the benepar plugin, will definitely give them a look.

aced125 commented 4 years ago

Hi all - I have been noticing lately that the sentence boundary detection with the default parser on both the small and medium models seems to be a bit more off than what I remember from previous versions:

For example:

tmp_txt = """
He said: “Clearly there is a global demand for personal protective equipment at the moment and I know that government with our support is working night and day to ensure that we procure the PPE that we need.”

Turkey has sent 250,000 items of personal protective equipment to the UK which will now be distributed to medical centres around the country, according to the Ministry of Defence.

A delivery of 50,000 N-95 face masks, 100,000 surgical masks, and 100,000 protective suits arrived at RAF Brize Norton in Oxfordshire on Friday. Turkey has previously donated supplies to fellow Nato members Spain and Italy.

Ben Wallace, UK defence secretary, said the “vital equipment” from Ankara would bring protection and relief to thousands of critical workers across the UK.
"""

import spacy

nlp = spacy.load("en_core_web_sm")  # same behaviour with en_core_web_md
tmp_doc = nlp(tmp_txt)

print([sent for sent in tmp_doc.sents])

Messes up the SBD in the first sentence. Happens on both small and medium models (en_core_web_sm and en_core_web_md).

lingdoc commented 4 years ago

Interesting. It looks like the dependency parser doesn't handle conjoined clauses terribly well with a following 1st person pronoun. This is clearer with the raw text output:

print([x.text for x in tmp_doc.sents])

['\nHe said: “Clearly there is a global demand for personal protective equipment
 at the moment',  'and I know that government with our support is working night
 and day to ensure that we procure the PPE that we need.”\n\n', 'Turkey has sent
 250,000 items of personal protective equipment to the UK which will now be
 distributed to medical centres around the country, according to the Ministry of
 Defence.\n\n', 'A delivery of 50,000 N-95 face masks, 100,000 surgical masks,
 and 100,000 protective suits arrived at RAF Brize Norton in Oxfordshire on Friday.',
 'Turkey has previously donated supplies to fellow Nato members Spain and
 Italy.\n\n', 'Ben Wallace, UK defence secretary, said the “vital equipment” from
 Ankara would bring protection and relief to thousands of critical workers across
 the UK.\n']

In this case the mis-SBD seems to be caused by the second clause in the conjoined sentence starting with the pronoun I, which apparently is interpreted by the parser as the subject of a separate sentence rather than the subject of a conjoined clause. This is clearer from the following toy example:

tmp_txt = """The man had a dog who liked to run and he liked to chase the cat.
The man had a dog who liked to run and I liked to chase him.
The man had a dog who liked to run and I liked to chase the cat."""

tmp_doc = nlp(tmp_txt)

print([x.text for x in tmp_doc.sents])

gives:

['The man had a dog who liked to run and he liked to chase the cat.\n',
 'The man had a dog who liked to run', 'and I liked to chase him.\n',
 'The man had a dog who liked to run', 'and I liked to chase the cat.']

where the second and third sentences get split because of the same pattern, even though sentence 2 has an anaphoric pronoun that refers to an element of the previous clause.

fersarr commented 4 years ago

I have noticed a few weird (incorrect) changes after upgrading from 2.0.18 to 2.2.4. Should I report those here? For example, the sentence "make me a sandwich". I guess this is explainable by assuming it confuses "cake" as a noun vs. "cake" as a verb.

v2.0.18: [screenshot of the parse]

v2.2.4: [screenshot of the parse]

adrianeboyd commented 4 years ago

@fersarr: It's useful to have these kinds of results here, thanks! Imperatives are a case where the provided models often perform terribly because there are very few imperatives in the training data. If you know you have an imperative sentence, it's hacky, but adding a subject like we or you at the beginning of a sentence can improve the analysis a lot. (See some discussion in #4744.)

It would be nice to extend our training data in areas where we know there are problems because most of the models are trained on more formal text like newspaper text, but we don't have any concrete plans in this area yet. (Some common problems are: questions, imperatives, 1st and 2nd person (informal) pronouns, female pronouns, etc.)
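
A small sketch of the workaround (the example sentence and model are just illustrations; results differ between model versions):

import spacy

nlp = spacy.load("en_core_web_sm")

imperative = "Make me a sandwich."
# Hacky trick described above: prepend an explicit subject so the sentence
# looks more like the declarative text the models were trained on.
padded = "You " + imperative[0].lower() + imperative[1:]

for text in (imperative, padded):
    doc = nlp(text)
    print([(t.text, t.tag_, t.dep_, t.head.text) for t in doc])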

fersarr commented 4 years ago

> It's useful to have these kinds of results here, thanks! Imperatives are a case where the provided models often perform terribly [...] adding a subject like we or you at the beginning of a sentence can improve the analysis a lot. (See some discussion in #4744.) [...]

Thanks @adrianeboyd for the link to #4744 and the interesting idea of adding we or you to the imperative. Unfortunately, it didn't change the outcome in this case 😞. I will think of alternatives.