ines opened this issue 5 years ago
Doing an annotation project for pre-trained models with Prodigy looks like a really good idea! Do you have any idea when it could happen and who will be able to participate?
From #3070: English models predict empty strings as tags (confirmed also in nightly).
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("I like London and Berlin")
>>> [(t.tag_, t.pos_) for t in doc]
[('PRP', 'PRON'), ('VBP', 'VERB'), ('', 'SPACE'), ('NNP', 'PROPN'), ('CC', 'CCONJ'), ('NNP', 'PROPN')]
From #2313: Similar problem in French (confirmed also in nightly).
>>> nlp = spacy.load("fr_core_news_sm")
>>> doc = nlp("Nous a-t-elle précisé ceci?")
>>> [x.pos_ for x in doc] # ['PRON', '', 'PART', 'PRON', 'VERB', 'PRON', 'PUNCT']
['PRON', '', 'ADV', 'PRON', 'VERB', 'PRON', 'PUNCT']
>>> doc = nlp("Nous a t-elle précisé ceci?")
>>> [x.pos_ for x in doc] # ['PRON', 'AUX', 'NOUN', 'VERB', 'PRON', 'PUNCT']
['PRON', 'AUX', 'VERB', 'VERB', 'PRON', 'PUNCT']
@mauryaland I hope we can have annotations starting in January. The first data to be annotated will be English and German, with other annotation projects hopefully starting fairly quickly.
We'll probably be hiring annotators to do most of the work. We might do a little bit of "crowd sourcing" as a test, but we mostly believe annotation projects run better with fewer annotators. What we would benefit from is having one person per treebank overseeing the work, communicating with the annotators, and making language-specific annotation policy decisions.
I am trying to upgrade from 2.0.x to 2.1 but am seeing different results for the small English model. It is not meaningful to come to conclusions case by case, but I see decreased accuracy in some POS and dependency tagging.
In general, verbs become nouns, and dependencies are lost or changed. Especially this one: (2.0) tight/VB [advmod] tight/RB vs. (2.1) tight/VB [acomp] tight/RB. To be able to be an acomp, "tight" should be an adjective, I guess.
Should I assume the new models (2.1.0a5) will change when 2.1 is released, or should we not expect changes?
VB > NN
doubles/VB > NNS
From the restaurant, the Seventh's boundary doubles back east along the Pennsylvania Turnpike.
hooking/VB > NN
No hooking up with college kids.
sit tight
tight/VB [advmod] tight/RB > advmod > (2.1) acomp (adjectival complement, but "tight" is an adverb)
As for Russia's sovereign debt, most investors are sitting tight, believing Washington will not bar investors from it, even if the U.S.
speed [compound] skating > no dep (2.1)
Yes, but all the Dutch medals are in speed skating only.
cross [acl] examined > no dep
Mr Goodwin is due to be cross examined on 8 June, the day of the general election.
test [dep] fly > dependency reversed (2.1)
They hope to test-fly their craft at Clow International Airport.
cross [npadvmod] examines
(2.1) dependencies connected over the punctuation "-"
The witness was cross-examined by the defense.
Hi guys !
Just a quick question regarding the missing tags issue mentioned above (from #2313: similar problem in French, confirmed also in nightly): does this come from the models? Are you working on this? In case it can help, I am adding examples with missing tags:
Thank you!
I'm not sure whether this belongs here or in its own issue, but I noticed that the tagger in spacy 2.1 en_core_web_md (2.1.0) seems to have some major problems.
I ran a quick evaluation on the PTB tags in UD_English-PUD with the following results (without normalizing punctuation tags, so the actual results would be a bit higher):
Model Tagging Acc.
------------------
sm 0.945
md 0.792
lg 0.952
The performance is similar for UD tags and for other corpora. With spacy 2.0, the results for all three models are similar.
I suspect these problems are what led to this hacky modification to a model test case, which now doesn't catch the error it's supposed to catch:
Below are simplified confusion matrices for the more frequent non-punctuation tags for md vs lg, where you can see that something has gone wrong in the md model (sm looks similar to lg). I was hoping to see a clear pattern that explained the errors (like two consistently swapped tags), but it's so all over the place that my first guess would be that there was an offset error for some portion of the training data.
The model (at least "en_core_web_sm") fails in prediction whenever capitalization is not used correctly. For example, compare the predictions for "j.k. rowling wishes snape happy birthday in the most magical way" and "J.K. Rowling Wishes Snape Happy Birthday In The Most Magical Way": https://puu.sh/DiNTh/d3b940ef65.png
The first has "rowling" tagged as a verb, despite "wishes" being the verb. The second tends too easily to assign NNP and "ROOT". The best version would be "J.K. Rowling wishes Snape happy birthday in the most magical way", which still gives "Snape" the entity type GPE.
These kinds of errors are constant whenever a supposedly capitalized name (e.g. "United States") isn't capitalized, or a supposedly non-capitalized word is capitalized. This causes problems when applying the model to, for example, headlines (where every word's first letter is capitalized).
In [6]: nlp('acetaminophen')[0].tag, nlp('acetaminophen')[0].pos
Out[6]: ('UH', 'INTJ')
Hi, Lemma for "multiplies" should be "multiply" right?
(Pdb) tmp = nlp(u"A rabbit multiplies rapidly by having lots of sex.")
(Pdb) tmp
A rabbit multiplies rapidly by having lots of sex.
(Pdb) [token.lemma_ for token in tmp]
[u'a', u'rabbit', u'multiplie', u'rapidly', u'by', u'have', u'lot', u'of', u'sex', u'.']
Sentence tokenisation issue? I thought the nonstandard lexis might be the cause, but normalising it still gives pretty unusual sentence tokenisation:
>>> import spacy
>>> nlp = spacy.load("en")
>>> s = 'Me and you are gonna have a talk. \nSez who? \nSez me. \nHey! What did I say?'
>>> doc = nlp(s)
>>> for sent_index, sent in enumerate(doc.sents, start=1):
... print(sent_index, sent.text)
1 Me and you are gonna have a talk.
2 Sez
3 who?
Sez me.
4 Hey!
5 What did I say?
>>> s = 'Me and you are gonna have a talk. \nSays who? \nSays me. \nHey! What did I say?'
>>> doc = nlp(s)
>>> for sent_index, sent in enumerate(doc.sents, start=1):
... print(sent_index, sent.text)
1 Me and you are gonna have a talk.
2 Says who?
Says me.
3 Hey!
4 What did I say?
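Until the parser handles these cases better, one possible workaround is to force sentence starts after newline tokens with a tiny custom component. This is only a sketch, shown with the v3-style component registration (in v2 you would pass the function to nlp.add_pipe directly), and the component name "newline_boundaries" is made up:

```python
import spacy
from spacy.language import Language

@Language.component("newline_boundaries")
def newline_boundaries(doc):
    # Force a sentence start on the token following any newline token,
    # so stray "\n" tokens can't glue two sentences together.
    for i, token in enumerate(doc[:-1]):
        if "\n" in token.text:
            doc[i + 1].is_sent_start = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("newline_boundaries")
doc = nlp("Sez who? \nSez me.")
sentences = [sent.text for sent in doc.sents]
```

With a full pipeline you would add the component before the parser so its boundaries are respected.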
Another incorrect lemma_
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("the Greys")
[token.lemma_ for token in doc]
# ['the', 'Greys']
"Greys" should be "Grey"
another lemma_ conflict
import spacy
nlp = spacy.load('en_core_web_lg')
[token.lemma_ for token in nlp("to be flattered by sth")]
# ['to', 'be', 'flatter', 'by', 'sth'] # correct
[token.lemma_ for token in nlp("to feel flattered that")]
# ['to', 'feel', 'flattered', 'that'] # error
The second "flattered" should be "flatter"
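Until the lookup table is fixed, one illustrative workaround is to patch known-bad lemmas after the pipeline runs, since token.lemma_ is writable. The LEMMA_FIXES table and helper name here are made up for the example:

```python
import spacy

# Illustrative patch table, not a fix for the underlying lookup data.
LEMMA_FIXES = {"flattered": "flatter"}

def fix_lemmas(doc):
    # token.lemma_ is a writable attribute, so known-bad entries
    # can be corrected in a post-processing step.
    for token in doc:
        fix = LEMMA_FIXES.get(token.text.lower())
        if fix is not None:
            token.lemma_ = fix
    return doc

nlp = spacy.blank("en")
doc = fix_lemmas(nlp("to feel flattered that"))
```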
I've noticed the tokenization and entity recognition around compact numbers (e.g., 10k, 20M) can be a bit of a mixed bag. Here's a tiny snippet:
from pprint import pprint
from spacy.matcher import Matcher
import en_core_web_sm
import spacy
print(f"Spacy version: {spacy.__version__}")
# Spacy version: 2.1.4
nlp = en_core_web_sm.load()
doc = nlp("Compact number formatting: 5k 5K 1m 1M")
pprint([(t.text, t.ent_type_) for t in doc])
# [('Compact', ''),
# ('number', ''),
# ('formatting', ''),
# (':', ''),
# ('5k', 'CARDINAL'),
# ('5', 'CARDINAL'),
# ('K', 'ORG'),
# ('1', 'ORG'),
# ('m', ''),
# ('1', 'CARDINAL'),
# ('M', '')]
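One possible workaround for the splitting half of this, sketched here under the assumption that tokenizer special cases per exact string are acceptable (they don't generalize, so this is illustrative only), is to keep compact numbers together and then find them with a single token-level REGEX pattern:

```python
import spacy
from spacy.attrs import ORTH
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# The default tokenizer splits e.g. "5K" into "5" + "K"; a special case
# (one per exact string, hence only a sketch) keeps each together.
for text in ("5k", "5K", "1m", "1M"):
    nlp.tokenizer.add_special_case(text, [{ORTH: text}])

# A single token-level REGEX pattern then matches all compact numbers.
matcher = Matcher(nlp.vocab)
matcher.add("COMPACT_NUM", [[{"TEXT": {"REGEX": r"^\d+(\.\d+)?[kKmMbB]$"}}]])

doc = nlp("Compact number formatting: 5k 5K 1m 1M")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
```

A more general fix would adjust the tokenizer's suffix rules rather than enumerate strings.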
Small English model 2.1 assigns pos_= VERB to "cat". Medium model works fine. Small and medium models 2.0 work fine, too. I thought I'd report this despite being a single word inaccuracy, since the example is from https://course.spacy.io/ from chapter1_03_rule-based-matching.md Section "Matching other token attributes"
import spacy
# Import the Matcher
from spacy.matcher import Matcher
# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
pattern = [
{'LEMMA': 'love', 'POS': 'VERB'},
{'POS': 'NOUN'}
]
matcher.add('loving', None, pattern)
# Process some text
doc = nlp("I loved dogs but now I love cats more.")
for token in doc:
print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
I -PRON- PRON nsubj loved
loved love VERB ROOT loved
dogs dog NOUN dobj loved
but but CCONJ cc loved
now now ADV advmod love
I -PRON- PRON nsubj love
love love VERB conj loved
cats cat VERB dobj love
more more ADV advmod love
. . PUNCT punct love
Another issue where there is also a problem with the small English model 2.1, but not with medium model is related to https://github.com/explosion/spaCy/issues/3305.
import spacy
# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')
# Import the Doc and Span classes
from spacy.tokens import Doc, Span
# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]
# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
# Create a span manually
span = Span(doc, 0, 2)
# Create a span with a label
label = nlp.vocab.strings['GREETING']
print (label)
span_with_label = Span(doc, 0, 2, label=label)
# Add span to the doc.ents
doc.ents = [span_with_label]
12946562419758953770
ValueError                                Traceback (most recent call last)
     21 print (label)
     22
---> 23 span_with_label = Span(doc, 0, 2, label=label)
     24
     25 # Add span to the doc.ents
span.pyx in spacy.tokens.span.Span.__cinit__()
ValueError: [E084] Error assigning label ID 12946562419758953770 to span: not in StringStore.
@Gnuelp Sorry for the late response here. That's a slightly tricky case where nlp.vocab.strings differentiates between just looking up a string vs. adding a new string to the StringStore. It will work with:
label = nlp.vocab.strings.add('GREETING')
As of spacy 2.1 this is simplified and you can just specify the label as a string in the Span:
span_with_label = Span(doc, 0, 2, label='GREETING')
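For completeness, a self-contained version of the fixed example, sketched with a blank pipeline so no model download is needed:

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Since spacy 2.1, Span accepts the label directly as a string;
# the string is added to the StringStore automatically.
span_with_label = Span(doc, 0, 2, label="GREETING")
doc.ents = [span_with_label]
```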
When I updated to 2.2, I noticed that the pre-trained POS tagger started automatically tagging all sentence-initial nouns as PROPN instead of NOUN as it did before. This is a good heuristic for some datasets, but it threw off my pipeline (my data contains mostly bare noun labels like "man with a newspaper" or "spy", where this behavior was unexpected).
@hawkrobe Interesting observation! This is an unexpected side effect of trying to make the models less sensitive to capitalization overall, with the intention of improving performance on data that doesn't look like formally edited newspaper text. The main difference from the 2.1 models is that some training sentences are randomly lowercased. Since your data is not really the kind of data spacy's models are intended for, I don't think it makes sense to try to optimize the general-purpose models for this case (although it's still very useful to be made aware of these kinds of changes in behavior!).
One possible workaround is to use a framing sentence that you can insert your phrases into that looks more like newspaper text. Something like:
"The president saw the [bare NP] yesterday."
Then you are much more likely to get the correct analysis and you can extract the annotation that you need.
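The framing workaround can be sketched as follows; the frame text and helper name are made up for illustration. The idea is to wrap the bare NP in a newspaper-like sentence, remember its character span, and read the analysis back off that span:

```python
# Hypothetical frame; any newspaper-like carrier sentence works.
FRAME_PREFIX = "The president saw the "
FRAME_SUFFIX = " yesterday."

def frame_phrase(phrase):
    """Return (framed_sentence, (start, end)) char offsets of the phrase."""
    start = len(FRAME_PREFIX)
    return FRAME_PREFIX + phrase + FRAME_SUFFIX, (start, start + len(phrase))

framed, (start, end) = frame_phrase("man with a newspaper")
# With a loaded model you would then do something like:
#   doc = nlp(framed)
#   np_span = doc.char_span(start, end)  # tokens of the original bare NP
```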
@adrianeboyd aha, I appreciate the info about the changes that led to this side effect, and I love the idea of using the framing sentence to bring the bare NPs closer to in-sample text.
Hi,
I have noticed some inconsistencies with the lemmatizer and stop words for the Italian model. I don't know if this is the best place to report it, but please forgive me; I am not very expert in ML, DL, models and so on, and I am learning. In particular, I am not sure whether lemmatization is performed by the pretrained model or by something else. Currently I am following the tutorial "Classify text using spaCy" to test and understand what spaCy is capable of.
I can summarize my issue with some code.
import spacy
from spacy.lang.it.stop_words import STOP_WORDS
nlp = spacy.load('it_core_news_sm')
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd, before="parser")
stop_words = STOP_WORDS
doc = nlp('Ciao marco, ieri ero sul lavoro per questo non ti risposi!')
# Create list of word tokens
token_list = []
for token in doc:
token_list.append(token.text)
print(token_list)
# create list of sentence tokens
sents_list = []
for sent in doc.sents:
sents_list.append(sent.text)
print(sents_list)
filtered_words = []
for word in doc:
if word.is_stop is False:
filtered_words.append(word)
print('filtered_words:', filtered_words)
# finding lemma for each word
for word in doc:
print(word.text, word.lemma_)
This is the result.
['Ciao', 'marco', ',', 'ieri', 'ero', 'sul', 'lavoro', 'per', 'questo', 'non', 'ti', 'risposi', '!']
['Ciao marco, ieri ero sul lavoro per questo non ti risposi!']
filtered_words: [Ciao, marco, ,, risposi, !]
Ciao Ciao
marco marcare
, ,
ieri ieri
ero essere
sul sul
lavoro lavorare
per per
questo questo
non non
ti ti
risposi risposare
! !
First of all, a translation of the text. It means: "Hello marco, yesterday I was at work, for this reason I didn't answer you!".
Then I can say that the tokenization is correct.
The subdivision into sentences is not very important to me, but I think it is correct, even if I am not sure whether "sentence" is synonymous with "period". If yes, it's correct; if not, and the "sentencizer" is supposed to also split the parts of a "period", it is wrong.
True problems come with the stop words and the lemmatizer.
I am not sure why the stop words include, in particular, "lavoro" ("work") and possibly "ieri" ("yesterday"). Can any topic detector extract any valuable meaning from only [Ciao, marco, ,, risposi, !]? That is "Hello", the name "Marco" and the verb "to answer"...
Finally, lemmatization gives probably the worst results.
"Marco" is a name; "marcare" is a verb and means "to mark". And "risposi" means "I answered"; it doesn't lemmatize to "risposare" but to "rispondere". "risposare" means "to marry again".
If I can somehow fix such errors myself, let me know. I would really like to do it if I can. Otherwise, I hope this can be of some help.
Any explanation or help is welcome; I am quite ignorant about this kind of software at the moment.
Thank you
@endersaka: This is the correct place, thanks!
For languages where we don't have a better rule-based lemmatizer, spacy uses lemma lookup tables, which can be better than not having a lemmatizer, but they aren't great. There are some errors in the tables and the results are often not so great because it can't provide multiple lemmas for words like "risposi" or disambiguate based on POS tags. It looks like "risposi" is ambiguous (if you don't have any other context), so the simple lookup table is never going to handle this word correctly in all cases. It looks like "Marco" returns "Marco" and "marco" returns "marcare".
The one advantage of simple lookup tables is that they are easy to understand and modify. The table is here: https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/it_lemma_lookup.json
The stop words, which look like they've been unmodified since 2016, are here: https://github.com/explosion/spaCy/blob/master/spacy/lang/it/stop_words.py
Stop words are often pretty task-specific, so it's hard to provide a perfect list for everyone. Here's how to modify the stop words: https://stackoverflow.com/a/51627002/461847
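As a minimal sketch of that Stack Overflow approach (assuming, as the output above suggests, that "lavoro" ships as a stop word in spacy's Italian list): the is_stop flag is writable per lexeme, so a task-specific override is a one-liner, and a blank pipeline is enough to demonstrate it:

```python
import spacy

nlp = spacy.blank("it")  # a blank pipeline is enough for stop-word flags

# Override the shipped flag for this lexeme; the change applies to
# every future doc processed with this vocab.
nlp.vocab["lavoro"].is_stop = False

doc = nlp("ieri ero sul lavoro")
kept = [t.text for t in doc if not t.is_stop]
```

Adding a stop word works the same way with is_stop = True.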
We're always happy to get pull requests or other updates from people with more expertise in a particular language! If you potentially have a better source for Italian stop words, let us know...
I've opened an issue before I saw that this thread existed. You can find it here: #4611
@adrianeboyd sincerely thanks for the explanation. Currently I am taking a look at the documentation on spaCy Web site and more precisely to the Adding Languages page to understand the architecture.
About "risposi", yes! I confirm, it's a very special case. The verb token in the character sequence "non ti risposi" can have two different meanings in two different tenses, depending on the context. In my example the verb means "to answer", conjugated in the simple past tense (which in Italian is "passato remoto", literally "remote past"); "ti", of course, is the pronoun "you", the object of the sentence receiving the action; "non" is the negation. The lemmatizer instead picked the other meaning, in which the verb is reflexive with the help of the pronoun "ti" (also called a pronominal verb). In English this reflexive form would literally mean, to be as precise as possible, "you didn't marry yourself once again" :-D
Apart from this funny and interesting matter, I will read the spaCy documentation carefully and try to understand how the lemmatizer works, to possibly propose some modifications in the future.
@endersaka spaCy models are not caseless. The same happens with the English model: basically it won't analyze a phrase with wrong casing well. Either train another model on a caseless dataset or try a truecaser if your dataset has wrong casing.
@killia15 The problem is that there are not any occurrences of the word "tu" in the training data that we're using for French, which comes from an older release of this corpus: https://github.com/UniversalDependencies/UD_French-Sequoia/ . The sources are (according to their README): Europarl, Est Republicain newspaper, French Wikipedia and European Medicine Agency.
This is a relatively common problem in corpora that are based on formally edited texts like newspapers or encyclopedia-style texts like wikipedia. I see "vous" but not "tu". This is clearly not great!
Hopefully in the future we can train models with more/better data that won't have this problem. The newest version of the UD GSD corpora, which was released last week, have dropped the non-commercial restriction, so that can potentially provide more data for French (and a few other languages). If you're interested in training a model for French with UD_French-GSD to use now, I can provide a sketch of how to convert and train, which is pretty easy with spacy's CLI commands.
@adrianeboyd That would certainly explain it! Though great timing with the UD GSD corpora. I would be very interested in training the model. Let me know how I can help. My goal for spacy is to use it for a project where we’re automatically analyzing French texts (news articles, blog posts, poems, passages from books etc) to predict which vocabulary and grammar structures a student will and won’t know so we can make recommendations to their instructor on what they should be working on. Our goal is to publish a paper on it so we’re certainly invested in the success of Spacy’s French model!
@killia15:
The current spacy release doesn't handle subtokens in a particularly good way (you only see the subtoken strings like "de les" rather than "des"), but you can convert the data and train a model like this:
spacy convert -n 10 -m train.conllu .
spacy convert -n 10 -m dev.conllu .
spacy train fr output_dir train.json dev.json -p tagger,parser
After converting the data, you'll have tags that look like this:
NOUN__Gender=Masc|Number=Sing
After converting and before training, make sure the current lang/fr/tag_map.py has the tags you need. The current tag map just maps to the UD tags like this, so if you don't need the morphological features, you'll just need to check that none are missing (you may have some new combinations of morphological features):
"NOUN__Gender=Masc|Number=Sing": {POS: NOUN},
If you'd like better access to the morphological features (not just as a clunky token.tag_ string), you can expand the mapping to include the features:
"NOUN__Gender=Masc|Number=Sing": {POS: NOUN, 'Gender': 'Masc', 'Number': 'Sing'},
Spacy supports the UD morphological features, so you should be able to do this automatically from a list of the tags in the converted training data. (In the future there should be a statistical morphological tagger, but for now the morphological features are just mapped from the part-of-speech tags.)
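Since the combined tags are just "POS__Feat=Val|Feat=Val" strings, generating the expanded entries from a list of tags can indeed be automated. A minimal sketch (the helper name is made up, and it uses the string key "POS" where spacy's tag maps use the POS symbol):

```python
def expand_tag(tag):
    """Split a combined 'POS__Feat=Val|Feat=Val' tag into a tag-map entry."""
    pos, _, feats = tag.partition("__")
    entry = {"POS": pos}
    # filter(None, ...) drops the empty string when there are no features.
    for feat in filter(None, feats.split("|")):
        name, _, value = feat.partition("=")
        entry[name] = value
    return entry

expand_tag("NOUN__Gender=Masc|Number=Sing")
# {'POS': 'NOUN', 'Gender': 'Masc', 'Number': 'Sing'}
```

Running this over the set of tags found in the converted training data gives the full expanded tag map.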
As reported here, I encounter some very suspicious parses when using the GPU. Using the CPU version works as expected. Below are some examples: first the source sentence, followed by its tokenisation (which looks alright), then the POS tags (which look off), and finally the dependencies (which seem random/uninitialized):
s = "The decrease in 2008 primarily relates to the decrease in cash and cash equivalents 1.\n"
['The', 'decrease', 'in', '2008', 'primarily', 'relates', 'to', 'the', 'decrease', 'in', 'cash', 'and', 'cash', 'equivalents', '1', '.', '\n']
['VERB', 'PRON', 'PROPN', 'NOUN', 'VERB', 'ADV', 'VERB', 'NUM', 'PRON', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'VERB', 'VERB', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'ROOT', '']
s = "The Company's current liabilities of €32.6 million primarily relate to deferred income from collaborative arrangements and trade payables.\n"
['The Company', "'s", 'current', 'liabilities', 'of', '&', 'euro;32.6', 'million', 'primarily', 'relate', 'to', 'deferred', 'income', 'from', 'collaborative', 'arrangements', 'and', 'trade', 'payables', '.', '\n']
['NOUN', 'VERB', 'AUX', 'NOUN', 'NOUN', 'PROPN', 'PROPN', 'PROPN', 'VERB', 'VERB', 'ADV', 'VERB', 'VERB', 'NOUN', 'NOUN', 'PROPN', 'NOUN', 'PROPN', 'VERB', 'NUM', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'punct', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'ROOT', '']
s = 'The increase in deferred income is related to new deals with partners.\n'
['The', 'increase', 'in', 'deferred', 'income', 'is', 'related', 'to', 'new', 'deals', 'with', 'partners', '.', '\n']
['NOUN', 'PROPN', 'PROPN', 'VERB', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'ADV', 'VERB', 'NOUN', 'VERB', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'punct', 'dep', 'dep', 'ROOT', '']
Example repo with data here. Note that the issue does not seem to occur on Linux but only on Windows and only when using the GPU.
@BramVanroy I think this is probably a cupy issue. Their disclaimer:
We recommend the following Linux distributions.
Ubuntu 16.04 / 18.04 LTS (64-bit) CentOS 7 (64-bit)
We are automatically testing CuPy on all the recommended environments above. We cannot guarantee that CuPy works on other environments including Windows and macOS, even if CuPy may seem to be running correctly.
I figured as much. Perhaps a (HUGE) disclaimer would be welcome in the docs, then, discouraging people from using the GPU on Windows. If you agree, I can create a pull request.
I think a warning in the docs would be good. This hasn't come up before, so either it's rare for people to be using a GPU in windows or something changed in thinc/cupy to cause this. Can you see if it does work with slightly older versions of thinc and/or cupy? (We have no way to test this ourselves.)
I think that indeed not a lot of people are using Windows (do you have any stats on this from PyPI, perhaps? Would be interesting to see!), but I also don't think it is very well known that GPU support is available, simply because the CPU performance is so incredibly good. In my projects I use the CPU version and parallelize it, and I never felt like I was missing performance, so I never went looking for GPU support. I only recently bumped into it, tried it out on my home PC, and found out it didn't work.
When I find the time, I can dig into this deeper and try older versions. I think I tried down to v2.0 (which also didn't work) but I'll have to check. (It might be useful to re-open the linked topic so I can keep it updated rather than flooding this topic.)
I think a new issue focused on windows + GPU would be useful. I didn't mean older versions of spacy, just older versions of cupy and maybe thinc (within the compatibility ranges, of course).
Even though I would definitely like to see full blown GPU support for Windows, I'm not sure whether this is something that spaCy can fix if the problem lies in cupy? But if requested I can make a new issue, sure.
We can't necessarily fix it, but if we (well, you) can figure out that a particular version of cupy works better, we can provide that information in the warning. A new issue could also help people with the same problem when they search in the tracker, since you're right that it's getting pretty off-topic here. (Maybe we can just move all these comments to a new issue?)
I am having problems validating the accuracy of nl_core_news_sm when I try to run it on the Lassy Small test dataset.
I see the model is trained on the same dataset, and I am assuming it's just the training data, but the tagging accuracy mentioned in the GitHub release is 90.97%, while when I run the same model on the same dataset, I get an accuracy of 75.12%.
Github: https://github.com/umarniz/spacy-validate-nl-model
Can you confirm if this is the same dataset the Spacy model is tested on?
Secondly, I was unable to find the code that is used to calculate the accuracy number that are attached in the Github Release. Is there a place where the code used to calculate those numbers is available?
I can imagine that sharing the datasets they are run on might not be legal, but the code could be useful for people who obtain the dataset themselves and want to validate :)
@umarniz That looks like a mistake in the docs, sorry for the confusion! The NER model is trained on NER annotation done on top of UD_Dutch-LassySmall, but the tagger and parser are trained on UD_Dutch v2.0 (since renamed to UD_Dutch-Alpino).
Spacy calculates the accuracy using the internal scorer. You can run an evaluation on a corpus in spacy's training format using the command line: spacy evaluate model corpus.json, which runs nlp.evaluate() underneath.
You can convert a CoNLL-U corpus to spacy's training format with combined tag+morph tags using the command: spacy convert -n 10 -m file.conllu .
Be aware that the reported tagger evaluation is for token.tag, not token.pos, and is for the UD dev subcorpus rather than the test subcorpus. (It's somewhat confusingly labelled POS on the website, but see the note in the mouseover.)
More examples of hyphenated words being tagged incorrectly (like in #4974): pre-primary, co-pay, ex-wife, ex-mayor, de-ice
The verb "to be" is marked as AUX instead of VERB even when it is the main verb.
If I use displaCy on the sentence "I have been in Wuhan", why do I see the POS "AUX" on "been"? Isn't it a verb? https://explosion.ai/demos/displacy?text=I%20have%20been%20in%20Wuhan&model=en_core_web_lg&cpu=1&cph=1
It happens on all the pretrained English models
I hope this is the right place to report some confusing behavior where spaCy 2.2.3 and en_core_web_md 2.2.5 on Python 3.7 seem to produce a different lemma and part-of-speech tag when a noun is capitalized at the beginning of a sentence. I've minimized an example with the word "time", but I have seen what appears to be the same issue with the words "psychoanalysis" and "interpretation", at least. This program:
import en_core_web_md
srcs = ["An historian employs most of these words at one time or another.",
"Our first task is to understand our own times.",
"Time is therefore that mediating order.", # problem!
"Times are changing."]
nlp = en_core_web_md.load()
rslts = [[{"lemma": t.lemma_, "tag": t.tag_}
for t in doc if "time" == t.norm_[0:4]][0]
for doc in nlp.pipe(srcs)]
if __name__ == "__main__":
import sys
import json
json.dump(rslts, sys.stdout, indent=1, sort_keys=True)
sys.stdout.write("\n")
produces the following output:
[
{
"lemma": "time",
"tag": "NN"
},
{
"lemma": "time",
"tag": "NNS"
},
{
"lemma": "Time",
"tag": "NNP"
},
{
"lemma": "time",
"tag": "NNS"
}
]
I expected the use of "time" in the third sentence ("Time is therefore that mediating order.") to be lemmatized as "time" and tagged as "NN", consistent with the other examples.
@LiberalArtist: The v2.2 models use some new data augmentation to try to make them less case-sensitive, which leads to less certainty about the NN vs. NNP distinction, and for "Time" in particular, the training data includes lots of cases of "Time" as NNP from the magazine Time or from Time Warner.
@MartinoMensio: The POS tags are mapped from the fine-grained PTB tag set, which doesn't distinguish auxiliary verbs from main verbs. All verbs get mapped to VERB, except for some exceptions below, where everything that might be an AUX gets mapped to AUX:
This mapping is kind of crude, and we're working on statistical models for morphological tagging and POS tags to replace this.
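In the meantime, if downstream code was written against the old behavior where auxiliaries came out as VERB, a post-hoc normalization of the coarse tags is trivial. A sketch over plain tag strings, independent of the spaCy version (the helper name is made up):

```python
def normalize_pos(pos_tags):
    """Collapse AUX back into VERB for code written against the old mapping."""
    return ["VERB" if pos == "AUX" else pos for pos in pos_tags]

# e.g. applied to coarse tags like those for "I have been in Wuhan":
normalize_pos(["PRON", "AUX", "AUX", "ADP", "PROPN"])
# ['PRON', 'VERB', 'VERB', 'ADP', 'PROPN']
```

Equivalently, checks like token.pos_ in ("VERB", "AUX") avoid the mapping issue entirely.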
I'm not sure if this is the place to ask, but I'm wondering, given the state of the art for POS tagging as reported by: https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) - is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? I understand their reported accuracy might be on a completely different set of benchmarks, but have they been evaluated on spaCy's benchmarks? Are there licensing restrictions that keep them from being integrated as POS taggers? A parallel might be the vast number of pretrained language/embedding models implemented by HuggingFace's or TensorFlow Hub's repos, many of which are developed by people not directly associated with the repos themselves.
You are right that this is probably not the place to discuss this. You may be interested in the wrappers spacy-stanfordnlp, spacy-udpipe, spacy-transformers, and probably more. Benepar has a plugin for spaCy, too. SpaCy has many benefits over SOTA models imo. It's not just a model, it's a whole framework with customization options, the ability to train your own model, and so on. Next to that, the benefit for me is spaCy's speed, which can easily be scaled on CPU using multiprocessing, and the fact that it is open-source and actively developed. Many if not most models or implementations that achieve (near) SOTA have been developed, benchmarked, and forgotten. That means you can't get any help: no community, no bug fixes. Active development is my key concern with all of these SOTA models and frameworks.
Very cool! I completely agree that active development is a key concern, along with speed, right up there with accuracy of POS tags. I simply wasn't aware of the spacy-stanfordnlp wrapper or the benepar plugin, will definitely give them a look.
Hi all - have been noticing lately the sentence boundary detection with the default parser on both small and medium models seems to be a bit more off than what I remembered in previous versions:
For example:
tmp_txt = """
He said: “Clearly there is a global demand for personal protective equipment at the moment and I know that government with our support is working night and day to ensure that we procure the PPE that we need.”
Turkey has sent 250,000 items of personal protective equipment to the UK which will now be distributed to medical centres around the country, according to the Ministry of Defence.
A delivery of 50,000 N-95 face masks, 100,000 surgical masks, and 100,000 protective suits arrived at RAF Brize Norton in Oxfordshire on Friday. Turkey has previously donated supplies to fellow Nato members Spain and Italy.
Ben Wallace, UK defence secretary, said the “vital equipment” from Ankara would bring protection and relief to thousands of critical workers across the UK.
"""
tmp_doc = nlp(tmp_txt)
print([sent for sent in tmp_doc.sents])
Messes up the SBD in the first sentence. Happens on both small and medium models (en_core_web_sm and en_core_web_md).
Interesting. It looks like the dependency parser doesn't handle conjoined clauses terribly well with a following 1st person pronoun. This is clearer with the raw text output:
print([x.text for x in tmp_doc.sents])
['\nHe said: “Clearly there is a global demand for personal protective equipment
at the moment', 'and I know that government with our support is working night
and day to ensure that we procure the PPE that we need.”\n\n', 'Turkey has sent
250,000 items of personal protective equipment to the UK which will now be
distributed to medical centres around the country, according to the Ministry of
Defence.\n\n', 'A delivery of 50,000 N-95 face masks, 100,000 surgical masks,
and 100,000 protective suits arrived at RAF Brize Norton in Oxfordshire on Friday.',
'Turkey has previously donated supplies to fellow Nato members Spain and
Italy.\n\n', 'Ben Wallace, UK defence secretary, said the “vital equipment” from
Ankara would bring protection and relief to thousands of critical workers across
the UK.\n']
In this case the mis-SBD seems to be caused by the second clause in the conjoined sentence starting with the pronoun "I", which apparently is interpreted by the parser as the subject of a separate sentence rather than the subject of a conjoined clause. This is clearer from the following toy example:
tmp_txt = """The man had a dog who liked to run and he liked to chase the cat.
The man had a dog who liked to run and I liked to chase him.
The man had a dog who liked to run and I liked to chase the cat."""
tmp_doc = nlp(tmp_txt)
print([x.text for x in tmp_doc.sents])
gives:
['The man had a dog who liked to run and he liked to chase the cat.\n',
'The man had a dog who liked to run', 'and I liked to chase him.\n',
'The man had a dog who liked to run', 'and I liked to chase the cat.']
where the second and third sentences get split because of the same pattern, even though sentence 2 has an anaphoric pronoun that refers to an element of the previous clause.
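As a stopgap while the parser mis-splits these conjoined clauses, one can post-process the sentence list and merge fragments that begin with a lowercase coordinating conjunction back onto the previous sentence. This is only a rough string-level sketch of the idea, not a fix for the parse itself; the more idiomatic spaCy route would be a custom pipeline component that sets token.is_sent_start before the parser runs:

```python
def merge_conjunction_splits(sents):
    """Merge sentence fragments that begin with a lowercase coordinating
    conjunction ('and', 'but', 'or') back onto the previous sentence."""
    merged = []
    for sent in sents:
        stripped = sent.lstrip()
        first = stripped.split(" ", 1)[0].lower() if stripped else ""
        if merged and first in {"and", "but", "or"} and stripped[0].islower():
            merged[-1] = merged[-1].rstrip() + " " + stripped
        else:
            merged.append(sent)
    return merged
```

Applied to the toy example above, this would glue "and I liked to chase him." back onto the preceding fragment; a capitalised "And" is left alone, since that usually does start a new sentence.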
I have noticed a few weird (incorrect) changes after upgrading from 2.0.18 to 2.2.4. Should I report those here? For example, the sentence "make me a sandwich". I guess this is explainable by assuming it confuses "cake" as a noun vs "cake" as a verb.
v2.0.18
v2.2.4
@fersarr: It's useful to have these kinds of results here, thanks! Imperatives are a case where the provided models often perform terribly because there are very few imperatives in the training data. If you know you have an imperative sentence, it's hacky, but adding a subject like "we" or "you" at the beginning of a sentence can improve the analysis a lot. (See some discussion in #4744.)
It would be nice to extend our training data in areas where we know there are problems because most of the models are trained on more formal text like newspaper text, but we don't have any concrete plans in this area yet. (Some common problems are: questions, imperatives, 1st and 2nd person (informal) pronouns, female pronouns, etc.)
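The subject-prefix hack above can be wrapped in a small helper. The string manipulation is pure Python and illustrative only; actually parsing the rewritten text (shown in the comments) assumes a loaded English pipeline:

```python
def with_dummy_subject(text, subject="You"):
    """Prepend a dummy subject to an imperative sentence so the parser
    sees a canonical subject-verb clause,
    e.g. 'Make me a sandwich.' -> 'You make me a sandwich.'"""
    text = text.lstrip()
    return f"{subject} {text[0].lower()}{text[1:]}"

# Hypothetical usage with a loaded spaCy pipeline:
# doc = nlp(with_dummy_subject("Make me a sandwich."))
# Note the extra token shifts indices by one, so token i in the rewritten
# doc corresponds to token i - 1 in the original sentence.
```

As the follow-up comment below notes, this hack does not help in every case; it only nudges the parser toward a subject-verb reading.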
Thanks @adrianeboyd for the link to #4744 and the interesting idea to add "we" or "you" to the imperative. Unfortunately, it didn't change the outcome in this case 😞. I will think of alternatives.
This thread is a master thread for collecting problems and reports related to incorrect and/or problematic predictions of the pre-trained models.
Why a master thread instead of separate issues?
GitHub now supports pinned issues, which lets us create master threads more easily without them getting buried.
Users often report issues that come down to incorrect predictions made by the pre-trained statistical models. Those are all good and valid, and can include very useful test cases. However, having a lot of open issues around minor incorrect predictions across various languages also makes it more difficult to keep track of the reports. Unlike bug reports, they're much more difficult to act on. Sometimes, mistakes a model makes can indicate deeper problems that occurred during training or when preprocessing the data. Sometimes they can give us ideas for how to use data augmentation to make the models less sensitive to very small variations like punctuation or capitalisation.
Other times, it's just something we have to accept. A model that's 90% accurate will make a mistake on every 10th prediction. A model that's 99% accurate will be wrong once every 100 predictions.
The main reason we distribute pre-trained models is that it makes it easier for users to build their own systems by fine-tuning pre-trained models on their data. Of course, we want them to be as good as possible, and we're always optimising for the best compromise of speed, size and accuracy. But we won't be able to ship pre-trained models that are always correct on all data ever.
For many languages, we're also limited by the resources available, especially when it comes to data for named entity recognition. We've already made substantial investments into licensing training corpora, and we'll continue doing so (including running our own annotation projects with Prodigy ✨) – but this will take some time.
Reporting incorrect predictions in this thread
If you've come across suspicious predictions in the pre-trained models (tagger, parser, entity recognizer) or you want to contribute test cases for a given language, feel free to submit them here. (Test cases should be "fair" and useful for measuring the model's general accuracy, so single words, significant typos and very ambiguous parses aren't usually that helpful.)
You can check out our new models test suite for spaCy v2.1.0 to see the tests we're currently running.
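If you want to contribute a test case, a small self-contained format like the following makes reports easy to act on. The schema and names here are illustrative, not the actual spacy-models test-suite format; the expected tags are taken from the correct analysis of the example sentence discussed above:

```python
# Hypothetical test-case format for reporting suspicious predictions.
TEST_CASES = [
    {
        "lang": "en",
        "text": "I like London and Berlin.",
        "expected_pos": ["PRON", "VERB", "PROPN", "CCONJ", "PROPN", "PUNCT"],
    },
]

def check_case(nlp, case):
    """Return True if the pipeline's coarse POS tags match the expected
    sequence for this case. `nlp` is a loaded spaCy pipeline."""
    doc = nlp(case["text"])
    return [t.pos_ for t in doc] == case["expected_pos"]
```

Including the language, the exact input text, and the full expected tag sequence (rather than a single surprising token) makes it possible to turn a report directly into a regression test.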