ines opened this issue 5 years ago
just FYI, another regression below. However, this one seems to have appeared only after 2.2.0, because the spaCy visualizer demo (2.2.0) shows it correctly
v.2.0.18
v.2.2.4
I noticed an example in which the small model fails but the medium model succeeds: murmured is incorrectly tagged as a sentence-initial proper noun in the example below (which perhaps explains the missing lemmatization). When a noun phrase precedes it, it's correctly parsed.
Small Model:
import spacy
nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")
sent = "murmured Nick in the library"
for token in nlp_sm(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, "\n")
murmured murmured PROPN NNP ROOT
Nick Nick PROPN NNP dobj
in in ADP IN prep
the the DET DT det
library library NOUN NN pobj
import spacy
nlp_sm = spacy.load("en_core_web_sm")
sent = "I murmured in the library"
for token in nlp_sm(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, "\n")
I -PRON- PRON PRP nsubj
murmured murmur VERB VBD ROOT
in in ADP IN prep
the the DET DT det
library library NOUN NN pobj
Medium Model:
import spacy
nlp_md = spacy.load("en_core_web_md")
sent = "murmured Nick in the library"
for token in nlp_md(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, "\n")
murmured murmur VERB VBD ROOT
Nick Nick PROPN NNP npadvmod
in in ADP IN prep
the the DET DT det
library library NOUN NN pobj
spacy 2.2.3 py37ha1b3eb9_0 conda-forge
Hello there,
I am wondering if there is a way to force the POS tagger to treat tokens as non-verbs in order to not mess up the dependency parser. In my case, I have as input a long list of noun chunks, hence no verbs are expected to occur in my input. I noticed that for some cases the POS tagger gets confused:
import spacy
nlp = spacy.load('en_core_web_lg')
chunks = ['reading light', 'flashing light']
for chunk in chunks:
    doc = nlp(chunk)
    for token in doc:
        print(token.text, token.dep_, token.tag_)
    print('-' * 10)
yields:
reading ROOT VBG
light dobj NN
----------
flashing ROOT VBG
light dobj NN
while the expected output would be that in both chunks the ROOT is "light". So, can I hint the tagger that I am giving it something that can't be verb-ish? That way the parser would not fail, I presume.
Thanks!
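As far as I know there is no built-in switch for this, but one possible workaround is a small pipeline component that rewrites verb-ish tags before later components see them. This is only a sketch using the spaCy v3 component API; the name force_non_verb is made up, and note that the parser predicts from token vectors rather than from these tags, so this mainly repairs downstream attributes such as pos_, which noun_chunks relies on:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("force_non_verb")  # hypothetical component name
def force_non_verb(doc):
    # Rewrite any verb-ish fine-grained tag as a plain noun.
    for token in doc:
        if token.tag_.startswith("VB"):
            token.tag_ = "NN"
            token.pos_ = "NOUN"
    return doc

# Demo on a hand-built doc standing in for tagger output:
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["reading", "light"])
doc[0].tag_ = "VBG"
doc[1].tag_ = "NN"
force_non_verb(doc)
print([(t.text, t.tag_, t.pos_) for t in doc])
```

In a real pipeline you would add it with `nlp.add_pipe("force_non_verb", after="tagger")`.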
I'm not sure if this is the place to ask, but I'm wondering, given the state of the art for POS tagging as reported by: https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) - is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? I understand their reported accuracy might be on a completely different set of benchmarks, but have they been evaluated on spaCy's benchmarks? Are there licensing restrictions that keep them from being integrated as POS taggers? A parallel might be the vast number of pretrained language/embedding models implemented by HuggingFace's or TensorFlow Hub's repos, many of which are developed by people not directly associated with the repos themselves.
My understanding is that spaCy defaults to trading some accuracy for speed. Using the defaults means you get that compromise too. But spaCy is very hackable, so you can BYOM. They also link to (and I think, made) spaCy Stanza which is one way to use fancier models that are slower but more accurate.
None of the three German language models recognizes Karlsruhe (a city) as LOC, although they do recognize smaller cities.
import spacy
nlp_de = spacy.load('de_core_news_lg')
doc_de = nlp_de('Ettlingen liegt bei Karlsruhe.')
for entity in doc_de.ents:
    print(entity.text, entity.label_)
Result:
Ettlingen LOC
Karlsruhe MISC
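Until the statistical model improves, individual entities like this can be patched with spaCy's EntityRuler. The sketch below uses a blank German pipeline so it is self-contained; with the trained model you would insert the ruler before the statistical component via `nlp.add_pipe("entity_ruler", before="ner")`:

```python
import spacy

# Blank pipeline so the sketch runs without a trained model installed.
nlp = spacy.blank("de")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "LOC", "pattern": "Karlsruhe"}])
doc = nlp("Ettlingen liegt bei Karlsruhe.")
print([(ent.text, ent.label_) for ent in doc.ents])
```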
I noticed that the NER makes some mistakes when tagging text containing money amounts, depending on the numerical value, and I wonder if you could do something about it when training the next version of the models.
For instance, if I run the following notebook cell
import spacy
nlp = spacy.load('en_core_web_lg')
symbols = ["$", "£", "€", "¥"]
for symbol in symbols:
    print("-----------------------")
    print("Symbol: {}".format(symbol))
    print("-----------------------")
    for j in range(0, 2):
        for i in range(1, 10):
            text = nlp(symbol + str(j) + '.' + str(i) + 'mn favourable variance in employee expense over the forecast period')
            print(str(j) + '.' + str(i))
            for ent in text.ents:
                print(ent.text, ent.start_char, ent.end_char, ent.label_)
I get the following results, where some amounts are missed or mislabeled depending on the symbol and value:
-----------------------
Symbol: $
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.2mn 1 6 MONEY
0.3
0.3mn 1 6 MONEY
0.4
0.4mn 1 6 MONEY
0.5
0.5mn 1 6 MONEY
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.8mn 1 6 MONEY
0.9
0.9mn 1 6 MONEY
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.3mn 1 6 MONEY
1.4
1.4mn 1 6 MONEY
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.8mn 1 6 MONEY
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: £
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.2mn 1 6 MONEY
0.3
0.3mn 1 6 MONEY
0.4
0.4mn 1 6 ORG
0.5
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.9
0.9mn 1 6 MONEY
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.3mn 1 6 MONEY
1.4
1.4mn 1 6 MONEY
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.8mn 1 6 MONEY
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: €
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.3
0.3mn 1 6 MONEY
0.4
0.5
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.9
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.4
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: ¥
-----------------------
0.1
0.2
0.3
0.4
¥0.4mn 0 6 ORG
0.5
¥0.5mn 0 6 ORG
0.6
¥0.6mn 0 6 GPE
0.7
0.8
¥0.8mn 0 6 ORG
0.9
¥0.9mn 0 6 ORG
1.1
¥1.1mn 0 6 ORG
1.2
1.3
1.4
1.5
1.6
¥1.6mn 0 6 ORG
1.7
1.8
¥1.8mn 0 6 NORP
1.9
¥1.9mn 0 6 CARDINAL
Info about model 'en_core_web_lg'
(...)
version 2.3.1
spacy_version >=2.3.0,<2.4.0
Facing an interesting issue with the large (and small) pre-trained models:
import spacy
nlp = spacy.load('en_core_web_lg')
text = "POL /hi there abcde ffff"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
POL 0 3 PERSON
If I remove the "/" it does not detect any entities (the expected behavior). Any idea why the leading slash throws NER off?
This is observed using spacy 2.2.3
Btw, I also tried with the spaCy nightly and the TRF model; the issue does not exist with the transformer model.
spaCy 3.0 tags “tummy” as a determiner in “tummy ache”. I think this is serious, since determiners are closed-class words and arguably there should be literally half a dozen of them in the English language and never more. Tagging any content word as a determiner is likely to cause many apps to ignore it.
Model: en_core_web_lg-3.0.0
In [1]: import spacy; nlp = spacy.load('en_core_web_lg')
In [2]: [tok.tag_ for tok in nlp('tummy ache')]
Out[2]: ['DT', 'NN']
Here's a perverse case. Is Will a first name or an AUX MD? spaCy 3.0.1 with en_core_web_sm gets confused.
nlp = spacy.load('en_core_web_sm')
doc = nlp("Will Will Shakespeare write his will?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
Will will AUX MD aux Xxxx True True
Will will AUX MD aux Xxxx True True
Shakespeare shakespeare VERB VB nsubj Xxxxx True False
write write VERB VB ROOT xxxx True False
his his PRON PRP$ poss xxx True True
will will NOUN NN dobj xxxx True True
? ? PUNCT . punct ? False False
Here's another perverse case. Is May a first name or an AUX MD? spaCy 3.0.1 with en_core_web_sm gets confused.
nlp = spacy.load('en_core_web_sm')
doc = nlp("May May celebrate May Day with us?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
May May PROPN NNP ROOT Xxx True True
May may AUX MD aux Xxx True True
celebrate celebrate VERB VB ROOT xxxx True False
May May PROPN NNP compound Xxx True True
Day Day PROPN NNP npadvmod Xxx True False
with with ADP IN prep xxxx True True
us we PRON PRP pobj xx True True
? ? PUNCT . punct ? False False
However, this similar sentence treats May differently:
doc = nlp("May May come over for dinner?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
May may AUX NNP aux Xxx True True
May may AUX MD aux Xxx True True
come come VERB VB ROOT xxxx True False
over over ADP RP prt xxxx True True
for for ADP IN prep xxx True True
dinner dinner NOUN NN pobj xxxx True False
? ? PUNCT . punct ? False False
and of course, there are a number of women's names that could give a parser fits (Tom Swifty):
and some men's names:
While upgrading from spaCy 2.3.1 to 3.0.0, I've noticed that some person entities are no longer detected:
import en_core_web_lg
nlp = en_core_web_lg.load()
doc = nlp('I am not Mr Foo.')
print(doc.ents)
# (Mr Foo,) with 2.3.1
# () with 3.0.0
The multilingual model xx_sent_ud_sm does not tokenize Chinese sentences correctly, while the Chinese model zh_core_web_sm does. For example:
import spacy
nlp_ml = spacy.load("xx_sent_ud_sm")
nlp_ml.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究', '。']
nlp_zh = spacy.load("zh_core_web_sm")
nlp_zh.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括', '联合国', '机构', '和', '机制', '提出', '的', '有关', '建议', '以及', '现有', '的', '外部', '资料', '对', '有关', '国家', '进行', '筹备性', '研究', '。']
SpaCy version is 3.0.0
@Riccorl : This is the expected behavior for the base xx tokenizer used in that model, which just doesn't work for languages without whitespace between tokens. It was a mistake to include the Chinese and Japanese training corpora in the xx_sent_ud_sm 3.0.0 model. They'll be omitted in the next release.
The zh_core_web_sm model uses a completely separate tokenizer based on pkuseg to do word segmentation.
Clear. Thank you for the explanation.
There is a consistent issue with two-word adjectives in the English language. Hyphenated two-word adjectives containing prepositions are tokenized apart, and the POS tagging model is unable to recognize them as an adjective. This causes the model to fail when extracting noun chunks:
To my understanding, on-board should be identified as ADJ and its dependency to charger should be amod, which would not break the noun chunk dependency tree.
Another example:
Two-word adjectives not starting with prepositions are properly detected:
I tried the following pipelines and all have the same issue:
en_core_web_sm
en_core_web_lg
en_core_web_trf
@ezorita Had a quick look at the training data (OntoNotes) and it looks like "for-profit" is consistently annotated as a prepositional phrase with three tokens, while "non-profit" is a single token adjective. So it looks like this may just be a quirk of our training data.
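Until the training data changes, specific hyphenated adjectives can be patched with the AttributeRuler. A sketch on a blank English pipeline (the trained models already ship an attribute_ruler component you could add patterns to instead):

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# Match the three tokens of "on-board" and force the "board" token
# (pattern index 2) to ADJ.
ruler.add(patterns=[[{"LOWER": "on"}, {"TEXT": "-"}, {"LOWER": "board"}]],
          attrs={"POS": "ADJ"}, index=2)
doc = nlp("an on-board charger")
print([(t.text, t.pos_) for t in doc])
```

This only fixes the POS attribute, not the dependency arcs, but it can be enough to keep noun-chunk extraction from breaking on known cases.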
Hi, I hope this is the right place to ask about bad parses in spaCy v3.0.6. The first two parses I tried were both disturbingly incorrect:
"John eats cake": "cake" is parsed as a verb. It's not possible for any verb to appear there in this sentence, and cake is rarely used as a verb at all (other than caking of sand, etc). This affects both en_core_web_sm and en_core_web_md, but not en_core_web_lg.
"John eats salad": "salad" is parsed as a verb, and "eats" is parsed as an auxiliary (!!!). This affects only
en_core_web_sm
and not en_core_web_md
or en_core_web_lg
. Similar to the above comment https://github.com/explosion/spaCy/issues/3052#issuecomment-777275069, auxiliaries are a closed class in English and should really not ever apply to the verb eats
.
@hanryhu 😨
Thanks for the example, that definitely looks wrong. I wonder what's going on there, hm. I doubt salad is even an unseen word!
Hi, I got another example of a word that is incorrectly labelled into a closed class: in particular, the word near seems to always be parsed as SCONJ. Why might misparses like this be introduced in spaCy 3?
There are some tokenization inconsistencies with French models with some common sentence structures, such as questions (inversion VERB then nsubj)
For instance, the following grammatical sentence in French, "La librairie est-elle ouverte ?" (is the bookshop open?), is tokenized as:
'La' | 'librairie' | 'est-elle' | 'ouverte' | '?' |
---|---|---|---|---|
'DET' | 'NOUN' | 'PRON' | 'ADJ' | 'PUNCT' |
when it really should be:
'La' | 'librairie' | 'est' | '-elle' | 'ouverte' | '?' |
---|---|---|---|---|---|
'DET' | 'NOUN' | 'VERB' | 'PRON' | 'ADJ' | 'PUNCT' |
Sometimes there is no problem, as in "La libraire veut-elle des bonbons ?" (Does the bookseller want candy?), which gives, as expected:
'La' | 'libraire' | 'veut' | '-elle' | 'des' | 'bonbons' | '?' |
---|---|---|---|---|---|---|
'DET' | 'NOUN' | 'VERB' | 'PRON' | 'DET' | 'NOUN' | 'PUNCT' |
(Model used for generating those examples is fr_dep_news_trf )
For this sentence, spaCy tags "down" as ADV. But it seems that "down" should be tagged as ADP, since "down" is movable around the object.
(postag) lab-workstation:$ cat test_spaCy.py
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_trf', exclude=['lemmatizer', 'ner'])
sen = 'Hold the little chicken down on a flat surface .'
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(sen)
for token in doc:
    print(token.i, token.text, token.pos_, token.dep_, token.head, token.head.i)
(postag) lab-workstation:$ python test_spaCy.py
0 Hold VERB ROOT Hold 0
1 the DET det chicken 3
2 little ADJ amod chicken 3
3 chicken NOUN dobj Hold 0
4 down ADV advmod Hold 0
5 on ADP prep Hold 0
6 a DET det surface 8
7 flat ADJ amod surface 8
8 surface NOUN pobj on 5
9 . PUNCT punct Hold 0
spaCy version: 3.1.0 Platform: Linux-4.15.0-48-generic-x86_64-with-debian-buster-sid Python version: 3.7.10 Pipelines: en_core_web_trf (3.1.0), en_core_web_sm (3.1.0)
I have a complaint about the Portuguese model as well. The expression um irmão meu is parsed wrong by pt_core_news_md; pt_core_news_sm functions correctly. Here's the output of the medium model for the dependency parse; there are 2 ROOTs:
>>> [token.dep_ for token in doc]
['det', 'ROOT', 'ROOT']
The second ROOT should be det.
Strange behavior with alignment and transformers. In the following text (changing it a bit makes the error disappear), the word "dog" in the second sentence is aligned with two token parts, but the second one is "speed", the next word:
spacyModel="en_core_web_trf"
nlp = spacy.load(spacyModel)
text= "Mariano jumps over a lazy dog. The dog speed is 20 km / h. "
doc =nlp(text)
align=doc._.trf_data.align
tokens=doc._.trf_data.tokens
for tok, parts in zip(doc, align):
    part_ids = [x for y in parts.data for x in y]
    print(tok.text, parts.lengths, part_ids,
          '|'.join(tokens['input_texts'][0][part] for part in part_ids))
produces this output:
Mariano [3] [1, 2, 3] M|arian|o
jumps [1] [4] Ġjumps
over [1] [5] Ġover
a [1] [6] Ġa
lazy [1] [7] Ġlazy
dog [1] [8] Ġdog
. [1] [9] .
The [1] [10] ĠThe
dog [2] [11, 12] Ġdog|Ġspeed
speed [1] [12] Ġspeed
is [1] [13] Ġis
20 [1] [14] Ġ20
km [1] [15] Ġkm
/ [1] [16] Ġ/
h. [2] [17, 18] Ġh|.
Maybe, should it be introduced as a bug?
FYI- the G unicode character is:
U+0120 : LATIN CAPITAL LETTER G WITH DOT ABOVE
My guess is the 0x20 part is for a space and (wilder guess) is the 0x01 might be the length of the space.
The Ġ I don't care about; I think it's part of the tokenizer, I did not check. You should realize that dog has two parts [11, 12] while speed also has [12], meaning that word part 12 is duplicated.
If you change the first sentence (just remove "lazy") then the result seems correct
text= "Mariano jumps over a dog. The dog speed is 20 km / h. "
Mariano [3] [1, 2, 3] M|arian|o
jumps [1] [4] Ġjumps
over [1] [5] Ġover
a [1] [6] Ġa
dog [1] [7] Ġdog
. [1] [8] .
The [1] [9] ĠThe
dog [1] [10] Ġdog
speed [1] [11] Ġspeed
is [1] [12] Ġis
20 [1] [13] Ġ20
km [1] [14] Ġkm
/ [1] [15] Ġ/
h. [2] [16, 17] Ġh|.
@joancf : I'm pretty sure this is an unfortunate side effect of not having a proper encoding for special tokens or special characters in the transformer tokenizer output. In the tokenizer output it looks identical when there was <s> in the original input and when the tokenizer itself has inserted <s> as a special token.
To align the two tokenizations, we're using an alignment algorithm that doesn't know anything about the special use of Ġ, but it does know about unicode and that this character is related to a g, so I think it ends up aligned like this because of the final g in dog. If you replace g with a different letter, it's not aligned like this.
We've also had issues with <s> being aligned with the letter s, and since the tokenizer does have settings that show what its special tokens are, we try to ignore those while aligning when possible, but the various tokenizer algorithms vary a lot and in the general case we don't know which parts of the output are special characters.
If we want to drop support for slow tokenizers, I think we can potentially work with the alignment returned by the tokenizer, but we haven't gotten it to work consistently in the past and for now we're using this separate alignment method. I suspect this kind of alignment happens relatively rarely and doesn't affect the final annotation too much.
Hi @adrianeboyd, I think I can survive with this strange punctual issue, but when it comes to longer sentences the parts seem totally broken. It seems (please correct me if I'm wrong) that to process a long text internally, the nlp pipe splits it into blocks of ~145 original tokens (I don't know how to increase/modify that number), and the results in trf_data are also per block (the size of all blocks is the same, but it varies from document to document up to the maximum number of word parts in any of them). But trf_data.align is a single ragged array with the same size as the document's tokens.
In the next code... i tried with a long text:
text=""" This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
And of course one would expect that the word sentence should have the same embedding in each sentence.
Well, maybe slightly different but not very different, so we can chen similarity.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
And of course one would expect that the word sentence should have the same embedding in each sentence.
Well, maybe slightly different but not very different, so we can chen similarity.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
"""
spacyModel="en_core_web_trf"
nlp = spacy.load(spacyModel)
doc =nlp(text)
toks=[tok for tok in doc]
print(f"tokens length {len(toks)}")
align=doc._.trf_data.align
tokens=doc._.trf_data.tokens
trf=doc._.trf_data.tensors
print (f" number of data wordParts {len(align)} distributed in {len(tokens['input_texts'])} chunks of size {len(tokens['input_texts'][0])} and transfomers shape {trf[0].shape} ")
# we can flatten the inputs and tensors(in ndarray not tensors), to apply alignment more easily
x,y,z=trf[0].shape
trf[0].shape=(1,-1,z)
print(trf[0].shape)
inputs = [x for ins in tokens['input_texts'] for x in ins]
print(f"size of flatten inputs and tensors: {len(inputs)} , {trf[0].shape}")
for tok, parts in zip(toks, align):
    part_ids = [x for y in parts.data for x in y]
    print(tok.text, part_ids, '|'.join(inputs[part] for part in part_ids))
produces these outputs
number of data wordParts 369 distributed in 4 chunks of size 147 and transfomers shape (4, 147, 768)
(1, 588, 768)
size of flatten inputs and tensors: 588 , (1, 588, 768)
[]
This [2] ĠThis
is [3] Ġis
a [4] Ġa
test [5] Ġtest
sentence [6] Ġsentence
to [7] Ġto
check [8] Ġcheck
how [9] Ġhow
spacy [10, 11] Ġsp|acy
processes [12] Ġprocesses
long [13, 14] Ġlong|Ġtexts
texts [14] Ġtexts
.... more stuff here...
sentence [107] Ġsentence
should [108] Ġshould
have [109, 148] Ġhave|have
the [110, 149] Ġthe|Ġthe
same [111, 150] Ġsame|Ġsame
embedding [112, 113, 114, 151, 152] Ġembed|ding|Ġin|Ġembed|ding
in [114, 153] Ġin|Ġin
each [115, 154] Ġeach|Ġeach
sentence [116, 155] Ġsentence|Ġsentence
. [117, 156] .|.
......
. [135, 174] .|.
[]
This [137, 176] This|This
is [138, 177] Ġis|Ġis
a [139, 178] Ġa|Ġa
test [140, 179] Ġtest|Ġtest
sentence [141, 180] Ġsentence|Ġsentence
to [142, 181] Ġto|Ġto
check [182] Ġcheck
how [183] Ġhow
spacy [184, 185] Ġsp|acy
processes [186] Ġprocesses
long [187] Ġlong
texts [188] Ġtexts
So, the parts of "long" are wrong (the error mentioned before). But after position 109 ("have"), the sets of token parts show duplications, and also a distance of ~40 tokens between parts; it seems to be confusing one sentence with the next one. This error continues up to position 142, where it jumps to 182 and seems to produce correct results again (a single token part), up to token 403, where the same behavior reappears.
[Edited] It is not a bug; I answer myself. The reason this happens is that batches include the last part of the previous segment to give context, so the last words in a segment appear twice and therefore have two embeddings. E.g. the word in position 109 is duplicated in the next segment (starting at 148), and this happens for the next 40 word parts. So the embedding of "have" should be the "average" of the embeddings at [109, 148].
This means that nearly 1/4 (~40/160) of the tokens are processed twice
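Given that interpretation, a token that falls in the stride overlap can simply be mean-pooled over all the wordpiece rows its alignment points to. A minimal sketch (the array shapes and positions are made up for illustration; `flat_wordpiece_vectors` stands for the flattened trf tensors):

```python
import numpy as np

def token_vector(flat_wordpiece_vectors, part_ids):
    """Mean-pool the wordpiece rows aligned to one spaCy token.

    A token in the stride overlap contributes rows from both chunks,
    e.g. part_ids == [109, 148] for the duplicated "have".
    """
    return np.mean([flat_wordpiece_vectors[i] for i in part_ids], axis=0)

# Toy example: the same wordpiece embedded twice (positions 109 and 148).
vecs = np.zeros((200, 4))
vecs[109] = [1.0, 2.0, 3.0, 4.0]
vecs[148] = [3.0, 4.0, 5.0, 6.0]
print(token_vector(vecs, [109, 148]))  # → [2. 3. 4. 5.]
```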
de_core_news_sm does not correctly infer sentence boundaries for gendered sentences in German. In German, a current trend is to gender plurals of a word to make clear that women are included (since the generic plural is masculine). E.g. 'Kunde' (customer) becomes 'Kund:innen' (or 'Kund*innen' or 'Kund_innen' or 'KundInnen').
The sentence 'Selbstständige mit körperlichen Kund:innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.' gets split into two sentences at the colon, even though it is one.
To reproduce the issue:
import spacy
nlp = spacy.load("de_core_news_sm")
sentence = 'Selbstständige mit körperlichen Kund:innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.'
print(list(nlp(sentence).sents))
yields
[Selbstständige mit körperlichen Kund:, innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.]
(So two instead of one sentence).
de_core_news_md handles this correctly.
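As a stopgap for the small model, a tiny component added before the parser can veto sentence starts right after such a colon; components that run before the parser may pre-set boundaries the parser then respects. A sketch (the component name is made up):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("no_gender_colon_split")  # hypothetical name
def no_gender_colon_split(doc):
    # Forbid a sentence boundary right after ":" when a lowercase
    # continuation follows, as in "Kund:innen".
    for i in range(1, len(doc)):
        if doc[i - 1].text == ":" and doc[i].text[:1].islower():
            doc[i].is_sent_start = False
    return doc

# Demo on a hand-built doc simulating the unwanted split:
nlp = spacy.blank("de")
doc = Doc(nlp.vocab, words=["Kund", ":", "innenkontakt", "sind", "."])
doc[2].is_sent_start = True          # the unwanted boundary
no_gender_colon_split(doc)
print(doc[2].is_sent_start)
```

In a real pipeline: `nlp.add_pipe("no_gender_colon_split", before="parser")`.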
The core model en_core_web_sm-3.0.0 seems to have trouble detecting organization entities if the brand name contains a possessive pronoun ("my"), even when directly calling it a company to supply context. Sometimes it thinks it is a cardinal, other times money.
Or in the demo:
Passing it any sentence with brand names that contain this language appears to introduce a lot of consistency issues.
zh_core_web_trf is not detecting sentence boundaries correctly in Chinese.
nlp = spacy.load("zh_core_web_trf")
doc = nlp("我是你的朋友。你是我的朋友吗?我不喜欢喝咖啡。")
This should be three separate sentences, but the sents property only contains one sentence.
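As a workaround until the model is fixed, the rule-based sentencizer can split on Chinese punctuation. This sketch runs on a blank Chinese pipeline (default character segmentation, no extra dependencies); with zh_core_web_trf you would instead need a boundary-setting component inserted before the parser:

```python
import spacy

nlp = spacy.blank("zh")  # default character-based segmentation
# Split on both fullwidth and ASCII sentence-final punctuation.
nlp.add_pipe("sentencizer", config={"punct_chars": ["。", "?", "?", "!", "!"]})
doc = nlp("我是你的朋友。你是我的朋友吗?我不喜欢喝咖啡。")
print([sent.text for sent in doc.sents])
```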
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.
Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.
The error candidates need to be removed from the default list of stop words, please see attached spreadsheet, which contains both Norwegian bokmål, English, if it is an error candidate, and a short comment about why.
While infecting the default list of stop words could be considered an attack vector, a way of "poisoning the well", this is probably due to a local stop word list having been committed to the central repository at some point.
Below are the steps needed to reproduce the list of stop words in Norwegian bokmål.
# Import Spacy
import spacy
# Import Norwegian bokmål from Norwegian language
from spacy.lang.nb import Norwegian
# Importing stop words from Norwegian bokmål language
spacy_stopwords = spacy.lang.nb.stop_words.STOP_WORDS
# Printing the total number of stop words:
print('Default number of stop words in Norwegian bokmål in Spacy: %d' % len(spacy_stopwords))
# Printing stop words:
print('Default stop words in Norwegian bokmål in Spacy: %s' % list(spacy_stopwords)[:249])
Default stop words in Norwegian bokmål in Spacy: ['har', 'fjor', 'dem', 'får', 'oss', 'det', 'gikk', 'svært', 'tillegg', 'fem', 'fram', 'noe', 'ifølge', 'kontakt', 'og', 'få', 'ut', 'blant', 'fikk', 'være', 'mellom', 'videre', 'tyskland', 'der', 'tid', 'mot', 'bak', 'mål', 'ikke', 'laget', 'saken', 'landet', 'utenfor', 'bris', 'hennes', 'kom', 'seks', 'ha', 'hva', 'leder', 'å', 'denne', 'gjør', 'regjeringen', 'del', 'sted', 'man', 'funnet', 'prosent', 'bare', 'satt', 'gå', 'menn', 'tirsdag', 'nok', 'vært', 'her', 'en', 'ser', 'fredag', 'veldig', 'at', 'også', 'komme', 'først', 'kort', 'annen', 'gjennom', 'nye', 'når', 'kunne', 'annet', 'oslo', 'igjen', 'skulle', 'frankrike', 'i', 'et', 'klart', 'land', 'henne', 'meg', 'kveld', 'uten', 'president', 'drept', 'fire', 'kroner', 'under', 'fotball', 'fortsatt', 'ta', 'gjort', 'var', 'blir', 'politiet', 'av', 'fra', 'etter', 'sett', 'eller', 'bedre', 'inn', 'mens', 'andre', 'ny', 'på', 'til', 'ligger', 'helt', 'personer', 'ingen', 'ved', 'god', 'ville', 'and', 'vant', 'kvinner', 'som', 'politidistrikt', 'tror', 'slik', 'tre', 'tatt', 'løpet', 'store', 'viktig', 'kl', 'siste', 'måtte', 'like', 'for', 'flere', 'lørdag', 'millioner', 'allerede', 'usa', 'mars', 'seg', 'mannen', 'samme', 'sier', 'stor', 'mandag', 'jeg', 'noen', 'mange', 'mennesker', 'hvorfor', 'vi', 'ja', 'ntb', 'år', 'dette', 'beste', 'neste', 'står', 'litt', 'kampen', 'by', 'nå', 'sa', 'selv', 'vil', 'mye', 'gang', 'opp', 'bli', 'ble', 'er', 'godt', 'siden', 'russland', 'de', 'la', 'ett', 'stedet', 'før', 'norske', 'om', 'opplyser', 'ham', 'ned', 'kommer', 'rundt', 'tilbake', 'du', 'hans', 'kamp', 'minutter', 'gjøre', 'gjorde', 'september', 'den', 'sitt', 'sammen', 'hvor', 'to', 'så', 'han', 'sin', 'samtidig', 'viser', 'da', 'dag', 'grunn', 'alle', 'norge', 'msci', 'fått', 'hele', 'går', 'men', 'mener', 'norsk', 'se', 'ønsker', 'gi', 'hun', 'disse', 'hadde', 'plass', 'både', 'alt', 'torsdag', 'første', 'skal', 'må', 'søndag', 'kan', 'vår', 'senere', 'langt', 
'tok', 'folk', 'dermed', 'med', 'mer', 'sverige', 'blitt', 'poeng', 'enn', 'over', 'runde', 'sine', 'tidligere', 'skriver', 'onsdag', 'hvordan']
2021-12-06 NLP Spacy - stop words in Norwegian bokmål model - error candidates.xlsx
EDIT: Wow, these stop word errors have been in the Norwegian bokmål file since 2017! o_O See https://github.com/explosion/spaCy/blob/f46ffe3e893452bf0c171c6c7fcf3b0e458c8f9e/spacy/lang/nb/stop_words.py
Hi @HaakonME !
There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.
I do want to point out that we don't typically recommend filtering out stop words, as with today's modern neural network approaches this is rarely needed or even useful. That said, some users do rely on them for various preprocessing needs, and I definitely agree with you that they should not contain meaningful words.
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.
If you would feel up to the challenge, we'd appreciate a PR to address some of the most obvious mistakes in the stop word lists. Ideally, that PR should be based off of our develop
branch, because we consider changing the current stop words as slightly breaking, and would keep the change for 3.3 (in contrast, the current master
branch will power the next 3.2.1 release).
If you need help for creating the PR, I could recommend reading the section over at https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#getting-started and we can try to guide you as well :-)
Hi @svlandeg !
I have proposed a change to remove NER words from Norwegian stop words in the develop branch as suggested. :-)
Cheers, Haakon
@peterolson Sorry for the late reply, but thanks for reporting this. It does seem that the zh trf model really avoids recognizing short sentences.
We took a look at our training data (OntoNotes) and didn't find anything obviously wrong, but we'll keep looking at it.
Spanish tokenization is broken when there is no space between question sentences "?¿"
nlp = spacy.load("es_dep_news_trf")
doc = nlp("¿Qué quieres?¿Por qué estás aquí?")
quieres?¿Por is treated as one token, but there should be a sentence boundary between "?" and "¿", and "quieres" and "Por" should be separate tokens.
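Until the tokenizer handles inverted punctuation, a pragmatic workaround is to normalize the text before calling nlp, inserting a space between the closing and opening marks (a sketch; fix_inverted_punct is a made-up helper):

```python
import re

def fix_inverted_punct(text):
    # Insert a space between "?!"-style closers and "¿¡" openers so the
    # tokenizer sees a boundary: "quieres?¿Por" -> "quieres? ¿Por".
    return re.sub(r"([?!])([¿¡])", r"\1 \2", text)

print(fix_inverted_punct("¿Qué quieres?¿Por qué estás aquí?"))
# → ¿Qué quieres? ¿Por qué estás aquí?
```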
NER recognizes 's as an entity in the en_core_web_sm and en_core_web_lg models. Example below:
import spacy
content = """3 WallStreetBets Stocks to Trade | Markets Insider InvestorPlace - Stock Market News, Stock Advice & Trading Tips
What’s the next big thing on Wall Street? These days it might just be what’s trending or more specifically, receiving big-time mentions on WallStreetBets. Or not. The name in question might already be a titan of commerce.
Today let’s take a look at three such WallStreetBets stocks’ price charts and determine what’s technically hot and what’s not for your portfolio.
Reddit’s r/WallStreetBets. The seat-of-your-pants trading forum has made quite the name for itself in 2021. But you don’t need me to tell you that, right? Right."""
nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")
nlp_lg = spacy.load("en_core_web_lg")
nlp_sm(content).ents
Out[16]: (3, Stock Advice & Trading Tips, Today, ’s, three, 2021)
nlp_md(content).ents
Out[17]: (3, Stock Advice & Trading Tips, Today, three, 2021)
nlp_lg(content).ents
Out[18]: (3, These days, Today, ’s, three, Reddit, 2021)
Version Info:
pip list | grep spacy
spacy 3.0.6
spacy-alignments 0.8.3
spacy-legacy 3.0.8
spacy-stanza 1.0.0
spacy-transformers 1.0.2
@narayanacharya6 Cannot reproduce with 3.2. Can you upgrade and try again? Also include your model versions (spacy info).
Note that ’s and 's are not the same, and the non-ASCII version is probably not in our training data. I suspect we fixed this with character augmentation at some point.
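If the curly apostrophe is the culprit, a pre-processing step that maps "smart" punctuation to ASCII equivalents should sidestep the problem on older models. A sketch; the function name is hypothetical, not part of spaCy:

```python
def normalize_quotes(text: str) -> str:
    # Replace common "smart" punctuation with ASCII equivalents
    # before running the pipeline.
    replacements = {
        "\u2019": "'",  # right single quotation mark
        "\u2018": "'",  # left single quotation mark
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
    }
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    return text

print(normalize_quotes("What\u2019s the next big thing?"))
# → What's the next big thing?
```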
Outputs in the previous comment were based on model version 3.0.0. Tried version 3.2.0, and ’s is no longer identified as an entity. Thanks!
For the German sentence "Die Ärmel der Strickjacke haben am Armabschluss ein Bündchen.", in v3.2.1 "Die Ärmel" is parsed as Fem Singular instead of Masc Plural; in v3.1.4 the determiner "Die" was correctly parsed as Masc Plural ("Case=Nom|Definite=Def|Gender=Masc|Number=Plur|PronType=Art").
For the English sentence "Kennedy got killed.", "got" is lemmatized to "got" instead of "get".
Sorry for posting an unrelated point here, but I could not figure out a better place. Is there a reference to the model architecture / training code for the public models published by spaCy (e.g. en_core_web_md)? I looked at the spaCy models repo, but that has model files and meta information, not the actual training code.
@saurav-chakravorty If you have a question it's better to open a new Discussion than to post in an unrelated thread.
The training code is not public, partly because the training data requires a license (like OntoNotes for English), partly because a lot of it is infra-related and not of public interest.
The list of stop words in Spanish contains many infrequent verb forms and unusual words. Compared to the English list, it has many more words, and many of them seem meaningful. It's a very strange selection.
Meaningful but infrequent words:
It even contains misspelled words (and the kind of misspellings that are not frequent):
Update: I also noticed that there aren't any one-letter stop words, while in English, 'a' and 'i' are included in the list. In Spanish, these letters could be considered stop words:
https://github.com/explosion/spaCy/blob/master/spacy/lang/es/stop_words.py
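These claims are easy to check directly against the shipped lists; a quick sketch (the exact counts and contents will differ across spaCy versions):

```python
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOP_WORDS
from spacy.lang.es.stop_words import STOP_WORDS as ES_STOP_WORDS

# Compare list sizes, then look for one-letter entries in each language.
print("en:", len(EN_STOP_WORDS), "es:", len(ES_STOP_WORDS))
print("en one-letter:", sorted(w for w in EN_STOP_WORDS if len(w) == 1))
print("es one-letter:", sorted(w for w in ES_STOP_WORDS if len(w) == 1))
```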
@mgrojo Thanks for pointing that out! If you'd like to open a PR we'd be happy to review it.
@polm Thanks. I've already made that pull request.
Some weirdness in de_core_news_md-3.3.0 ... I'm interested in lemmas, and I found the lemma of Hässliche varies depending on the context:
>>> nlp = spacy.load('de_core_news_md')
>>> [(x.lemma_, x.pos_) for x in nlp('Die neuste philosofische Prägung wird Hässliche genannt.')]
[('der', 'DET'), ('neuste', 'ADJ'), ('philosofisch', 'ADJ'), ('Prägung', 'NOUN'), ('werden', 'AUX'), ('hässliche', 'NOUN'), ('nennen', 'VERB'), ('--', 'PUNCT')]
>>> [(x.lemma_, x.pos_) for x in nlp('die Hässliche')]
[('der', 'DET'), ('Hässliche', 'NOUN')]
@dblandan The v3.3 German models switched from a lookup lemmatizer that only used the word form (no context) to a statistical lemmatizer where the output does depend on the context.
So there are different lexical entries for hässliche (NOUN) and Hässliche (NOUN), and one of them is capitalized while the other isn't. I'm OK with there being different entries, but I don't understand why one isn't capitalized given that it's still a noun. :thinking:
The adjectival form lemmatizes correctly to hässlich.
For reference:
hässlich ADJ 17149702774860831989
hässliche NOUN 5552098829343672028
Hässliche NOUN 17159517463969337747
The difference is that it's not looking up word forms in a table anymore, so the output is not just based on an entry for the POS or the word form. The lemmatizer is a statistical model like the tagger, and it uses the context to predict the lemmas based on the training data. For more details about how it works: https://explosion.ai/blog/edit-tree-lemmatizer
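As a very rough mental model (not the actual spaCy implementation), each leaf of an edit tree is a string edit such as "strip this suffix, append that one", and the statistical model chooses which edit to apply for each token based on its context:

```python
def apply_edit(form: str, strip: str, add: str) -> str:
    # One leaf of an (oversimplified) edit tree: remove a suffix,
    # then append a replacement suffix.
    if strip and not form.endswith(strip):
        raise ValueError(f"{form!r} does not end with {strip!r}")
    base = form[: len(form) - len(strip)] if strip else form
    return base + add

print(apply_edit("murmured", "ed", ""))   # → murmur
print(apply_edit("hässliche", "e", ""))   # → hässlich
```

Because the edit is chosen per token in context, the same surface form can receive different lemmas in different sentences, which is what happens with Hässliche above.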
I see. I knew that the edit-tree lemmatizer was coming; I'm still surprised about this particular output. I'll just handle it in post-processing. Thanks for the reply! :smile:
👋🏻 🤗
Let me know if there's a better place for this. I came across odd behavior from the English lemmatizer that seemed worth reporting.
Here are minimal reproduction steps showing that in certain circumstances the lemmatizer predicts/maps "guys" -> "you":
>>> import spacy
>>> spacy.__version__
'3.2.4'
>>> nlp = spacy.load("en_core_web_md")
>>> nlp("The guys all")[1].lemma_
'you'
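Until a retrained model fixes this, one-off lemma errors like this can be patched with spaCy's attribute_ruler component, which applies rule-based attribute overrides on top of the statistical pipeline. A sketch using a blank pipeline to keep it self-contained; with a trained model you would add the ruler after the lemmatizer instead:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# Force the lemma of "guys" to "guy", overriding whatever the
# statistical lemmatizer would otherwise predict for the matched token.
ruler.add(patterns=[[{"LOWER": "guys"}]], attrs={"LEMMA": "guy"})

doc = nlp("The guys all")
print(doc[1].lemma_)  # → guy
```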
This thread is a master thread for collecting problems and reports related to incorrect and/or problematic predictions of the pre-trained models.
Why a master thread instead of separate issues?
GitHub now supports pinned issues, which lets us create master threads more easily without them getting buried.
Users often report issues that come down to incorrect predictions made by the pre-trained statistical models. Those are all good and valid, and can include very useful test cases. However, having a lot of open issues around minor incorrect predictions across various languages also makes it more difficult to keep track of the reports. Unlike bug reports, they're much more difficult to action on. Sometimes, mistakes a model makes can indicate deeper problems that occurred during training or when preprocessing the data. Sometimes they can give us ideas for how to use data augmentation to make the models less sensitive to very small variations like punctuation or capitalisation.
Other times, it's just something we have to accept. A model that's 90% accurate will make a mistake on every 10th prediction. A model that's 99% accurate will be wrong once every 100 predictions.
The main reason we distribute pre-trained models is that it makes it easier for users to build their own systems by fine-tuning pre-trained models on their data. Of course, we want them to be as good as possible, and we're always optimising for the best compromise of speed, size and accuracy. But we won't be able to ship pre-trained models that are always correct on all data ever.
For many languages, we're also limited by the resources available, especially when it comes to data for named entity recognition. We've already made substantial investments into licensing training corpora, and we'll continue doing so (including running our own annotation projects with Prodigy ✨) – but this will take some time.
Reporting incorrect predictions in this thread
If you've come across suspicious predictions in the pre-trained models (tagger, parser, entity recognizer) or you want to contribute test cases for a given language, feel free to submit them here. (Test cases should be "fair" and useful for measuring the model's general accuracy, so single words, significant typos and very ambiguous parses aren't usually that helpful.)
You can check out our new models test suite for spaCy v2.1.0 to see the tests we're currently running.