explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

📚 Inaccurate pre-trained model predictions master thread #3052

Open ines opened 5 years ago

ines commented 5 years ago

This thread is a master thread for collecting problems and reports related to incorrect and/or problematic predictions of the pre-trained models.

Why a master thread instead of separate issues?

GitHub now supports pinned issues, which lets us create master threads more easily without them getting buried.

Users often report issues that come down to incorrect predictions made by the pre-trained statistical models. Those are all good and valid, and can include very useful test cases. However, having a lot of open issues around minor incorrect predictions across various languages also makes it more difficult to keep track of the reports. Unlike bug reports, they're much more difficult to act on. Sometimes, mistakes a model makes can indicate deeper problems that occurred during training or when preprocessing the data. Sometimes they can give us ideas for how to use data augmentation to make the models less sensitive to very small variations like punctuation or capitalisation.

Other times, it's just something we have to accept. A model that's 90% accurate will make a mistake on every 10th prediction. A model that's 99% accurate will be wrong once every 100 predictions.

The main reason we distribute pre-trained models is that it makes it easier for users to build their own systems by fine-tuning pre-trained models on their data. Of course, we want them to be as good as possible, and we're always optimising for the best compromise of speed, size and accuracy. But we won't be able to ship pre-trained models that are always correct on all data ever.

For many languages, we're also limited by the resources available, especially when it comes to data for named entity recognition. We've already made substantial investments into licensing training corpora, and we'll continue doing so (including running our own annotation projects with Prodigy ✨) – but this will take some time.

Reporting incorrect predictions in this thread

If you've come across suspicious predictions in the pre-trained models (tagger, parser, entity recognizer) or you want to contribute test cases for a given language, feel free to submit them here. (Test cases should be "fair" and useful for measuring the model's general accuracy, so single words, significant typos and very ambiguous parses aren't usually that helpful.)

You can check out our new models test suite for spaCy v2.1.0 to see the tests we're currently running.
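
If you're unsure what to include in a report, a small self-contained snippet showing the model, the input text and the relevant predictions usually works best. For example (a minimal sketch; substitute your own model and text):

import spacy

nlp = spacy.load("en_core_web_sm")  # the pretrained model you're reporting on
doc = nlp("Text that produces the suspicious prediction.")

for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_)
for ent in doc.ents:
    print(ent.text, ent.label_)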

fersarr commented 4 years ago

Just FYI, another regression below. This one seems to have appeared only after 2.2.0, because the spaCy visualizer demo (2.2.0) shows it correctly.

v2.0.18: [screenshot]

v2.2.4: [screenshot]

beckernick commented 4 years ago

Noticed an example in which the small model fails but the medium model succeeds. In the example below, "murmured" is incorrectly tagged as a proper noun when it starts the sentence (perhaps explaining the lack of lemmatization). When a noun phrase precedes it, it's correctly parsed.

Small Model:

import spacy

nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")

sent = "murmured Nick in the library"
for token in nlp_sm(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, "\n")
murmured murmured PROPN NNP ROOT 

Nick Nick PROPN NNP dobj 

in in ADP IN prep 

the the DET DT det 

library library NOUN NN pobj 

import spacy

nlp_sm = spacy.load("en_core_web_sm")
sent = "I murmured in the library"
for token in nlp_sm(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, "\n")
I -PRON- PRON PRP nsubj 

murmured murmur VERB VBD ROOT 

in in ADP IN prep 

the the DET DT det 

library library NOUN NN pobj 

Medium Model:

import spacy

sent = "murmured Nick in the library"
for token in nlp_md(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, "\n")
murmured murmur VERB VBD ROOT 

Nick Nick PROPN NNP npadvmod 

in in ADP IN prep 

the the DET DT det 

library library NOUN NN pobj 
spacy                     2.2.3            py37ha1b3eb9_0    conda-forge
stelmath commented 4 years ago

Hello there,

I am wondering if there is a way to force the POS tagger to treat tokens as non-verbs in order to not mess up the dependency parser. In my case, I have as input a long list of noun chunks, hence no verbs are expected to occur in my input. I noticed that for some cases the POS tagger gets confused:

import spacy

nlp = spacy.load('en_core_web_lg')
chunks = ['reading light', 'flashing light']

for chunk in chunks:
    doc = nlp(chunk)
    for token in doc:
        print(token.text, token.dep_, token.tag_)
    print('-'*10) 

yields:

reading ROOT VBG
light dobj NN
----------
flashing ROOT VBG
light dobj NN

while the expected output would be that in both chunks the ROOT is "light". So, can I hint the tagger that I am giving it something that can't be verb-ish? That way the parser would not fail, I presume.
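
For illustration, something like the following is the kind of override I have in mind (a rough sketch against the spaCy v2 add_pipe API; I don't know whether the parser would actually respect manually set tags):

import spacy

nlp = spacy.load('en_core_web_lg')

def force_non_verbs(doc):
    # Hypothetical hook: the input is known to be noun chunks, so overwrite verb-ish tags.
    for token in doc:
        if token.tag_.startswith('VB'):
            token.tag_ = 'NN'  # in v2, assigning the tag should also update token.pos_ via the tag map
    return doc

nlp.add_pipe(force_non_verbs, name='force_non_verbs', after='tagger')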

Thanks!

sam-writer commented 3 years ago

I'm not sure if this is the place to ask, but I'm wondering, given the state of the art for POS tagging as reported by: https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) - is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? I understand their reported accuracy might be on a completely different set of benchmarks, but have they been evaluated on spaCy's benchmarks? Are there licensing restrictions that keep them from being integrated as POS taggers? A parallel might be the vast number of pretrained language/embedding models implemented by HuggingFace's or TensorFlow Hub's repos, many of which are developed by people not directly associated with the repos themselves.

My understanding is that spaCy defaults to trading some accuracy for speed, and using the defaults means you get that compromise too. But spaCy is very hackable, so you can bring your own model (BYOM). They also link to (and, I think, made) spaCy Stanza, which is one way to use fancier models that are slower but more accurate.
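
For example, a minimal sketch of swapping in Stanza via the spacy-stanza wrapper (assuming spacy-stanza >= 1.0 and that the Stanza English models can be downloaded):

import stanza
import spacy_stanza

stanza.download("en")                   # one-time download of the Stanza English models
nlp = spacy_stanza.load_pipeline("en")  # slower than the default spaCy pipelines, but more accurate on some benchmarks
doc = nlp("Barack Obama was born in Hawaii.")
print([(token.text, token.pos_, token.dep_) for token in doc])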

andreas-wolf commented 3 years ago

None of the three German language models recognize Karlsruhe (a city) as LOC, although they do recognize smaller cities.

import spacy
nlp_de = spacy.load('de_core_news_lg')
doc_de = nlp_de('Ettlingen liegt bei Karlsruhe.')
for entity in doc_de.ents:
    print(entity.text, entity.label_)

Result:

Ettlingen LOC
Karlsruhe MISC
barataplastica commented 3 years ago

I noticed that the NER makes some mistakes when tagging text containing money amounts, depending on the numerical value, and I wonder if you could do something about it when training the next version of the models.

For instance, if I run the following notebook cell

import spacy
nlp = spacy.load('en_core_web_lg')

symbols = ["$", "£", "€", "¥"]
for symbol in symbols:
    print("-----------------------")
    print("Symbol: {}".format(symbol))
    print("-----------------------")
    for j in range (0, 2):
        for i in range (1, 10):
            text = nlp(symbol + str(j) + '.' + str(i) + 'mn favourable variance in employee expense over the forecast period')
            print (str(j) + '.' + str(i))
            for ent in text.ents:
                print(ent.text, ent.start_char, ent.end_char, ent.label_)

I get the following results:

-----------------------
Symbol: $
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.2mn 1 6 MONEY
0.3
0.3mn 1 6 MONEY
0.4
0.4mn 1 6 MONEY
0.5
0.5mn 1 6 MONEY
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.8mn 1 6 MONEY
0.9
0.9mn 1 6 MONEY
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.3mn 1 6 MONEY
1.4
1.4mn 1 6 MONEY
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.8mn 1 6 MONEY
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: £
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.2mn 1 6 MONEY
0.3
0.3mn 1 6 MONEY
0.4
0.4mn 1 6 ORG
0.5
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.9
0.9mn 1 6 MONEY
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.3mn 1 6 MONEY
1.4
1.4mn 1 6 MONEY
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.8mn 1 6 MONEY
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: €
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.3
0.3mn 1 6 MONEY
0.4
0.5
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.9
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.4
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: ¥
-----------------------
0.1
0.2
0.3
0.4
¥0.4mn 0 6 ORG
0.5
¥0.5mn 0 6 ORG
0.6
¥0.6mn 0 6 GPE
0.7
0.8
¥0.8mn 0 6 ORG
0.9
¥0.9mn 0 6 ORG
1.1
¥1.1mn 0 6 ORG
1.2
1.3
1.4
1.5
1.6
¥1.6mn 0 6 ORG
1.7
1.8
¥1.8mn 0 6 NORP
1.9
¥1.9mn 0 6 CARDINAL

Info about model 'en_core_web_lg'

(...) version 2.3.1
spacy_version >=2.3.0,<2.4.0

abh1nay commented 3 years ago

Facing an interesting issue with the large (and small) pre trained models

import spacy
nlp = spacy.load('en_core_web_lg')

text = "POL /hi there abcde ffff"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Output:
# POL 0 3 PERSON

If I remove the "/", it does not detect any entities (the expected behavior). Any idea why the leading slash throws the NER off?

This is observed using spacy 2.2.3

Btw, I also tried with the spaCy nightly and the TRF model; the issue does not exist with the transformer model.

adam-ra commented 3 years ago

spaCy 3.0 tags “tummy” as a determiner in “tummy ache”. I think this is serious, since determiners are closed-class words and there are arguably only about half a dozen of them in English, never more. Tagging any content word as a determiner is likely to cause many apps to ignore it.

Model: en_core_web_lg-3.0.0

In [1]: import spacy; nlp = spacy.load('en_core_web_lg')

In [2]: [tok.tag_ for tok in nlp('tummy ache')]
Out[2]: ['DT', 'NN']

gitclem commented 3 years ago

Here's a perverse case. Is Will a first name or an AUX MD? spaCy 3.0.1 with en_core_web_sm gets confused.

nlp = spacy.load('en_core_web_sm')
doc = nlp("Will Will Shakespeare write his will?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Will will AUX MD aux Xxxx True True
Will will AUX MD aux Xxxx True True
Shakespeare shakespeare VERB VB nsubj Xxxxx True False
write write VERB VB ROOT xxxx True False
his his PRON PRP$ poss xxx True True
will will NOUN NN dobj xxxx True True
? ? PUNCT . punct ? False False
gitclem commented 3 years ago

Here's another perverse case. Is May a first name or an AUX MD? spaCy 3.0.1 with en_core_web_sm gets confused.

nlp = spacy.load('en_core_web_sm')
doc = nlp("May May celebrate May Day with us?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

May May PROPN NNP ROOT Xxx True True
May may AUX MD aux Xxx True True
celebrate celebrate VERB VB ROOT xxxx True False
May May PROPN NNP compound Xxx True True
Day Day PROPN NNP npadvmod Xxx True False
with with ADP IN prep xxxx True True
us we PRON PRP pobj xx True True
? ? PUNCT . punct ? False False

However, this similar sentence treats May differently:

doc = nlp("May May come over for dinner?")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

May may AUX NNP aux Xxx True True
May may AUX MD aux Xxx True True
come come VERB VB ROOT xxxx True False
over over ADP RP prt xxxx True True
for for ADP IN prep xxx True True
dinner dinner NOUN NN pobj xxxx True False
? ? PUNCT . punct ? False False

And of course, there are a number of women's names, and some men's names, that could give a parser fits (Tom Swifty).

paulbriton commented 3 years ago

While upgrading from spaCy 2.3.1 to 3.0.0, I've noticed that some person entities are no longer detected:

import en_core_web_lg
nlp = en_core_web_lg.load()
doc = nlp('I am not Mr Foo.')
print(doc.ents) 
# (Mr Foo,) with 2.3.1
# () with 3.0.0
Riccorl commented 3 years ago

The multilingual model xx_sent_ud_sm does not tokenize Chinese sentences correctly, while the Chinese model zh_core_web_sm does. For example:

import spacy

nlp_ml = spacy.load("xx_sent_ud_sm")
nlp_ml.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究', '。']

nlp_zh = spacy.load("zh_core_web_sm")
nlp_zh.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括', '联合国', '机构', '和', '机制', '提出', '的', '有关', '建议', '以及', '现有', '的', '外部', '资料', '对', '有关', '国家', '进行', '筹备性', '研究', '。']

spaCy version is 3.0.0

adrianeboyd commented 3 years ago

@Riccorl : This is the expected behavior for the base xx tokenizer used in that model, which just doesn't work for languages without whitespace between tokens. It was a mistake to include the Chinese or Japanese training corpora in the xx_sent_ud_sm 3.0.0 model. They'll be omitted in the next release.

The zh_core_web_sm model uses a completely separate tokenizer based on pkuseg to do word segmentation.

Riccorl commented 3 years ago

@Riccorl : This is the expected behavior for the base xx tokenizer used in that model, which just doesn't work for languages without whitespace between tokens. It was a mistake to include the Chinese or Japanese training corpora in the xx_sent_ud_sm 3.0.0 model. They'll be omitted in the next release.

The zh_core_web_sm model uses a completely separate tokenizer based on pkuseg to do word segmentation.

Clear. Thank you for the explanation.

ezorita commented 3 years ago

There is a consistent issue with two-word adjectives in English. Hyphenated two-word adjectives that start with a preposition are tokenized apart, and the POS tagging model is unable to recognize them as an adjective. This causes the model to fail when extracting noun chunks (see attached image). To my understanding, "on-board" should be identified as ADJ and its dependency on "charger" should be amod, which would not break the noun-chunk dependency tree.

Another example (see attached image).

Two-word adjectives not starting with prepositions are properly detected (see attached images).

I tried the following pipelines and all have the same issue:

polm commented 3 years ago

@ezorita Had a quick look at the training data (OntoNotes) and it looks like "for-profit" is consistently annotated as a prepositional phrase with three tokens, while "non-profit" is a single token adjective. So it looks like this may just be a quirk of our training data.

hanryhu commented 3 years ago

Hi, I hope this is the right place to ask about bad parses in spaCy v3.0.6. The first two parses I tried were both disturbingly incorrect:

"John eats cake": "cake" is parsed as a verb. No verb can appear in that position in this sentence, and "cake" is rarely used as a verb at all (other than caking of sand, etc.). This affects both en_core_web_sm and en_core_web_md, but not en_core_web_lg.

"John eats salad": "salad" is parsed as a verb, and "eats" is parsed as an auxiliary (!!!). This affects only en_core_web_sm, not en_core_web_md or en_core_web_lg. Similar to the comment above (https://github.com/explosion/spaCy/issues/3052#issuecomment-777275069), auxiliaries are a closed class in English and should really never apply to the verb "eats".
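
For reference, a text-only version of the check behind the screenshots (a sketch):

import spacy

for model in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    nlp = spacy.load(model)
    for text in ("John eats cake", "John eats salad"):
        doc = nlp(text)
        print(model, [(token.text, token.pos_, token.dep_) for token in doc])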

honnibal commented 3 years ago

@hanryhu 😨

Thanks for the example, that definitely looks wrong. I wonder what's going on there, hm. I doubt salad is even an unseen word!

hanryhu commented 3 years ago

Hi, I have another example of a word that is incorrectly labelled as a closed-class word: in particular, the word "near" seems to always be parsed as SCONJ. Why might misparses like this be introduced in spaCy 3?

dorianve commented 2 years ago

There are some tokenization inconsistencies in the French models with some common sentence structures, such as questions (inversion of VERB and nsubj).

For instance, the following grammatical sentence in French, "La librairie est-elle ouverte ?" (is the bookshop open?), is tokenized as:

'La' 'librairie' 'est-elle' 'ouverte' '?'
'DET' 'NOUN' 'PRON' 'ADJ' 'PUNCT'

when it really should be:

'La' 'librairie' 'est' '-elle' 'ouverte' '?'
'DET' 'NOUN' 'VERB' 'PRON' 'ADJ' 'PUNCT'

Sometimes there is no problem, as in "La libraire veut-elle des bonbons ?" (Does the bookseller want candy?), which gives, as expected:

'La' 'libraire' 'veut' '-elle' 'des' 'bonbons' '?'
'DET' 'NOUN' 'VERB' 'PRON' 'DET' 'NOUN' 'PUNCT'

(Model used for generating those examples is fr_dep_news_trf )
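
For completeness, a minimal reproduction (sketch):

import spacy

nlp = spacy.load("fr_dep_news_trf")
for text in ("La librairie est-elle ouverte ?", "La libraire veut-elle des bonbons ?"):
    doc = nlp(text)
    print([(token.text, token.pos_) for token in doc])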

muchang commented 2 years ago

For this sentence, spaCy tags "down" as ADV. But it seems that "down" should be tagged as ADP, since "down" is movable around the object.

(postag) lab-workstation:$ cat test_spaCy.py 
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_trf', exclude=['lemmatizer', 'ner'])

sen = 'Hold the little chicken down on a flat surface .'
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(sen)

for token in doc:
    print(token.i, token.text, token.pos_, token.dep_, token.head, token.head.i)

(postag) lab-workstation:$  python test_spaCy.py 
0 Hold VERB ROOT Hold 0
1 the DET det chicken 3
2 little ADJ amod chicken 3
3 chicken NOUN dobj Hold 0
4 down ADV advmod Hold 0
5 on ADP prep Hold 0
6 a DET det surface 8
7 flat ADJ amod surface 8
8 surface NOUN pobj on 5
9 . PUNCT punct Hold 0

spaCy version: 3.1.0
Platform: Linux-4.15.0-48-generic-x86_64-with-debian-buster-sid
Python version: 3.7.10
Pipelines: en_core_web_trf (3.1.0), en_core_web_sm (3.1.0)

DuyguA commented 2 years ago

I have a complaint about the Portuguese models as well. The expression "um irmão meu" is parsed incorrectly by pt_core_news_md, while pt_core_news_sm works correctly. Here's the medium model's output for the dependency parse; there are two ROOTs:

>>> [token.dep_ for token in doc]
['det', 'ROOT', 'ROOT']

The second ROOT should be det.
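
A minimal reproduction (sketch, assuming the phrase is passed to the medium model on its own):

import spacy

nlp = spacy.load("pt_core_news_md")
doc = nlp("um irmão meu")
print([token.dep_ for token in doc])  # ['det', 'ROOT', 'ROOT'] instead of ['det', 'ROOT', 'det']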

joancf commented 2 years ago

Strange behavior with alignment and transformers. In the following text, the word "dog" in the second sentence is aligned with two token parts, but the second one is "speed", the next word. If I change the text a bit, the error disappears.

import spacy

spacyModel = "en_core_web_trf"
nlp = spacy.load(spacyModel)
text = "Mariano jumps over a lazy dog. The dog speed is 20 km / h. "
doc = nlp(text)
align = doc._.trf_data.align
tokens = doc._.trf_data.tokens
for tok, parts in zip(doc, align):
    # indices of the word parts aligned to this spaCy token
    part_ids = [x for y in parts.data for x in y]
    print(tok.text, parts.lengths, part_ids, '|'.join([tokens['input_texts'][0][part] for part in part_ids]))

produces this output:

Mariano [3] [1, 2, 3] M|arian|o
jumps [1] [4] Ġjumps
over [1] [5] Ġover
a [1] [6] Ġa
lazy [1] [7] Ġlazy
dog [1] [8] Ġdog
. [1] [9] .
The [1] [10] ĠThe
dog [2] [11, 12] Ġdog|Ġspeed
speed [1] [12] Ġspeed
is [1] [13] Ġis
20 [1] [14] Ġ20
km [1] [15] Ġkm
/ [1] [16] Ġ/
h. [2] [17, 18] Ġh|.

Maybe this should be filed as a bug?

gitclem commented 2 years ago


FYI, the Ġ unicode character is:

U+0120 : LATIN CAPITAL LETTER G WITH DOT ABOVE

My guess is that the 0x20 part is for a space, and (a wilder guess) the 0x01 might be the length of the space.

joancf commented 2 years ago

I don't care about the Ġ; I think it's part of the tokenizer (I did not check). The point is that "dog" has two parts, [11, 12], while "speed" also has [12], meaning that word part 12 is duplicated.

If you change the first sentence (just remove "lazy"), then the result seems correct:

text= "Mariano jumps over a dog. The dog speed is 20 km / h. "

Mariano [3] [1, 2, 3] M|arian|o
jumps [1] [4] Ġjumps
over [1] [5] Ġover
a [1] [6] Ġa
dog [1] [7] Ġdog
. [1] [8] .
The [1] [9] ĠThe
dog [1] [10] Ġdog
speed [1] [11] Ġspeed
is [1] [12] Ġis
20 [1] [13] Ġ20
km [1] [14] Ġkm
/ [1] [15] Ġ/
h. [2] [16, 17] Ġh|.
adrianeboyd commented 2 years ago

@joancf : I'm pretty sure this is an unfortunate side effect of not having a proper encoding for special tokens or special characters in the transformer tokenizer output. In the tokenizer output it looks identical when there was <s> in the original input and when the tokenizer itself has inserted <s> as a special token.

To align the two tokenizations, we're using an alignment algorithm that doesn't know anything about the special use of Ġ, but it does know about unicode and that this character is related to a g, so I think it ends up aligned like this because of the final g in "dog". If you replace the g with a different letter, it's not aligned like this.

We've also had issues with <s> being aligned with the letter s and since the tokenizer does have settings that show what its special tokens are, we try to ignore those while aligning when possible, but the various tokenizer algorithms vary a lot and in the general case we don't know which parts of the output are special characters.

If we want to drop support for slow tokenizers, I think we can potentially work with the alignment returned by the tokenizer, but we haven't gotten it to work consistently in the past and for now we're using this separate alignment method. I suspect this kind of alignment happens relatively rarely and doesn't affect the final annotation too much.

joancf commented 2 years ago

Hi @adrianeboyd, I think I can survive with this odd, occasional issue, but when it comes to longer texts the parts seem totally broken. It seems (please correct me if I'm wrong) that to process a long text, the nlp pipe internally splits it into blocks of ~145 original tokens (I don't know how to increase or modify that number), and the results in trf_data are also per block (all blocks have the same size, which changes from document to document to the maximum number of word parts in any of them). But trf_data.align is a single ragged array with the same length as the document's tokens.

In the following code, I tried with a long text:

text="""  This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
And of course one would expect that the word sentence should have the same embedding in each sentence.
Well, maybe slightly different but not very different, so we can chen similarity.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
And of course one would expect that the word sentence should have the same embedding in each sentence.
Well, maybe slightly different but not very different, so we can chen similarity.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
"""
import spacy

spacyModel = "en_core_web_trf"
nlp = spacy.load(spacyModel)
doc = nlp(text)
toks = [tok for tok in doc]
print(f"tokens length {len(toks)}")
align = doc._.trf_data.align
tokens = doc._.trf_data.tokens
trf = doc._.trf_data.tensors
print(f" number of data wordParts  {len(align)} distributed in  {len(tokens['input_texts'])} chunks of size  {len(tokens['input_texts'][0])} and transformers shape {trf[0].shape}  ")
# we can flatten the inputs and the tensors (as ndarrays) to apply the alignment more easily
x, y, z = trf[0].shape
trf[0].shape = (1, -1, z)
print(trf[0].shape)
inputs = [x for ins in tokens['input_texts'] for x in ins]
print(f"size of flatten inputs and  tensors: {len(inputs)} , {trf[0].shape}")
for tok, parts in zip(toks, align):
    # indices of the word parts aligned to this spaCy token
    part_ids = [x for y in parts.data for x in y]
    print(tok.text, part_ids, '|'.join(inputs[part] for part in part_ids))

produces these outputs

 number of data wordParts  369 distributed in  4 chunks of size  147 and transformers shape (4, 147, 768)  
(1, 588, 768)
size of flatten inputs and  tensors: 588 , (1, 588, 768)
   [] 
This [2] ĠThis
is [3] Ġis
a [4] Ġa
test [5] Ġtest
sentence [6] Ġsentence
to [7] Ġto
check [8] Ġcheck
how [9] Ġhow
spacy [10, 11] Ġsp|acy
processes [12] Ġprocesses
long [13, 14] Ġlong|Ġtexts
texts [14] Ġtexts
.... more stuff here...
sentence [107] Ġsentence
should [108] Ġshould
have [109, 148] Ġhave|have
the [110, 149] Ġthe|Ġthe
same [111, 150] Ġsame|Ġsame
embedding [112, 113, 114, 151, 152] Ġembed|ding|Ġin|Ġembed|ding
in [114, 153] Ġin|Ġin
each [115, 154] Ġeach|Ġeach
sentence [116, 155] Ġsentence|Ġsentence
. [117, 156] .|.
......
. [135, 174] .|.

 [] 
This [137, 176] This|This
is [138, 177] Ġis|Ġis
a [139, 178] Ġa|Ġa
test [140, 179] Ġtest|Ġtest
sentence [141, 180] Ġsentence|Ġsentence
to [142, 181] Ġto|Ġto
check [182] Ġcheck
how [183] Ġhow
spacy [184, 185] Ġsp|acy
processes [186] Ġprocesses
long [187] Ġlong
texts [188] Ġtexts

So, the parts of "long" are wrong (the error mentioned before). But after position 109 ("have"), the sets of token parts show duplications, with a distance of about 40 positions between the parts; it seems to be mixing up one segment and the next. This happens up to position 142, where it jumps to 182 and seems to continue with correct results (a single token part each), up to token 403 where the same behavior appears again.

[Edited] It is not a bug; I'll answer myself. The reason this happens is that batches include the last part of the previous segment to give context, so the last words of a segment appear twice and therefore have two embeddings. E.g. the word at position 109 is duplicated in the next segment (starting at 148), and this happens for the next ~40 word parts. So the embedding of "have" should be the "average" of the embeddings at [109, 148].

This means that nearly 1/4 (~40/160) of the tokens are processed twice.
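
For illustration, a rough sketch (under the assumptions above, reusing the flattened inputs/tensors from the snippet) of how one might pool a token's vector by averaging the transformer rows of all word parts aligned to it:

import numpy as np

def token_vector(doc, flat_tensor, token_index):
    # Average the transformer rows of every word part aligned to this token.
    # flat_tensor is assumed to be the flattened (1, n_wordparts, width) array from above.
    align = doc._.trf_data.align
    part_ids = [i for row in align[token_index].data for i in row]
    return np.mean([flat_tensor[0, i] for i in part_ids], axis=0)

# e.g. the vector for "have" would average the duplicated word parts [109, 148]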

d-e-h-i-o commented 2 years ago

de_core_news_sm does not correctly infer sentence boundaries for gendered sentences in German. In German, a current trend is to gender plurals of a word to make clear that women are included (since the generic plural is masculine). E.g. 'Kunde' (customer) becomes 'Kund:innen' (or 'Kund*innen' or 'Kund_innen' or 'KundInnen').

The sentence 'Selbstständige mit körperlichen Kund:innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.' gets split into two sentences at the colon, even though it is a single sentence.

To reproduce the issue:

import spacy
nlp = spacy.load("de_core_news_sm")
sentence = 'Selbstständige mit körperlichen Kund:innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.'
print(list(nlp(sentence).sents))

yields [Selbstständige mit körperlichen Kund:, innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.] (so two sentences instead of one).

de_core_news_md handles this correctly.

WalrusSoup commented 2 years ago

The core en_core_web_sm-3.0.0 model seems to have trouble detecting organization entities if the brand name contains some sort of pronoun/possessive ("my"), even when the text directly calls it a company to supply context. Sometimes it thinks it is a CARDINAL, other times MONEY.

[screenshot]

Or in the demo:

[screenshot]

Passing it any sentence with brand names that contain this kind of language appears to introduce a lot of consistency issues.

peterolson commented 2 years ago

zh_core_web_trf is not detecting sentence boundaries correctly in Chinese.

nlp = spacy.load("zh_core_web_trf")
doc = nlp("我是你的朋友。你是我的朋友吗?我不喜欢喝咖啡。")

This should be three separate sentences, but the sents property contains only one.
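
For reference, the check is simply (a sketch):

sents = list(doc.sents)
print(len(sents), sents)  # expected 3 sentences, but only 1 is returned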

HaakonME commented 2 years ago

The default stop words for Norwegian Bokmål (nb) in spaCy contain important entities, e.g. France, Germany, Russia, Sweden and the USA, "police district", important units of time (e.g. months and days of the week), and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

The error candidates need to be removed from the default list of stop words; please see the attached spreadsheet, which contains the Norwegian Bokmål word, its English translation, whether it is an error candidate, and a short comment about why.

While infecting the default list of stop words could be considered an attack vector, a way of "poisoning the well", this is probably due to a local stop word list having been committed to the central repository at some time by someone.

Below are the steps needed to reproduce the list of stop words in Norwegian bokmål.

Stop words in Norwegian Bokmål

# Import Spacy

import spacy

# Import Norwegian bokmål from Norwegian language

from spacy.lang.nb import Norwegian

# Importing stop words from Norwegian bokmål language

spacy_stopwords = spacy.lang.nb.stop_words.STOP_WORDS

# Printing the total number of stop words:

print('Default number of stop words in Norwegian bokmål in Spacy: %d' % len(spacy_stopwords))

# Printing stop words:

print('Default stop words in Norwegian bokmål in Spacy: %s' % list(spacy_stopwords)[:249])

Default stop words in Norwegian bokmål in Spacy: ['har', 'fjor', 'dem', 'får', 'oss', 'det', 'gikk', 'svært', 'tillegg', 'fem', 'fram', 'noe', 'ifølge', 'kontakt', 'og', 'få', 'ut', 'blant', 'fikk', 'være', 'mellom', 'videre', 'tyskland', 'der', 'tid', 'mot', 'bak', 'mål', 'ikke', 'laget', 'saken', 'landet', 'utenfor', 'bris', 'hennes', 'kom', 'seks', 'ha', 'hva', 'leder', 'å', 'denne', 'gjør', 'regjeringen', 'del', 'sted', 'man', 'funnet', 'prosent', 'bare', 'satt', 'gå', 'menn', 'tirsdag', 'nok', 'vært', 'her', 'en', 'ser', 'fredag', 'veldig', 'at', 'også', 'komme', 'først', 'kort', 'annen', 'gjennom', 'nye', 'når', 'kunne', 'annet', 'oslo', 'igjen', 'skulle', 'frankrike', 'i', 'et', 'klart', 'land', 'henne', 'meg', 'kveld', 'uten', 'president', 'drept', 'fire', 'kroner', 'under', 'fotball', 'fortsatt', 'ta', 'gjort', 'var', 'blir', 'politiet', 'av', 'fra', 'etter', 'sett', 'eller', 'bedre', 'inn', 'mens', 'andre', 'ny', 'på', 'til', 'ligger', 'helt', 'personer', 'ingen', 'ved', 'god', 'ville', 'and', 'vant', 'kvinner', 'som', 'politidistrikt', 'tror', 'slik', 'tre', 'tatt', 'løpet', 'store', 'viktig', 'kl', 'siste', 'måtte', 'like', 'for', 'flere', 'lørdag', 'millioner', 'allerede', 'usa', 'mars', 'seg', 'mannen', 'samme', 'sier', 'stor', 'mandag', 'jeg', 'noen', 'mange', 'mennesker', 'hvorfor', 'vi', 'ja', 'ntb', 'år', 'dette', 'beste', 'neste', 'står', 'litt', 'kampen', 'by', 'nå', 'sa', 'selv', 'vil', 'mye', 'gang', 'opp', 'bli', 'ble', 'er', 'godt', 'siden', 'russland', 'de', 'la', 'ett', 'stedet', 'før', 'norske', 'om', 'opplyser', 'ham', 'ned', 'kommer', 'rundt', 'tilbake', 'du', 'hans', 'kamp', 'minutter', 'gjøre', 'gjorde', 'september', 'den', 'sitt', 'sammen', 'hvor', 'to', 'så', 'han', 'sin', 'samtidig', 'viser', 'da', 'dag', 'grunn', 'alle', 'norge', 'msci', 'fått', 'hele', 'går', 'men', 'mener', 'norsk', 'se', 'ønsker', 'gi', 'hun', 'disse', 'hadde', 'plass', 'både', 'alt', 'torsdag', 'første', 'skal', 'må', 'søndag', 'kan', 'vår', 'senere', 'langt', 'tok', 'folk', 'dermed', 'med', 'mer', 'sverige', 'blitt', 'poeng', 'enn', 'over', 'runde', 'sine', 'tidligere', 'skriver', 'onsdag', 'hvordan'] ` 2021-12-06 NLP Spacy - stop words in Norwegian bokmål model - error candidates.xlsx

EDIT: Wow, these stop word errors have been in the Norwegian bokmål file since 2017! o_O See https://github.com/explosion/spaCy/blob/f46ffe3e893452bf0c171c6c7fcf3b0e458c8f9e/spacy/lang/nb/stop_words.py

svlandeg commented 2 years ago

Hi @HaakonME !

There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

I do want to point out that we don't typically recommend filtering out stop words, as with today's modern neural network approaches this is rarely needed or even useful. That said, some users do rely on them for various preprocessing needs, and I definitely agree with you that they should not contain meaningful words.

Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.

If you would feel up to the challenge, we'd appreciate a PR to address some of the most obvious mistakes in the stop word lists. Ideally, that PR should be based off of our develop branch, because we consider changing the current stop words as slightly breaking, and would keep the change for 3.3 (in contrast, the current master branch will power the next 3.2.1 release).

If you need help for creating the PR, I could recommend reading the section over at https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#getting-started and we can try to guide you as well :-)

HaakonME commented 2 years ago

Hi @svlandeg !

I have proposed a change to remove NER words from Norwegian stop words in the develop branch as suggested. :-)

Cheers, Haakon

polm commented 2 years ago

@peterolson Sorry for the late reply, but thanks for reporting this. It does seem that the zh trf model really avoids recognizing short sentences.

We took a look at our training data (OntoNotes) and didn't find anything obviously wrong, but we'll keep looking at it.

peterolson commented 2 years ago

Spanish tokenization is broken when there is no space between question sentences "?¿"

nlp = spacy.load("es_dep_news_trf")
doc = nlp("¿Qué quieres?¿Por qué estás aquí?")

quieres?¿Por is treated as one token, but there should be a sentence boundary between "?" and "¿", and "quieres" and "Por" should be separate tokens.
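
A quick way to see the merged token (sketch):

print([token.text for token in doc])
# 'quieres?¿Por' shows up as a single token instead of 'quieres', '?', '¿', 'Por'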

narayanacharya6 commented 2 years ago

NER recognizing 's as an entity in the en_core_web_sm and en_core_web_lg models. Example below:

import spacy
content = """3 WallStreetBets Stocks to Trade | Markets Insider InvestorPlace - Stock Market News, Stock Advice & Trading Tips
What’s the next big thing on Wall Street? These days it might just be what’s trending or more specifically, receiving big-time mentions on WallStreetBets. Or not. The name in question might already be a titan of commerce.
Today let’s take a look at three such WallStreetBets stocks’ price charts and determine what’s technically hot and what’s not for your portfolio.
Reddit’s r/WallStreetBets. The seat-of-your-pants trading forum has made quite the name for itself in 2021. But you don’t need me to tell you that, right? Right."""

nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")
nlp_lg = spacy.load("en_core_web_lg")

nlp_sm(content).ents
Out[16]: (3, Stock Advice & Trading Tips, Today, ’s, three, 2021)

nlp_md(content).ents
Out[17]: (3, Stock Advice & Trading Tips, Today, three, 2021)

nlp_lg(content).ents
Out[18]: (3, These days, Today, ’s, three, Reddit, 2021)

Version Info:

pip list | grep spacy
spacy                             3.0.6
spacy-alignments                  0.8.3
spacy-legacy                      3.0.8
spacy-stanza                      1.0.0
spacy-transformers                1.0.2
polm commented 2 years ago

@narayanacharya6 Cannot reproduce with 3.2. Can you upgrade and try again? Also include your model versions (spacy info).

Note that ’s and 's are not the same, and the non-ASCII version is probably not in our training data. I suspect we fixed this with character augmentation at some point.
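
(For illustration, a quick way to confirm the two apostrophes are different characters:)

>>> "’s" == "'s"
False
>>> hex(ord("’")), hex(ord("'"))
('0x2019', '0x27')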

narayanacharya6 commented 2 years ago

Outputs in the previous comment were based on model version 3.0.0. Tried version 3.2.0, and ’s is no longer identified as an entity. Thanks!

cyriaka90 commented 2 years ago

For the German sentence "Die Ärmel der Strickjacke haben am Armabschluss ein Bündchen." in v3.2.1 "Die Ärmel" is parsed as Fem Singular instead of Masc Plural; in v3.1.4 the determiner "Die" was correctly parsed as Masc Plural ("Case=Nom|Definite=Def|Gender=Masc|Number=Plur|PronType=Art").

For the English sentence "Kennedy got killed.", "got" is lemmatized to "got" instead of "get".

saurav-chakravorty commented 2 years ago

Sorry for posting an unrelated point here, but I could not figure out a better place. Is there a reference to the model architecture / training code for the public models published by spaCy (e.g. 'en_core_web_md')? I looked at the spaCy models repo, but that has model files and meta information, not the actual training code.

polm commented 2 years ago

@saurav-chakravorty If you have a question it's better to open a new Discussion than to post in an unrelated thread.

The training code is not public, partly because the training data requires a license (like OntoNotes for English), partly because a lot of it is infra-related and not of public interest.

mgrojo commented 2 years ago

Stop words in Spanish contain many significant words

The list of stop words in Spanish contains many not very frequent verb forms and unusual words. Compared to the English list, there are many more words and many of them seem meaningful. It's a very strange selection.

Meaningful not very frequent words:

It even contains misspelled words (and the kind of misspelling that is not frequent):

Update: I also noticed that there aren't any one-letter stop words, while in English, 'a' and 'i' are included in the list. In Spanish, these letters could be considered stop words:

https://github.com/explosion/spaCy/blob/master/spacy/lang/es/stop_words.py
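
For reference, a quick way to inspect the current list (a sketch; the one-letter check reflects the observation above):

from spacy.lang.es.stop_words import STOP_WORDS

print(len(STOP_WORDS))
print(sorted(word for word in STOP_WORDS if len(word) == 1))  # currently empty: no one-letter stop words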

polm commented 2 years ago

@mgrojo Thanks for pointing that out! If you'd like to open a PR we'd be happy to review it.

mgrojo commented 2 years ago

@polm Thanks. I've already made that pull request.

dblandan commented 2 years ago

Some weirdness in de_core_news_md-3.3.0... I'm interested in lemmas, and I found that the lemma of Hässliche varies depending on the context:

>>> nlp = spacy.load('de_core_news_md')
>>> [(x.lemma_, x.pos_) for x in nlp('Die neuste philosofische Prägung wird Hässliche genannt.')]
[('der', 'DET'), ('neuste', 'ADJ'), ('philosofisch', 'ADJ'), ('Prägung', 'NOUN'), ('werden', 'AUX'), ('hässliche', 'NOUN'), ('nennen', 'VERB'), ('--', 'PUNCT')]
>>> [(x.lemma_, x.pos_) for x in nlp('die Hässliche')]
[('der', 'DET'), ('Hässliche', 'NOUN')]
adrianeboyd commented 2 years ago

@dblandan The v3.3 German models switched from a lookup lemmatizer that only used the word form (no context) to a statistical lemmatizer where the output does depend on the context.

dblandan commented 2 years ago

@dblandan The v3.3 German models switched from a lookup lemmatizer that only used the word form (no context) to a statistical lemmatizer where the output does depend on the context.

So there are different lexical entries for hässliche (NOUN) and Hässliche (NOUN), and one of them is capitalized while the other isn't. I'm ok with there being different entries, but I don't understand why one isn't capitalized given that it's still a noun. :thinking:

The adjectival form lemmatizes correctly to hässlich.

For reference:

hässlich  ADJ  17149702774860831989
hässliche NOUN 5552098829343672028
Hässliche NOUN 17159517463969337747
adrianeboyd commented 2 years ago

The difference is that it's not looking up word forms in a table anymore, so it's not just based on an entry related to the POS or the word form. The lemmatizer is a statistical model like the tagger that uses the context to predict the lemmas based on the training data. For more details about how it works: https://explosion.ai/blog/edit-tree-lemmatizer

dblandan commented 2 years ago

I see. I knew that the edit-tree lemmatizer was coming; I'm still surprised about this particular output. I'll just handle it in post-processing. Thanks for the reply! :smile:

mathcass commented 2 years ago

👋🏻 🤗

Let me know if there's a better place for this. I came across odd behavior from the English lemmatizer that seemed worth reporting.

Here are minimal reproduction steps showing that, in certain circumstances, the lemmatizer predicts/maps "guys" -> "you":

>>> import spacy
>>> spacy.__version__
'3.2.4'
>>> nlp = spacy.load("en_core_web_md")
>>> nlp("The guys all")[1].lemma_
'you'