explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Offset misalignment in NER StanzaLanguage Tokenizer #33

Closed aishwarya-agrawal closed 4 years ago

aishwarya-agrawal commented 4 years ago
text = """ Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"""
doc = snlp(text)
print([(e.text, e.label_, text[e.start_char:e.end_char]) for e in doc.ents])

Gives the output:

UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Ents: [('Caffeine use', 'disease', 61, 73)]
doc = snlp(text)
[]

Printing the two texts, i.e. snlp_doc.text and doc.text, gives the following:

snlp_doc.text = " Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"
doc.text =   "  Tobacco / Smoke Exposure Family members smoke indoors , Daily . Caffeine use Coffee ,"

Because of this mismatch the warning above is raised and the identified entities are lost. This happens even with the basic config mentioned in the README (screenshot below; a setup sketch follows):

[screenshot: basic spacy-stanza config from the README]
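For context, the basic setup from the spacy-stanza README at the time looked roughly like the sketch below. The default English model is used here purely to illustrate the wiring; the model that actually produced the disease label above is not shown in this report and is presumably a custom/biomedical one.

import stanza
from spacy_stanza import StanzaLanguage

stanza.download("en")                  # only needed once
pipeline = stanza.Pipeline(lang="en")  # raw stanza pipeline
snlp = StanzaLanguage(pipeline)        # spaCy wrapper, used as `snlp` above

text = """ Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"""
doc = snlp(text)
print([(e.text, e.label_, text[e.start_char:e.end_char]) for e in doc.ents])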

redadmiral commented 4 years ago

I encounter the same problem with the German language model. When the input text contains contractions such as am, the model replaces them with the two underlying words (an dem), but apparently does not update the character offsets against the original input.

The input Hans Müller isst gerne Vanilleeis am Hamburger Dom. returns the entity dem Hamburger instead of Hamburger Dom:

In [21]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom.")                        

In [22]: doc.ents                                                                                
Out[22]: (Hans Müller, dem Hamburger)

Everything is fine as long as the contraction am is already split into an dem in the input:

In [23]: doc = nlp("Hans Müller isst gerne Vanilleeis an dem Hamburger Dom.")                    

In [24]: doc.ents                                                                                
Out[24]: (Hans Müller, Hamburger Dom)
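The expansion itself can be observed directly in Stanza's tokenizer output. A minimal sketch (German models assumed to be downloaded, only the tokenize and mwt processors loaded):

import stanza

# The multi-word-token (MWT) processor expands the contraction "am" into the
# two words "an" and "dem", while the token itself keeps the original surface form.
nlp_de = stanza.Pipeline(lang="de", processors="tokenize,mwt")
doc = nlp_de("Hans Müller isst gerne Vanilleeis am Hamburger Dom.")

for token in doc.sentences[0].tokens:
    words = [w.text for w in token.words]
    if len(words) > 1:
        print(token.text, "->", words)   # am -> ['an', 'dem']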

This seems to be a problem with the spaCy wrapper, since the stanza package itself produces the correct output:

In [3]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom")                          

In [4]: doc.ents                                                                                 
Out[4]: 
[{
   "text": "Hans Müller",
   "type": "PER",
   "start_char": 0,
   "end_char": 11
 },
 {
   "text": "Hamburger Dom",
   "type": "LOC",
   "start_char": 37,
   "end_char": 50
 }]
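(For reference, the nlp used in the raw-Stanza comparison above is presumably a plain stanza pipeline rather than the spaCy wrapper; a minimal sketch of that setup, with the German model assumed:)

import stanza

stanza.download("de")                    # only needed once
nlp_stanza = stanza.Pipeline(lang="de")  # plain stanza pipeline, no spaCy wrapper

doc = nlp_stanza("Hans Müller isst gerne Vanilleeis am Hamburger Dom")
for ent in doc.ents:
    print(ent.text, ent.type, ent.start_char, ent.end_char)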

The German model does not seem to have any issues with the special characters/punctuation that @aishwarya-agrawal encountered, though.

In [28]: doc = nlp("Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")             

In [29]: doc.text                                                                                
Out[29]: 'Hans Müller isst gerne Vanilleeis/Himbeereis an dem Hamburger Dom '
aishwarya-agrawal commented 4 years ago

@redadmiral Please try this: doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")

Please note the space at the beginning of the sentence.

redadmiral commented 4 years ago

Oh, okay – this leads to the same warning you encountered:

In [30]: doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")            
<ipython-input-30-b238c1353442>:1: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Hans', 'Müller', 'isst', 'gerne', 'Vanilleeis/Himbeereis', 'an', 'dem', 'Hamburger', 'Dom']
Entities: [('Hans Müller', 'PER', 1, 12), ('Hamburger Dom', 'LOC', 49, 62)]
  doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")
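Until the offset handling is fixed in the wrapper, a possible mitigation that follows from the observations above is to normalize whitespace before calling the pipeline. This is only a sketch of a workaround for the leading-whitespace trigger, not a fix for the underlying alignment bug, and the returned offsets then refer to the normalized text rather than the original input:

import re

def normalize_ws(text):
    # Collapse runs of whitespace and strip leading/trailing spaces; the extra
    # leading space is what throws off the character offsets in this report.
    return re.sub(r"\s+", " ", text).strip()

doc = nlp(normalize_ws(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom."))
print(doc.ents)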