Closed aishwarya-agrawal closed 4 years ago
I encounter the same problem using the German language model. When the input text contains contractions (synaereses) such as "am", the model replaces each one with its two originating words ("an dem"), but apparently doesn't update the character offsets of the input.
The input "Hans Müller isst gerne Vanilleeis am Hamburger Dom." returns the entity "dem Hamburger" instead of "Hamburger Dom":
In [21]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom.")
In [22]: doc.ents
Out[22]: (Hans Müller, dem Hamburger)
Everything is fine, however, as long as the contraction "am" is already split into "an dem" in the input:
In [23]: doc = nlp("Hans Müller isst gerne Vanilleeis an dem Hamburger Dom.")
In [24]: doc.ents
Out[24]: (Hans Müller, Hamburger Dom)
This seems to be a problem with the spaCy wrapper, since the stanza package itself provides the correct output:
In [3]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom")
In [4]: doc.ents
Out[4]:
[{
"text": "Hans Müller",
"type": "PER",
"start_char": 0,
"end_char": 11
},
{
"text": "Hamburger Dom",
"type": "LOC",
"start_char": 37,
"end_char": 50
}]
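The mismatch can be reproduced with plain string slicing. The following is a minimal sketch, not the wrapper's actual code: it assumes the wrapper rebuilds the text with "am" expanded to "an dem" and then applies Stanza's offsets (37, 50), which were computed against the original text. The names `original` and `expanded` are made up for illustration.

```python
# Illustration only: plain string slicing, no spaCy or Stanza involved.
original = "Hans Müller isst gerne Vanilleeis am Hamburger Dom"

# Assumption: the reconstructed text has the contraction "am" expanded to "an dem".
expanded = original.replace(" am ", " an dem ")

# Character offsets reported by Stanza for the LOC entity (see output above).
start_char, end_char = 37, 50

# Against the original text the offsets select the right span.
span_in_original = original[start_char:end_char]
# Against the expanded text the same offsets are shifted by the inserted characters.
span_in_expanded = expanded[start_char:end_char]

print(span_in_original)  # Hamburger Dom
print(span_in_expanded)  # dem Hamburger
```

Under that assumption, the wrong entity "dem Hamburger" is exactly what the unshifted offsets select in the expanded text.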
It seems like the German model does not have any issues with special characters/punctuation of the kind @aishwarya-agrawal has encountered.
In [28]: doc = nlp("Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")
In [29]: doc.text
Out[29]: 'Hans Müller isst gerne Vanilleeis/Himbeereis an dem Hamburger Dom '
@redadmiral Please try this:
doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")
Note the space at the beginning of the sentence.
Oh, okay – this leads to the same warning you encountered:
In [30]: doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")
<ipython-input-30-b238c1353442>:1: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Hans', 'Müller', 'isst', 'gerne', 'Vanilleeis/Himbeereis', 'an', 'dem', 'Hamburger', 'Dom']
Entities: [('Hans Müller', 'PER', 1, 12), ('Hamburger Dom', 'LOC', 49, 62)]
doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")
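To see why these offsets are rejected, here is a rough sketch that checks each entity against token boundaries. It assumes the token offsets come from joining the listed words with single spaces, which may not match how the wrapper actually builds the doc; `token_boundaries` and `is_aligned` are hypothetical helpers, not part of spaCy or Stanza. (In spaCy itself, `Doc.char_span` returns `None` when offsets do not fall on token boundaries.)

```python
def token_boundaries(words):
    """Return sets of start and end offsets, assuming single-space joins."""
    starts, ends = set(), set()
    pos = 0
    for word in words:
        starts.add(pos)
        pos += len(word)
        ends.add(pos)
        pos += 1  # one assumed space between consecutive words
    return starts, ends


def is_aligned(entity, starts, ends):
    """True if the entity's start and end both fall on token boundaries."""
    _text, _label, start, end = entity
    return start in starts and end in ends


# Words and entities copied from the warning above.
words = ['Hans', 'Müller', 'isst', 'gerne', 'Vanilleeis/Himbeereis',
         'an', 'dem', 'Hamburger', 'Dom']
entities = [('Hans Müller', 'PER', 1, 12), ('Hamburger Dom', 'LOC', 49, 62)]

starts, ends = token_boundaries(words)
alignment = [is_aligned(entity, starts, ends) for entity in entities]
print(alignment)  # [False, False]
```

Both entities fail the check: the offsets refer to the original input, with its leading space and unexpanded "am", while the token boundaries refer to the reconstructed words.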
Gives the output:
Printing the two texts, i.e.
snlp_doc.text, doc.text
gives the following texts:
Because of this, the above error occurs and we lose the identified entities, even with the basic configs mentioned in the README: