Thanks – that's definitely strange, especially the fact that it changes when the sentence is parsed again. I wonder if this could be an interaction with spaCy's existing tokenizer exceptions and the way the lexical entries are cached 🤔
The underlying difficulty here is that spaCy assumes the tokenization to always be non-destructive – so all whitespace is preserved in a token attribute and the original text can always be restored. The tokenization of the StanfordNLP models doesn't always align with the input text: you can have a string "zum", which gets split into ["zu", "dem"] (as opposed to ["zu", "m"]). It also doesn't preserve whitespace, so we have to guess that (see here – we currently check if the tokens align, and if not, we don't try to restore the whitespace).
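For context, the non-destructive invariant is easy to see in plain spaCy – a minimal sketch with a blank German pipeline, independent of this wrapper:

```python
import spacy

# Plain spaCy: every token stores its trailing whitespace, so the
# original input string can always be reconstructed exactly.
nlp = spacy.blank("de")
doc = nlp("Wir gehen zum Haus.")

assert "".join(token.text_with_ws for token in doc) == doc.text
```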
It seems that the problem is rooted in the function `get_tokens_with_heads(self, snlp_doc)` in `language.py`: it iterates over `token.words`, which in the case of a contracted token contains the two expanded words (see the sketch after the two JSON dumps below). Cf. the Token object:
```json
[
  {
    "id": "2-3",
    "text": "zum",
    "ner": "O",
    "misc": "start_char=4|end_char=7"
  },
  {
    "id": "2",
    "text": "zu",
    "lemma": "zu",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 4,
    "deprel": "case"
  },
  {
    "id": "3",
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Neut|Number=Sing|PronType=Art",
    "head": 4,
    "deprel": "det"
  }
]
```
... and the Token.words:
```json
[
  {
    "id": "2",
    "text": "zu",
    "lemma": "zu",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 4,
    "deprel": "case"
  },
  {
    "id": "3",
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Neut|Number=Sing|PronType=Art",
    "head": 4,
    "deprel": "det"
  }
]
```
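The iteration that produces this looks roughly like the following (a simplified sketch of the logic in `language.py`, not the actual implementation; the head arithmetic is omitted):

```python
def get_tokens_with_heads(snlp_doc):
    # Simplified sketch: collect the words of every token in the
    # StanfordNLP doc. For a contraction like "zum", token.words is
    # the expanded pair ("zu", "dem"), so the contracted surface form
    # "zum", which carries the NER tag, is never visited at all.
    tokens = []
    for sentence in snlp_doc.sentences:
        for token in sentence.tokens:
            for word in token.words:
                tokens.append(word)
    return tokens
```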
As the two JSON dumps show, the NER information is on the contracted form (which is not part of Token.words); hence the alignment mismatch. One would therefore have to decide which tokenisation spaCy should take and what information to pass on. For the case at hand, one could for example create something like:
```json
{
  "id": "2",
  "text": "zum",
  "lemma": "zu dem",
  "upos": "ADP",
  "xpos": "APPR",
  "head": 3,
  "deprel": "case",
  "ner": "O",
  "misc": "start_char=4|end_char=7"
}
```
NOTE: I set the head to 3 here, since re-contracting the token means we either renumber the enumeration (like here: since we kick out one of the ids, we would need to incorporate this in the offset calculation), or – and I think this would be worse – keep no token number 3 and leave the head pointer at 4... However, I am not sure which way is preferred.
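For illustration, a re-contraction step along those lines could look like this (a rough sketch over the JSON dicts above; `merge_contraction` and its merging policy are assumptions, not existing code):

```python
def merge_contraction(surface, words):
    # Illustrative only: collapse the expanded words of a contraction
    # back into one token, keeping the surface text and the NER tag.
    first = words[0]
    return {
        "id": first["id"],
        "text": surface["text"],                       # "zum"
        "lemma": " ".join(w["lemma"] for w in words),  # "zu dem"
        # Morphology is taken from the first word here; a real
        # implementation would need a policy for combining these.
        "upos": first["upos"],
        "xpos": first["xpos"],
        # Renumber the head: one id is removed by the merge. This is
        # only correct when the head follows the contraction, as with
        # "zum" (head 4 becomes 3).
        "head": first["head"] - (len(words) - 1),
        "deprel": first["deprel"],
        "ner": surface.get("ner", "O"),
        "misc": surface.get("misc", ""),
    }
```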
Just going through some older issues...
I think this is fixed at the latest in spacy-stanza v1. I can't reproduce it with v1.0.4.
Please feel free to reopen if you're still running into issues!
I downloaded and loaded the German model as described in the docs.
When parsing German contractions such as "im", "am", "zum" etc., I noticed some weird behavior. The first time around everything is fine, but on consecutive parses, some tokens are duplicated and some omitted. It's best to look at an example.
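A minimal reproduction looks roughly like this (package and class names as described in the wrapper's docs; the broken output is omitted here rather than guessed):

```python
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

# Assumes the German model was fetched via stanfordnlp.download("de")
snlp = stanfordnlp.Pipeline(lang="de")
nlp = StanfordNLPLanguage(snlp)

# The first parse of a contraction looks fine...
print([t.text for t in nlp("Wir gehen zum Haus.")])
# ...but parsing the same sentence again duplicates/omits tokens.
print([t.text for t in nlp("Wir gehen zum Haus.")])
```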
Parsing other contractions afterwards also doesn't work.
But parsing other sentences with no contractions is alright.
Here is a list of German contractions. However, it doesn't break for all of them, as some are not split into separate tokens.
P.S.: It shouldn't really matter, but I'm running this in a Jupyter notebook.