Thanks – that's definitely strange, especially the fact that it changes when the sentence is parsed again. I wonder if this could be an interaction with spaCy's existing tokenizer exceptions and the way the lexical entries are cached 🤔
The underlying difficulty here is that spaCy assumes the tokenization to always be non-destructive – so all whitespace is preserved in a token attribute and the original text can always be restored. The tokenization of the StanfordNLP models doesn't always align with the input text: you can have a string "zum", which gets split into ["zu", "dem"] (as opposed to ["zu", "m"]). It also doesn't preserve whitespace, so we have to guess that (see here – we currently check if the tokens align, and if not, we don't try to restore the whitespace).
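For context, the non-destructive invariant is easy to see in plain spaCy – a minimal sketch with a blank German pipeline, independent of this wrapper:

```python
import spacy

# Plain spaCy: every token stores its trailing whitespace, so the
# original input string can always be reconstructed exactly.
nlp = spacy.blank("de")
doc = nlp("Wir gehen zum Haus.")

assert "".join(token.text_with_ws for token in doc) == doc.text
```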
It seems that the problem is rooted in the function `get_tokens_with_heads(self, snlp_doc)` in `language.py`: it iterates over `token.words`, which in the case of a contracted token contains the two expanded words (see the sketch after the two JSON dumps below). Cf. the Token object:
```json
[
  {
    "id": "2-3",
    "text": "zum",
    "ner": "O",
    "misc": "start_char=4|end_char=7"
  },
  {
    "id": "2",
    "text": "zu",
    "lemma": "zu",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 4,
    "deprel": "case"
  },
  {
    "id": "3",
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Neut|Number=Sing|PronType=Art",
    "head": 4,
    "deprel": "det"
  }
]
```
... and the Token.words:
```json
[
  {
    "id": "2",
    "text": "zu",
    "lemma": "zu",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 4,
    "deprel": "case"
  },
  {
    "id": "3",
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Neut|Number=Sing|PronType=Art",
    "head": 4,
    "deprel": "det"
  }
]
```
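The iteration that produces this looks roughly like the following (a simplified sketch of the logic in `language.py`, not the actual implementation; the head arithmetic is omitted):

```python
def get_tokens_with_heads(snlp_doc):
    # Simplified sketch: collect the words of every token in the
    # StanfordNLP doc. For a contraction like "zum", token.words is
    # the expanded pair ("zu", "dem"), so the contracted surface form
    # "zum", which carries the NER tag, is never visited at all.
    tokens = []
    for sentence in snlp_doc.sentences:
        for token in sentence.tokens:
            for word in token.words:
                tokens.append(word)
    return tokens
```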
As the two JSON dumps show, the NER information is on the contracted form (which is not part of Token.words); hence the alignment mismatch. One would therefore have to decide which tokenisation spaCy should take and what information to pass on. For the case at hand, one could for example create something like:
```json
{
  "id": "2",
  "text": "zum",
  "lemma": "zu dem",
  "upos": "ADP",
  "xpos": "APPR",
  "head": 3,
  "deprel": "case",
  "ner": "O",
  "misc": "start_char=4|end_char=7"
}
```
NOTE: I set the head to 3 here, since re-contracting the token means we either renumber the enumeration (like here: since we kick out one of the ids, we would need to incorporate this in the offset calculation), or – and I think this would be worse – keep no token number 3 and leave the head pointer at 4... However, I am not sure which way is preferred.
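For illustration, a re-contraction step along those lines could look like this (a rough sketch over the JSON dicts above; `merge_contraction` and its merging policy are assumptions, not existing code):

```python
def merge_contraction(surface, words):
    # Illustrative only: collapse the expanded words of a contraction
    # back into one token, keeping the surface text and the NER tag.
    first = words[0]
    return {
        "id": first["id"],
        "text": surface["text"],                       # "zum"
        "lemma": " ".join(w["lemma"] for w in words),  # "zu dem"
        # Morphology is taken from the first word here; a real
        # implementation would need a policy for combining these.
        "upos": first["upos"],
        "xpos": first["xpos"],
        # Renumber the head: one id is removed by the merge. This is
        # only correct when the head follows the contraction, as with
        # "zum" (head 4 becomes 3).
        "head": first["head"] - (len(words) - 1),
        "deprel": first["deprel"],
        "ner": surface.get("ner", "O"),
        "misc": surface.get("misc", ""),
    }
```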
Just going through some older issues...
I think this is fixed at the latest in spacy-stanza v1. I can't reproduce it with v1.0.4.
Please feel free to reopen if you're still running into issues!
I downloaded and loaded the German model as described in the docs.
When parsing German contractions such as "im", "am", "zum" etc., I noticed some weird behavior. The first time around everything is fine, but on consecutive parses, some tokens are duplicated and some omitted. It's best to look at an example.
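A minimal reproduction looks roughly like this (package and class names as described in the wrapper's docs; the broken output is omitted here rather than guessed):

```python
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

# Assumes the German model was fetched via stanfordnlp.download("de")
snlp = stanfordnlp.Pipeline(lang="de")
nlp = StanfordNLPLanguage(snlp)

# The first parse of a contraction looks fine...
print([t.text for t in nlp("Wir gehen zum Haus.")])
# ...but parsing the same sentence again duplicates/omits tokens.
print([t.text for t in nlp("Wir gehen zum Haus.")])
```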
Parsing other contractions afterwards also doesn't work.
But parsing other sentences with no contractions is alright.
Here is a list of German contractions. However, it doesn't break for all of them, as some are not split into separate tokens.
P.S.: It shouldn't really matter, but I'm running this in a Jupyter notebook.