explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Issue with Whitespaces for German #20

Closed: g3rfx closed this issue 4 years ago

g3rfx commented 4 years ago

Hello,

it seems there is an issue with the trailing whitespace of tokens for some languages, e.g. German.

import sys
import traceback
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

def stanford_tokenizer(text, language):
    # Build a StanfordNLP pipeline for the requested language and wrap it
    # so that processing returns a spaCy Doc.
    snlp = stanfordnlp.Pipeline(lang=language)
    nlp = StanfordNLPLanguage(snlp)
    try:
        doc = nlp(text)
        # text_with_ws should reproduce each token with its original trailing whitespace
        tokenized_doc = "".join(token.text_with_ws for token in doc)
    except Exception:
        traceback.print_exc()
        sys.exit()
    return tokenized_doc

text = """ Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. In diesem Sommer macht sie einen Sprachkurs in Freiburg. 
Das ist eine Universitätsstadt im Süden von Deutschland. Es gefällt ihr hier sehr gut. Morgens um neun beginnt der Unterricht, um vierzehn Uhr ist er zu Ende.
In ihrer Klasse sind außer Juliana noch 14 weitere Schüler, acht Mädchen und sechs Jungen. Sie kommen alle aus Frankreich, aber nicht aus Paris.
"""

tokenized_text = stanford_tokenizer(text, "de")
print(tokenized_text)

Output: Juliana kommt aus Paris . Das ist die Hauptstadt von Frankreich . In diesem Sommer macht sie einen Sprachkurs in Freiburg . Das ist eine Universitätsstadt in dem Süden von Deutschland . Es gefällt ihr hier sehr gut . Morgens um neun beginnt der Unterricht , um vierzehn Uhr ist er zu Ende . In ihrer Klasse sind außer Juliana noch 14 weitere Schüler , acht Mädchen und sechs Jungen . Sie kommen alle aus Frankreich , aber nicht aus Paris .

As you can see, the sentence-final periods are attached with an additional whitespace after the last token of each sentence. The same holds for other punctuation symbols, whereas spaCy would normally detect whether a token actually has a trailing whitespace.
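For contrast, plain spaCy tokenization round-trips the input exactly; a minimal check with a blank German pipeline (no model download needed) shows the behaviour I would expect here:

import spacy

nlp = spacy.blank("de")
doc = nlp("Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich.")

# spaCy's tokenization is non-destructive: joining text_with_ws reproduces
# the original input exactly, trailing whitespace included.
assert "".join(token.text_with_ws for token in doc) == doc.text
print([token.whitespace_ for token in doc])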

Source of sample text: https://lingua.com/german/reading/

ines commented 4 years ago

This is kind of expected at the moment if the tokenization isn't fully aligned with the original text. In spaCy, tokenization is non-destructive, so the output will always match the original input. This is not the case for the StanfordNLP models, which may produce destructive tokenization like "zum" → ["zu", "dem"]. In your case, the problem may be that the newlines are swallowed.

The whitespace information also isn't included, so this wrapper needs to try to reconstruct it from the original text. This works okay if the output matches the input, but it becomes more difficult if the output differs. So we currently only check whether the tokens align and, if not, default the whitespace to True for all tokens:

https://github.com/explosion/spacy-stanfordnlp/blob/aa7371165778cc281536491f572eb3ff71f3c5f3/spacy_stanfordnlp/language.py#L164-L170
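Roughly, that fallback amounts to the sketch below. This is my own paraphrase of the idea behind the linked lines, not the actual wrapper code, and guess_spaces is a hypothetical helper name:

def guess_spaces(text, token_texts):
    # If the tokens, stripped of whitespace, reproduce the input, walk the
    # original text to recover each token's trailing space; otherwise fall
    # back to a space after every token.
    if "".join(token_texts) != "".join(text.split()):
        return [True] * len(token_texts)
    spaces = []
    offset = 0
    for tok in token_texts:
        offset = text.index(tok, offset) + len(tok)
        spaces.append(offset < len(text) and text[offset] == " ")
    return spaces

print(guess_spaces("Das ist gut.", ["Das", "ist", "gut", "."]))  # [True, True, False, False]
print(guess_spaces("zum Beispiel", ["zu", "dem", "Beispiel"]))   # [True, True, True] (misaligned)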

There's probably some room for improvement here: we could add more logic to try to resolve as much of the alignment as possible, and normalise the whitespace to handle examples like the one you posted. We actually just wrote some more complex alignment logic for spacy-pytorch-transformers, so maybe we can repurpose some of that.
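One possible direction (an assumption about how that could look, not something the wrapper ships) is to normalise whitespace before reconstructing trailing spaces, so that newlines in the raw text behave like ordinary spaces during the scan:

import re

def trailing_spaces(text, token_texts):
    # Collapse any whitespace run (spaces, newlines, tabs) into a single
    # space, then recover each token's trailing space from the normalised text.
    text = re.sub(r"\s+", " ", text).strip()
    spaces, offset = [], 0
    for tok in token_texts:
        offset = text.index(tok, offset) + len(tok)
        spaces.append(offset < len(text) and text[offset] == " ")
    return spaces

raw = "Es gefällt ihr hier sehr gut.\nMorgens um neun beginnt der Unterricht."
toks = ["Es", "gefällt", "ihr", "hier", "sehr", "gut", ".",
        "Morgens", "um", "neun", "beginnt", "der", "Unterricht", "."]
print(trailing_spaces(raw, toks))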

g3rfx commented 4 years ago

Thank you very much for your fast response. The point of this issue is that I can only observe it for German; for English, e.g., the whitespace in the returned text is fine. Interestingly, I also observed this issue with spacy-udpipe, which, to the best of my understanding, is based on the code of spacy-stanfordnlp. So, I am a bit confused as I do not see any special handling for German in the code base that could lead to this problem.

ines commented 4 years ago

So, I am a bit confused as I do not see any special handling for German in the code base that could lead to this problem.

I think the explanation lies in the data: both UDPipe and StanfordNLP train on the German Universal Dependencies data, and that data includes destructive tokenization like "zum" → ["zu", "dem"]. So this is also what the model outputs. Because the produced tokens don't match the original text, spaCy can't easily reconstruct the original whitespace, at least not using the simple alignment logic.
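A tiny illustration of the effect (hypothetical values, but the same thing is visible in your output above, where the input's "im Süden" comes back as "in dem Süden"): once the model expands a contraction, the joined tokens no longer match the whitespace-stripped input, so the simple alignment check fails and every token falls back to a trailing space:

original = "im Süden von Deutschland"                        # what the input text contains
model_tokens = ["in", "dem", "Süden", "von", "Deutschland"]  # UD-style expansion of "im"

print("".join(model_tokens))                                # 'indemSüdenvonDeutschland'
print("".join(original.split()))                            # 'imSüdenvonDeutschland'
print("".join(model_tokens) == "".join(original.split()))   # False -> whitespace defaults to True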

g3rfx commented 4 years ago

Interesting point. Thanks a lot for your explanation. So I think this issue is not fixable. Therefore, I am fine with closing this issue.