Issue with Whitespaces for German

Hello,

It seems there is an issue with the trailing whitespace of tokens for some languages, e.g., German.

import sys
import traceback
import spacy_udpipe

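def udpipe_tokenizer(doc, language):
    # Load the UDPipe model for the given language code (e.g. "de").
    nlp = spacy_udpipe.load(language)
    try:
        doc = nlp(doc)
        # Rebuild the text from each token plus its trailing whitespace.
        tokenized_doc = "".join(token.text_with_ws for token in doc)
    except Exception:
        traceback.print_exc()
        sys.exit(1)
    return tokenized_doc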

text = """ Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. In diesem Sommer macht sie einen Sprachkurs in Freiburg. 
Das ist eine Universitätsstadt im Süden von Deutschland. Es gefällt ihr hier sehr gut. Morgens um neun beginnt der Unterricht, um vierzehn Uhr ist er zu Ende.
In ihrer Klasse sind außer Juliana noch 14 weitere Schüler, acht Mädchen und sechs Jungen. Sie kommen alle aus Frankreich, aber nicht aus Paris.
    """

tokenized_text = udpipe_tokenizer(text, "de")
print(tokenized_text)

Output:

Juliana kommt aus Paris . Das ist die Hauptstadt von Frankreich . In diesem Sommer macht sie einen Sprachkurs in Freiburg . Das ist eine Universitätsstadt in dem Süden von Deutschland . Es gefällt ihr hier sehr gut . Morgens um neun beginnt der Unterricht , um vierzehn Uhr ist er zu Ende . In ihrer Klasse sind außer Juliana noch 14 weitere Schüler , acht Mädchen und sechs Jungen . Sie kommen alle aus Frankreich , aber nicht aus Paris .

As one can see, an extra space is inserted between the last token of each sentence and the period that follows it. The same happens with other punctuation marks, whereas spaCy's own tokenizer would detect whether a token is actually followed by whitespace.

(Source of sample text: https://lingua.com/german/reading/)

I tested various text samples in Spanish, English and Croatian. The issue is reproducible only on some German text samples. The only logical explanation I see is the underlying UDPipe model for German, and I am afraid there is not much I can do about that, given that spacy-udpipe is a wrapper library. You mention that the problem appears "in case of, e.g., German". Have you by any chance encountered the same issue for any other language?

I don't think this is a UDPipe issue. Tokenisation works perfectly with plain UDPipe. UDPipe records entries such as SpaceAfter=No in the misc field, and I don't think this is taken into account when one uses text_with_ws.

Let me show this with the R wrapper to UDPipe instead:

library(udpipe)
x <- udpipe(x = "Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. In diesem Sommer macht sie einen Sprachkurs in Freiburg. Das ist eine Universitätsstadt im Süden von Deutschland. Es gefällt ihr hier sehr gut. Morgens um neun beginnt der Unterricht, um vierzehn Uhr ist er zu Ende. In ihrer Klasse sind außer Juliana noch 14 weitere Schüler, acht Mädchen und sechs Jungen. Sie kommen alle aus Frankreich, aber nicht aus Paris.", "german")
x[, c("sentence_id", "token_id", "token", "misc")]
 sentence_id token_id             token            misc
           1        1           Juliana            <NA>
           1        2             kommt            <NA>
           1        3               aus            <NA>
           1        4             Paris   SpaceAfter=No
           1        5                 .            <NA>
           2        1               Das            <NA>
           2        2               ist            <NA>
           2        3               die            <NA>
           2        4        Hauptstadt            <NA>
           2        5               von            <NA>
           2        6        Frankreich   SpaceAfter=No
           2        7                 .            <NA>
           3        1                In            <NA>
           3        2            diesem            <NA>
           3        3            Sommer            <NA>
           3        4             macht            <NA>
           3        5               sie            <NA>
           3        6             einen            <NA>
           3        7        Sprachkurs            <NA>
           3        8                in            <NA>
           3        9          Freiburg   SpaceAfter=No
           3       10                 .            <NA>
           4        1               Das            <NA>
           4        2               ist            <NA>
           4        3              eine            <NA>
           4        4 Universitätsstadt            <NA>
           4      5-6                im            <NA>
           4        5                in            <NA>
           4        6               dem            <NA>
           4        7             Süden            <NA>
           4        8               von            <NA>
           4        9       Deutschland   SpaceAfter=No
           4       10                 .            <NA>
           5        1                Es            <NA>
           5        2           gefällt            <NA>
           5        3               ihr            <NA>
           5        4              hier            <NA>
           5        5              sehr            <NA>
           5        6               gut   SpaceAfter=No
           5        7                 .            <NA>
           6        1           Morgens            <NA>
           6        2                um            <NA>
           6        3              neun            <NA>
           6        4           beginnt            <NA>
           6        5               der            <NA>
           6        6        Unterricht   SpaceAfter=No
           6        7                 ,            <NA>
           6        8                um            <NA>
           6        9          vierzehn            <NA>
           6       10               Uhr            <NA>
           6       11               ist            <NA>
           6       12                er            <NA>
           6       13                zu            <NA>
           6       14              Ende   SpaceAfter=No
           6       15                 .            <NA>
           7        1                In            <NA>
           7        2             ihrer            <NA>
           7        3            Klasse            <NA>
           7        4              sind            <NA>
           7        5             außer            <NA>
           7        6           Juliana            <NA>
           7        7              noch            <NA>
           7        8                14            <NA>
           7        9           weitere            <NA>
           7       10           Schüler   SpaceAfter=No
           7       11                 ,            <NA>
           7       12              acht            <NA>
           7       13           Mädchen            <NA>
           7       14               und            <NA>
           7       15             sechs            <NA>
           7       16            Jungen   SpaceAfter=No
           7       17                 .            <NA>
           8        1               Sie            <NA>
           8        2            kommen            <NA>
           8        3              alle            <NA>
           8        4               aus            <NA>
           8        5        Frankreich   SpaceAfter=No
           8        6                 ,            <NA>
           8        7              aber            <NA>
           8        8             nicht            <NA>
           8        9               aus            <NA>
           8       10             Paris   SpaceAfter=No
           8       11                 . SpacesAfter=\\n

If you want to reconstruct the original text, you need to take that misc information into account.
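For illustration, such a reconstruction could look roughly like the Python sketch below. The function name `detokenize` and the row layout `(sentence_id, token_id, token, misc)` are hypothetical and simply mirror the columns printed above; they are not part of udpipe or spacy-udpipe. The sketch assumes CoNLL-U conventions: SpaceAfter=No in misc means no space follows the token, and a range id such as 5-6 is a multi-word token whose surface form ("im") stands in for the words it covers ("in", "dem").

```python
def detokenize(rows):
    """Rebuild running text from (sentence_id, token_id, token, misc) rows."""
    pieces = []
    # (sentence_id, last word id covered by the current multi-word token)
    covered = (None, 0)
    for sentence_id, token_id, token, misc in rows:
        if "-" in token_id:
            # Multi-word token such as "5-6": keep its surface form.
            covered = (sentence_id, int(token_id.split("-")[1]))
        elif covered[0] == sentence_id and int(token_id) <= covered[1]:
            # Word already represented by the surface form above; skip it.
            continue
        pieces.append(token)
        # Assumption: a single space follows unless misc says SpaceAfter=No.
        if "SpaceAfter=No" not in (misc or ""):
            pieces.append(" ")
    return "".join(pieces).rstrip()


# Sentences 4 and 5 from the table above (None stands for <NA>).
rows = [
    (4, "1", "Das", None),
    (4, "2", "ist", None),
    (4, "3", "eine", None),
    (4, "4", "Universitätsstadt", None),
    (4, "5-6", "im", None),
    (4, "5", "in", None),
    (4, "6", "dem", None),
    (4, "7", "Süden", None),
    (4, "8", "von", None),
    (4, "9", "Deutschland", "SpaceAfter=No"),
    (4, "10", ".", None),
    (5, "1", "Es", None),
    (5, "2", "gefällt", None),
    (5, "3", "ihr", None),
    (5, "4", "hier", None),
    (5, "5", "sehr", None),
    (5, "6", "gut", "SpaceAfter=No"),
    (5, "7", ".", None),
]

text = detokenize(rows)
print(text)
```

Note that this sketch ignores the SpacesAfter=\n entry on the last token and treats it like an ordinary space, so it would not reproduce line breaks from the input.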

Thank you for investigating this issue. The strange thing is that I could observe it for German only. Funnily enough, I also observed the same issue with the spacy-stanfordnlp package (see: https://github.com/explosion/spacy-stanfordnlp/issues/20), which of course uses the respective StanfordNLP model for German. I would really appreciate it if whitespace could be handled as in the R wrapper, as @jwijffels demonstrated :) The affected attributes are basically token.text_with_ws and token.whitespace_, and maybe also span.whitespace_, since UDPipe seems to support sentence segmentation as well.

Both spacy-stanfordnlp and spacy-udpipe construct a Doc object using the words and spaces arguments. An alternative would be to use the orths_and_spaces argument instead and make use of the SpaceAfter=No annotation mentioned by @jwijffels, which would hopefully handle whitespace correctly.
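As a rough sketch of that idea, the spaces list could be derived from the misc field as shown below. The helper `words_and_spaces` and the `(text, misc)` pairs are hypothetical and only stand in for whatever the wrapper receives from UDPipe; this is not how either library is currently implemented. The final join mimics what concatenating token.text_with_ws yields once the spaces are set this way.

```python
def words_and_spaces(tokens):
    """Split (text, misc) pairs into parallel `words` and `spaces` lists.

    spaces[i] is False when misc carries SpaceAfter=No, True otherwise.
    """
    words = [text for text, _ in tokens]
    spaces = ["SpaceAfter=No" not in (misc or "") for _, misc in tokens]
    return words, spaces


# First sentence of the sample; None stands for an empty misc field.
tokens = [
    ("Juliana", None),
    ("kommt", None),
    ("aus", None),
    ("Paris", "SpaceAfter=No"),
    (".", None),
]

words, spaces = words_and_spaces(tokens)

# Same concatenation that joining token.text_with_ws performs.
rebuilt = "".join(w + (" " if s else "") for w, s in zip(words, spaces))
print(repr(rebuilt))

# With spaCy installed, the two lists could then be passed on as
# Doc(vocab, words=words, spaces=spaces).
```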

As you mentioned in your first response, the issue strangely happens for some German texts only. In the following code snippet, the first two samples in the texts list are returned correctly for me, but the last two are not (apologies for the formatting; I did not insert any line breaks into the samples so as not to distort the results):

import spacy_udpipe
from tqdm import tqdm
import sys
import traceback

def udpipe_tokenizer(docs, language):
    tokenized_docs = []
    nlp = spacy_udpipe.load(language)
    try:
        # Process the documents as a stream, with a progress bar.
        for doc in nlp.pipe(tqdm(docs, unit="doc", desc="Udpipe tokenizer")):
            # Rebuild each text from its tokens plus trailing whitespace.
            tokenized_docs.append("".join(token.text_with_ws for token in doc))
    except Exception:
        traceback.print_exc()
        sys.exit(1)
    return tokenized_docs

if __name__ == "__main__":
    texts = ["Nähere Informationen zu den unterschiedlichen Garantierarten können Sie hier nachlesen. Die angegebene Herstellergarantie gilt mindestens deutschlandweit. Die Kontaktdaten für ihre Garantie entnehmen Sie bitte unseren AGB. Gesetzliche Gewährleistungsrechte werden durch eine zusätzliche Herstellergarantie nicht eingeschränkt.",
    "Der Verantwortliche trifft geeignete Maßnahmen, um der betroffenen Person alle Informationen gemäß den Artikeln 13 und 14 und alle Mitteilungen gemäß den Artikeln 15 bis 22 und Artikel 34, die sich auf die Verarbeitung beziehen, in präziser, transparenter, verständlicher und leicht zugänglicher Form in einer klaren und einfachen Sprache zu übermitteln; dies gilt insbesondere für Informationen, die sich speziell an Kinder richten. Die Übermittlung der Informationen erfolgt schriftlich oder in anderer Form, gegebenenfalls auch elektronisch.",
    "Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. In diesem Sommer macht sie einen Sprachkurs in Freiburg. Das ist eine Universitätsstadt im Süden von Deutschland. Es gefällt ihr hier sehr gut. Morgens um neun beginnt der Unterricht, um vierzehn Uhr ist er zu Ende. In ihrer Klasse sind außer Juliana noch 14 weitere Schüler, acht Mädchen und sechs Jungen. Sie kommen alle aus Frankreich, aber nicht aus Paris.", 
    "Alle Inhalte, insbesondere die Texte und Bilder von Agenturen, sind urheberrechtlich geschützt und dürfen nur im Rahmen der gewöhnlichen Nutzung des Angebots vervielfältigt, verbreitet oder sonst genutzt werden."]
    outputs = udpipe_tokenizer(texts, "de")
    for output in outputs:
        print(output + "\n")

The output on my Ubuntu 16.04.6 machine with Python 3.6.9 is:

Nähere Informationen zu den unterschiedlichen Garantierarten können Sie hier nachlesen. Die angegebene Herstellergarantie gilt mindestens deutschlandweit. Die Kontaktdaten für ihre Garantie entnehmen Sie bitte unseren AGB. Gesetzliche Gewährleistungsrechte werden durch eine zusätzliche Herstellergarantie nicht eingeschränkt.

Der Verantwortliche trifft geeignete Maßnahmen, um der betroffenen Person alle Informationen gemäß den Artikeln 13 und 14 und alle Mitteilungen gemäß den Artikeln 15 bis 22 und Artikel 34, die sich auf die Verarbeitung beziehen, in präziser, transparenter, verständlicher und leicht zugänglicher Form in einer klaren und einfachen Sprache zu übermitteln; dies gilt insbesondere für Informationen, die sich speziell an Kinder richten. Die Übermittlung der Informationen erfolgt schriftlich oder in anderer Form, gegebenenfalls auch elektronisch.

Juliana kommt aus Paris . Das ist die Hauptstadt von Frankreich . In diesem Sommer macht sie einen Sprachkurs in Freiburg . Das ist eine Universitätsstadt in dem Süden von Deutschland . Es gefällt ihr hier sehr gut . Morgens um neun beginnt der Unterricht , um vierzehn Uhr ist er zu Ende . In ihrer Klasse sind außer Juliana noch 14 weitere Schüler , acht Mädchen und sechs Jungen . Sie kommen alle aus Frankreich , aber nicht aus Paris .

Alle Inhalte , insbesondere die Texte und Bilder von Agenturen , sind urheberrechtlich geschützt und dürfen nur in dem Rahmen der gewöhnlichen Nutzung des Angebots vervielfältigt , verbreitet oder sonst genutzt werden .

I thought there might be a formatting issue in the samples, but I cannot find one. ~Maybe there is actually something wrong with the UDPipe model for German.~

TakeLab / spacy-udpipe

Issue with Whitespaces for German #4