Words splitted by tokenizer

mustaszewski commented 6 years ago

I have noticed that when using, for example, the Polish model (https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master/inst/udpipe-ud-2.0-170801/polish-ud-2.0-170801.udpipe), the tokenizer splits certain complex verbs. For example, the token "Chciałbym" is split into three parts, i.e. "Chciał", "by" and "m", each of which is identified as a separate token with its own token ID, lemma and POS information. The original word token as it appears in the text, in this example "Chciałbym", however, receives no lemma or POS information. For clarity, I have pasted the annotated data frame and attached a screenshot.

    paragraph_id  sentence_id  token_id  token         lemma         upos  xpos
1:  1             2            1-3       Chciałbym     NA            NA    NA
2:  1             2            1         Chciał        chcieć        VERB  praet:sg:m1:imperf
3:  1             2            2         by            być           AUX   qub
4:  1             2            3         m             być           AUX   aglt:sg:pri:imperf:nwok
5:  1             2            4         w             w             ADP   prep:acc:nwok
6:  1             2            5         sposób        sposób        NOUN  subst:sg:acc:m3
7:  1             2            6         bardzo        bardzo        ADV   adv:pos
8:  1             2            7         jednoznaczny  jednoznaczny  ADJ   adj:sg:nom:m3:pos

(Attached image: screenshot from 2018-02-22 16-40-53)

Is there a way to suppress this behaviour, thus preventing the tokenizer from splitting such verbs? I am only interested in the original form of such words (i.e. "Chciałbym"), without the suffixes being cut off from the verb and tagged/lemmatised independently.

jwijffels commented 6 years ago

Indeed, these are called multi-word tokens; you can see this in the token_id. It says 1-3 for Chciałbym, indicating a multi-word token that spans token ids 1 through 3. It is a property of the CoNLL-U format that a multi-word token is output next to the words which compose it, and that is exactly what the UDPipe models are trained on. There is no option to turn this off. If you want the POS tag/lemma/... for the multi-word token, what you can do is take the POS tags/lemmas from tokens 1 to 3 and decide what the final POS tag/lemma should be.
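A minimal sketch of that post-processing idea, written in Python over plain dictionaries rather than against the udpipe R API. The column names follow the data frame above; the function name `collapse_multiword_tokens` is hypothetical, and the rule "copy lemma/upos/xpos from the first component" is only one possible decision, assumed here for illustration. It also assumes that the range row (e.g. `1-3`) appears before its component rows, as in CoNLL-U.

```python
import re

# Matches range ids such as "1-3" that mark a multi-word token
RANGE_ID = re.compile(r"^(\d+)-(\d+)$")

def collapse_multiword_tokens(rows):
    """Keep each multi-word token as a single row and drop its components.

    Assumption (not part of udpipe): the multi-word token inherits
    lemma/upos/xpos from its first component.
    """
    # Index ordinary rows by (paragraph_id, sentence_id, integer token id)
    by_id = {}
    for row in rows:
        if row["token_id"].isdigit():
            key = (row["paragraph_id"], row["sentence_id"], int(row["token_id"]))
            by_id[key] = row

    out = []
    covered = set()  # component ids already represented by a range row
    for row in rows:
        sent = (row["paragraph_id"], row["sentence_id"])
        tid = row["token_id"]
        match = RANGE_ID.match(tid)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            head = by_id[sent + (first,)]
            out.append(dict(row, lemma=head["lemma"],
                            upos=head["upos"], xpos=head["xpos"]))
            covered.update(sent + (i,) for i in range(first, last + 1))
        elif tid.isdigit() and sent + (int(tid),) in covered:
            continue  # component of a multi-word token: drop it
        else:
            out.append(row)
    return out

def make_row(token_id, token, lemma, upos, xpos):
    # All example rows belong to paragraph 1, sentence 2, as in the thread
    return {"paragraph_id": 1, "sentence_id": 2, "token_id": token_id,
            "token": token, "lemma": lemma, "upos": upos, "xpos": xpos}

# First five rows of the annotated data frame; None stands in for NA
rows = [
    make_row("1-3", "Chciałbym", None, None, None),
    make_row("1", "Chciał", "chcieć", "VERB", "praet:sg:m1:imperf"),
    make_row("2", "by", "być", "AUX", "qub"),
    make_row("3", "m", "być", "AUX", "aglt:sg:pri:imperf:nwok"),
    make_row("4", "w", "w", "ADP", "prep:acc:nwok"),
]

result = collapse_multiword_tokens(rows)
for r in result:
    print(r["token_id"], r["token"], r["lemma"], r["upos"])
```

Taking the values from the first component keeps the verb's lemma and tag for forms like "Chciałbym"; for other languages or constructions a different rule may be more appropriate.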

mustaszewski commented 6 years ago

Thanks for your answer. That's what I suspected based on the udpipe manual and the CoNLL format specifications. Although splitting multi-word tokens is sensible from the perspective of linguistic theory, in some scenarios one wants to keep the entire multi-word token. I have successfully written a post-processing function that removes the split-off suffixes and assigns the corresponding POS/lemma values to the unsplit tokens, but since I'm rather new to coding and R, I'm not sharing it here because it is probably not a very clean solution.

bnosac / udpipe

Words splitted by tokenizer #17