TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Issues with dependency tags for pretokenized text #30

Closed Nithin-Holla closed 3 years ago

Nithin-Holla commented 3 years ago

When using pretokenized text, the dependency tags turn out to be ROOT for all the tokens. However, this is not the case when tagging raw text directly.

Here's a code sample replicating this issue for Swedish:

import spacy_udpipe
spacy_udpipe.download('sv')
sv = spacy_udpipe.load('sv')

text = "Världshandelsorganisationen arbetar med reglering av handel mellan deltagarländerna."
tokens = [token.text for token in sv.tokenizer(text)]

print('Tagging text directly: ', [token.dep_ for token in sv(text)])
print('Tagging pretokenized text: ', [token.dep_ for token in sv(tokens)])

The output is the following:

Tagging text directly:  ['nsubj', 'ROOT', 'case', 'obl', 'case', 'nmod', 'case', 'nmod', 'punct']
Tagging pretokenized text:  ['ROOT', 'ROOT', 'ROOT', 'ROOT', 'ROOT', 'ROOT', 'ROOT', 'ROOT', 'ROOT']
asajatovic commented 3 years ago

@Nithin-Holla The fix is really simple: for pre-tokenized text, the input has to be a list of lists of strings (in your case, [tokens]), as described in the documentation. I've been quite busy lately, so sorry it took me a while to get to such an easy "fix". :sweat_smile:
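
For reference, a minimal sketch of the corrected call, reusing the sv pipeline and tokens list from the reproduction above; the only change is wrapping tokens in an outer list so it is treated as a single pre-tokenized sentence rather than a list of one-word sentences:

# Pre-tokenized input: a list of sentences, each a list of token strings.
doc = sv([tokens])
print('Tagging pretokenized text: ', [token.dep_ for token in doc])
# This should now match the raw-text output above, e.g.
# ['nsubj', 'ROOT', 'case', 'obl', 'case', 'nmod', 'case', 'nmod', 'punct']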

Nithin-Holla commented 3 years ago

@asajatovic Yes, thanks!