TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Allow pre-tokenised text #18

Closed BramVanroy closed 4 years ago

BramVanroy commented 4 years ago

Similar to https://github.com/TakeLab/spacy-udpipe/issues/13, it would be nice to have an option to disable the tokenizer and pass tokens (a list of strings) directly as input to the rest of the pipeline. For instance, in spaCy, we can easily swap out the tokenizer:

nlp.tokenizer = nlp.tokenizer.tokens_from_list

This would be helpful!

It would also be great if this could be combined with the aforementioned issue (https://github.com/TakeLab/spacy-udpipe/issues/13) so that you can pass text that is both pretokenized and presegmented into sentences.
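
To make the request concrete, here is a rough, library-free sketch of the behaviour I have in mind. It is not spaCy or spacy-udpipe code: `SimpleDoc`, `Pipeline`, `whitespace_tokenizer` and `pretokenized_tokenizer` are hypothetical names made up for illustration. The idea is that the tokenizer is a swappable callable, and the replacement accepts either a flat list of tokens or a list of token lists (one per sentence) and leaves the tokens untouched.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a Doc: tokens plus sentence-start flags."""
    words: List[str]
    sent_starts: List[bool]


def whitespace_tokenizer(text: str) -> SimpleDoc:
    """Default behaviour in this sketch: split raw text on whitespace."""
    words = text.split()
    return SimpleDoc(words, [i == 0 for i in range(len(words))])


def pretokenized_tokenizer(sentences) -> SimpleDoc:
    """Pass-through tokenizer: keep the caller's tokens and sentence splits.

    Accepts a flat list of tokens (treated as one sentence) or a list of
    token lists (one list per sentence).
    """
    if sentences and isinstance(sentences[0], str):
        sentences = [sentences]  # flat token list -> single sentence
    words: List[str] = []
    sent_starts: List[bool] = []
    for sentence in sentences:
        for i, token in enumerate(sentence):
            words.append(token)
            sent_starts.append(i == 0)  # first token opens a sentence
    return SimpleDoc(words, sent_starts)


class Pipeline:
    """Hypothetical pipeline whose tokenizer can be replaced by assignment."""

    def __init__(self, tokenizer: Callable[..., SimpleDoc]):
        self.tokenizer = tokenizer

    def __call__(self, data) -> SimpleDoc:
        # Tagging, parsing, etc. would run on the returned doc here.
        return self.tokenizer(data)


nlp = Pipeline(whitespace_tokenizer)
nlp.tokenizer = pretokenized_tokenizer  # swap, as in the spaCy line above
doc = nlp([["Hello", "world", "!"], ["Bye", "."]])
```

In this sketch the sentence boundaries come entirely from the nesting of the input lists, so no segmentation step runs at all. How this would map onto the actual UDPipe calls is left open.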

asajatovic commented 4 years ago

Enabled in #19