bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

parsing a file with the input format 'conllu' #61

Closed FafaSame closed 5 years ago

FafaSame commented 5 years ago

Is it possible to parse files that are already tokenized and stored in CoNLL-U format? The UDPipe online service supports this: there you can set the input format to 'CoNLL-U'. An example of my input (token id followed by token) is given below:

1 In
2 the
3 end
4 ,
5 we
6 should
7 acknowledge
8 that
9 currently
10 expenses
11 are
12 still
13 quite
14 high
15 .

jwijffels commented 5 years ago

Yes, you can. Pass the input format through the tokenizer argument of udpipe_annotate, as follows:

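library(udpipe)
# Download the English model and load it into memory
model <- udpipe_download_model("english-ewt")
model <- udpipe_load_model(model$file_model)

# Example CoNLL-U file shipped with the package
file <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
x <- readLines(file)
#x <- readLines("yourownfile.conllu")
# udpipe_annotate expects the whole file as one string
x <- paste(x, collapse = "\n")
cat(x)
# tokenizer = "conllu=v2" makes udpipe read x as CoNLL-U instead of tokenizing raw text
x <- udpipe_annotate(model, x = x, tokenizer = "conllu=v2")
cat(x$conllu)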
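
If your tokens are not yet in a CoNLL-U file, the sketch below shows one way to build the string yourself in base R. Note that tokens_to_conllu is a hypothetical helper written for this illustration, not a udpipe function. It assumes that the standard CoNLL-U layout is what you need: 10 tab-separated columns per token, "_" for every field that is not annotated yet, and a blank line after each sentence.

```r
# Hypothetical helper (not part of udpipe): turn pre-tokenized sentences
# into a CoNLL-U string. `sentences` is a list of character vectors,
# one vector of tokens per sentence.
tokens_to_conllu <- function(sentences) {
  blocks <- vapply(sentences, function(tokens) {
    ids <- seq_along(tokens)
    # 10 tab-separated columns: ID, FORM, then "_" for the 8 empty fields
    empty <- paste(rep("_", 8), collapse = "\t")
    rows <- paste(ids, tokens, empty, sep = "\t")
    paste(rows, collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  # A blank line terminates every sentence, including the last one
  paste0(paste(blocks, collapse = "\n\n"), "\n\n")
}

sentences <- list(
  c("In", "the", "end", ",", "we", "should", "acknowledge", "that",
    "currently", "expenses", "are", "still", "quite", "high", ".")
)
conllu <- tokens_to_conllu(sentences)
cat(conllu)

# With the model loaded as above, the string can then be passed on:
# x <- udpipe_annotate(model, x = conllu, tokenizer = "conllu=v2")
# as.data.frame(x)
```

The helper only fills in the ID and FORM columns, so the tagger and parser still have to supply lemmas, tags and dependencies.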