Closed sanjmeh closed 6 years ago
Are you using UDPipe from R? That is not one of our official bindings -- it is contributed by a UDPipe user. Therefore, I am unable to advise on what is wrong -- but the UDPipe binary would not behave like this (it creates documents only for individual files).
I am having difficulty getting the udpipe English model to annotate text into correct chunks of sentences. I have attached the raw text file (a90.txt) on which I am running udpipe_annotate.
As you can see in the next file, a90_term.txt, the CoNLL-U output contains many doc_ids for the same doc. I do not understand why doc_id changes between lines of text.
The above two are tagged as two sentences although they are part of the same sentence. The first part is tagged as doc 4, para 1, sentence 2. The next line is tagged as doc 5, para 1, sentence 1.
Following commands were used to generate the files.
tagger <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")
udpipe_annotate(object = tagger, x = a90_facts) %>% as.data.table ....
where a90_facts is the object containing the raw character vector. The same vector is dumped in the file a90.txt (attached).
Attachments: a90_term.txt a90.txt
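One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the R package, udpipe_annotate treats each element of the character vector x as a separate document and assigns it its own doc_id. If a90_facts holds one element per line of the file (for example because it was read with readLines), a sentence that spans two lines would be split across two documents. The sketch below collapses the vector into a single string before annotating. The sample text and the names a90_text, model_file, and the doc_id value "a90" are hypothetical, and the annotation step only runs if the udpipe package and the model file are available.

```r
# Hypothetical stand-in for a90_facts: one element per line of the raw text,
# which is what readLines() would produce.
a90_facts <- c("The first half of a sentence that continues",
               "on the next line. A second sentence follows.")

# Collapse the lines into one string so the whole text is a single document.
a90_text <- paste(a90_facts, collapse = " ")

# Model file name taken from the commands above; assumed to be in the working directory.
model_file <- "english-ud-2.0-170801.udpipe"

# Only annotate when the package and the model are actually present.
if (requireNamespace("udpipe", quietly = TRUE) && file.exists(model_file)) {
  tagger <- udpipe::udpipe_load_model(file = model_file)
  # x has length 1, so a single doc_id is supplied (hypothetical label "a90").
  annotated <- udpipe::udpipe_annotate(object = tagger, x = a90_text, doc_id = "a90")
  result <- as.data.frame(annotated)
  # Inspect which document ids appear in the annotation.
  print(unique(result$doc_id))
}
```

If separate documents are wanted, the alternative would be to keep one vector element per document and pass a doc_id vector of the same length as x.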