CoNLL-UD-2017 / UFAL-UDPipe-1.2

CoNLL 2017 Shared Task Team UFAL-UDPipe-1.2
Mozilla Public License 2.0

udpipe tokenisation is chunking sentences incorrectly #1

Closed by sanjmeh 6 years ago

sanjmeh commented 6 years ago

I am having difficulty getting the udpipe English model to split text into the correct sentences. I have attached the raw text file (a90.txt) on which I am running udpipe_annotate.

As you can see in the next file, a90_term.txt, the CoNLL-U output contains many doc_ids for the same document. I do not understand why doc_id changes between lines of text.

  1. He has worked in the pharmaceutical business for over 20 years, and been
  2. resident in Frugalia for over 12.

The two lines above are tagged as two sentences although they are parts of the same sentence. The first line is tagged as doc 4, para 1, sentence 2. The next line is tagged as doc 5, para 1, sentence 1.

The following commands were used to generate the files.

tagger <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")
udpipe_annotate(object = tagger, x = a90_facts) %>% as.data.table ....

where a90_facts is the object containing the raw character vector. The same vector is dumped in the attached file a90.txt.

Attachments: a90_term.txt, a90.txt

foxik commented 6 years ago

Are you using UDPipe from R? That is not one of our official bindings -- it is contributed by a UDPipe user. Therefore, I am unable to advise on what is wrong -- but the UDPipe binary would not behave like this (it creates a new document only for each individual input file).
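
One possible explanation, offered as an assumption and not as a confirmed diagnosis: in the R udpipe package, udpipe_annotate appears to treat each element of the character vector x as a separate document, so a vector holding one line of a90.txt per element would get a new doc_id for every line. The sketch below joins the lines into one string before annotating. The two-line a90_facts is a hypothetical stand-in for the real vector, a90_text is an illustrative name, the "a90" doc_id is arbitrary, and the annotation step only runs if the udpipe package and the model file are available locally.

```r
# Hypothetical stand-in for a90_facts: one element per line of a90.txt.
a90_facts <- c(
  "He has worked in the pharmaceutical business for over 20 years, and been",
  "resident in Frugalia for over 12."
)

# Join the lines into a single string so the tokenizer sees one document
# instead of one document per line (assumed cause of the changing doc_id).
a90_text <- paste(a90_facts, collapse = " ")

# Annotate only when the package and the model file are actually present.
model_file <- "english-ud-2.0-170801.udpipe"
if (requireNamespace("udpipe", quietly = TRUE) && file.exists(model_file)) {
  tagger <- udpipe::udpipe_load_model(file = model_file)
  annotated <- udpipe::udpipe_annotate(object = tagger, x = a90_text, doc_id = "a90")
  result <- as.data.frame(annotated)
  # Inspect which sentences were found and which document they belong to.
  print(unique(result[, c("doc_id", "sentence_id", "sentence")]))
}
```

Joining with a single space discards any paragraph breaks in the original file, so this is only suitable when paragraph_id is not needed.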