bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Tokenization error using tokenizer = "horizontal" #58

Closed nyekil closed 5 years ago

nyekil commented 5 years ago

If your raw text file contains brackets, the tokenizer = "horizontal" parameter setting interprets the brackets as special symbols. Avoid this setting for raw text.

example.txt rendu.txt

The right solution: solution.txt

jwijffels commented 5 years ago

If you use the tokenizer/tagger/parser options, please consult the official UDPipe docs. The tokenizer options are documented at http://ufal.mff.cuni.cz/udpipe/users-manual#run_udpipe_tokenizer in section '1.3. Tokenizer' and section '1.4 Input Formats'.

These all have a different meaning:

library(udpipe)
udpipe_download_model("hungarian-szeged")
txt   <- "8/2013. (I. 30.) EMMI rendelet:"
model <- "hungarian-szeged-ud-2.4-190531.udpipe"
# Default: raw text, tokenised by the model's own tokenizer
udpipe(txt, model)
# The same default, spelled out
udpipe(txt, model, tokenizer = "tokenizer")
# Tokenizer option string requesting 'ranges' (see the UDPipe manual)
udpipe(txt, model, tokenizer = "tokenizer=ranges")
# Input treated as already tokenised: one sentence per line, tokens separated by spaces
udpipe(txt, model, tokenizer = "horizontal")
# Input treated as already tokenised: one token per line, empty line ends a sentence
udpipe(txt, model, tokenizer = "vertical")

If you use tokenizer = "horizontal", the docs say: "each sentence on a separate line, with tokens separated by spaces. In order to allow spaces in tokens, Unicode character 'NO-BREAK SPACE' (U+00A0) is considered part of token and converted to a space during loading". In other words, this setting assumes you have already tokenised the text with another tokeniser, so raw text is simply split on spaces. If your data is already tokenised, an example is shown at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html (see section 'My text data is already tokenised').
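
To make the "horizontal" format concrete, here is a minimal sketch of how already tokenised sentences could be turned into that input. The helper `to_horizontal()` is hypothetical (it is not part of udpipe), and the way the example sentence is split into tokens is only an assumption for illustration, not the output of any particular tokeniser.

```r
# Hypothetical helper, not part of the udpipe package: build "horizontal"
# input from sentences that are already tokenised.
# `sentences` is a list of character vectors, one vector of tokens per sentence.
to_horizontal <- function(sentences) {
  lines <- vapply(sentences, function(tokens) {
    # A plain space inside a token would be read as a token boundary, so
    # replace it with NO-BREAK SPACE (U+00A0), as the UDPipe docs describe.
    tokens <- gsub(" ", "\u00A0", tokens, fixed = TRUE)
    # Tokens of one sentence are separated by single spaces.
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
  # Each sentence goes on its own line.
  paste(lines, collapse = "\n")
}

# Assumed tokenisation of the sentence from this thread (illustration only).
sentences <- list(
  c("8/2013.", "(", "I.", "30.", ")", "EMMI", "rendelet", ":")
)
horizontal <- to_horizontal(sentences)

# With the udpipe package and the model file from above available:
# library(udpipe)
# udpipe(horizontal, "hungarian-szeged-ud-2.4-190531.udpipe", tokenizer = "horizontal")
```

Because the brackets are separate tokens here, they reach the tagger as the tokens you chose rather than being glued to neighbouring text by the split on spaces.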