Tokenization error using tokenizer = "horizontal"

If you use the tokenizer/tagger/parser options, please consult the official UDPipe documentation. The tokenizer options are documented at http://ufal.mff.cuni.cz/udpipe/users-manual#run_udpipe_tokenizer, in section '1.3. Tokenizer' and section '1.4 Input Formats'.

Each of the following calls means something different:

library(udpipe)
udpipe_download_model("hungarian-szeged")
udpipe("8/2013. (I. 30.) EMMI rendelet:", "hungarian-szeged-ud-2.4-190531.udpipe")
udpipe("8/2013. (I. 30.) EMMI rendelet:", "hungarian-szeged-ud-2.4-190531.udpipe", tokenizer = "tokenizer")
udpipe("8/2013. (I. 30.) EMMI rendelet:", "hungarian-szeged-ud-2.4-190531.udpipe", tokenizer = "tokenizer=ranges")
udpipe("8/2013. (I. 30.) EMMI rendelet:", "hungarian-szeged-ud-2.4-190531.udpipe", tokenizer = "horizontal")
udpipe("8/2013. (I. 30.) EMMI rendelet:", "hungarian-szeged-ud-2.4-190531.udpipe", tokenizer = "vertical")

If you use tokenizer = "horizontal", the documentation says the input is expected to have "each sentence on a separate line, with tokens separated by spaces. In order to allow spaces in tokens, Unicode character 'NO-BREAK SPACE' (U+00A0) is considered part of token and converted to a space during loading". In other words, this option assumes you have already tokenised the text with another tokeniser. If you have already tokenised your data, an example is shown at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html (see the section 'My text data is already tokenised').
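As a rough sketch of what that input format looks like in practice, the snippet below builds a "horizontal" string from tokens that some other tokeniser has already produced. The token list and the helper `to_horizontal()` are made up for illustration and are not part of the udpipe package; the annotation call only runs if the package is installed and the model file from the calls above is in the working directory.

```r
# Hypothetical pre-tokenised data: one character vector of tokens per sentence.
# Note that the second token contains a space on purpose.
sentences <- list(
  c("8/2013.", "(I. 30.)", "EMMI", "rendelet", ":")
)

# Hypothetical helper (not a udpipe function): join the tokens of each
# sentence with spaces and put one sentence per line. A space inside a
# token is replaced by NO-BREAK SPACE (U+00A0) so it is not read as a
# token boundary.
to_horizontal <- function(sentences) {
  lines <- vapply(sentences, function(tokens) {
    paste(gsub(" ", "\u00A0", tokens, fixed = TRUE), collapse = " ")
  }, character(1))
  paste(lines, collapse = "\n")
}

x <- to_horizontal(sentences)

# Annotate only when udpipe and the model file are actually available.
model_file <- "hungarian-szeged-ud-2.4-190531.udpipe"
if (requireNamespace("udpipe", quietly = TRUE) && file.exists(model_file)) {
  anno <- udpipe::udpipe(x, model_file, tokenizer = "horizontal")
  print(anno[, c("sentence_id", "token_id", "token")])
}
```

If the text has not been tokenised beforehand, the default tokenizer = "tokenizer" is probably the option you want instead.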
