bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Force tagger to not split words made up of numbers and letters #80

Closed by sanchez5674 4 years ago

sanchez5674 commented 4 years ago

Hi,

Is there a way to prevent udpipe from breaking up names made up of numbers and letters? I have sentences that contain company names like 3DS, and the tokenizer splits the name into two tokens, which the POS tagger then labels as 3 (NUM) and DS (NOUN).

Thanks for the help.

Carlos

jwijffels commented 4 years ago

If you don't like how udpipe tokenises your text, you can use another tokenizer and hand the resulting tokens to udpipe. This is shown in the section 'My text data is already tokenised' of the annotation vignette: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html. Just put your tokens in a list (for example with strsplit) and specify tokenizer = "vertical" (one token per line) or tokenizer = "horizontal" (tokens separated by spaces, one sentence per line).
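A minimal sketch of that approach, assuming the text can be split on whitespace alone (so punctuation has to be separated by spaces already). The sentence and the model path `model_file` are made up for illustration; a model can be fetched with `udpipe_download_model()`. The annotation step only runs if the udpipe package and that model file are available.

```r
# Example sentence (hypothetical); punctuation is already separated by a space.
sentence <- "I bought shares of 3DS yesterday ."

# Split on whitespace only, so "3DS" stays a single token.
tokens <- strsplit(sentence, split = "[[:space:]]+")[[1]]

# tokenizer = "vertical" expects one token per line.
txt <- paste(tokens, collapse = "\n")

# Hypothetical path to a previously downloaded model file (an assumption).
model_file <- "english-ewt-ud-2.5-191206.udpipe"

if (requireNamespace("udpipe", quietly = TRUE) && file.exists(model_file)) {
  ud_model <- udpipe::udpipe_load_model(file = model_file)
  # Skip udpipe's own tokenizer and annotate the tokens as given.
  x <- udpipe::udpipe_annotate(ud_model, x = txt, tokenizer = "vertical")
  x <- as.data.frame(x)
  print(x[, c("token", "upos")])
}
```

With this input, 3DS reaches the tagger as one token. Which POS tag it gets still depends on the model.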