bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0
209 stars, 33 forks

Foreign symbols are not parsed well #47

Closed: rdatasculptor closed this issue 5 years ago

rdatasculptor commented 5 years ago

I have Dutch texts with words like "Carrière". Annotating these texts was never a problem until now. I am not sure why, or what I did wrong, but suddenly udpipe_annotate turns e.g. Carrière into the token Carri<U+653C><U+3E38>. Any ideas where to look or how to solve this? Many thanks in advance! I checked RStudio, and the encoding settings seem right (UTF-8).

rdatasculptor commented 5 years ago

I solved it by converting the text to UTF-8 before annotating: text <- enc2utf8(text).
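For anyone landing here with the same symptom, below is a minimal sketch of that fix in base R. It assumes the input string was marked as (or silently read as) Latin-1 rather than UTF-8, which is one plausible cause and not something confirmed in this thread. The variable names and the model file path in the commented-out part are hypothetical; the annotation step is left commented out because it needs a downloaded udpipe model.

```r
# Simulate a string that arrived in Latin-1 (assumption: this mimics the
# problematic input; the real texts in this issue were not shared).
text <- "Carri\xe8re"
Encoding(text) <- "latin1"

# Inspect the declared encoding before conversion.
enc_before <- Encoding(text)

# Convert to UTF-8, as done in the fix above.
text_utf8 <- enc2utf8(text)
enc_after <- Encoding(text_utf8)

# The converted string should match the UTF-8 literal for "Carrière".
same_as_utf8_literal <- identical(text_utf8, "Carri\u00e8re")

print(enc_before)
print(enc_after)
print(same_as_utf8_literal)

# Hypothetical annotation step (not run here; assumes the udpipe package is
# installed and a Dutch model file exists at this made-up path):
# library(udpipe)
# model <- udpipe_load_model(file = "dutch-model.udpipe")
# anno  <- as.data.frame(udpipe_annotate(model, x = text_utf8))
# anno$token
```

Checking Encoding() on the input before calling udpipe_annotate is a cheap way to spot strings that are not marked as UTF-8, since enc2utf8 only converts correctly when the declared encoding matches the actual bytes.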