bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

korean encoding issue #9

Closed mrchypark closed 6 years ago

mrchypark commented 6 years ago

When I tried to annotate Korean text, the text encoding of the result was broken. I fixed it by adding the code below. I checked on Windows and Ubuntu 16.04.

windows

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] udpipe_0.3 RevoUtils_10.0.6 RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 Matrix_1.2-11 tools_3.4.2 yaml_2.1.14
[5] Rcpp_0.12.13 grid_3.4.2 data.table_1.10.4-2 lattice_0.20-35

ubuntu

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] udpipe_0.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 lattice_0.20-35 digest_0.6.13 withr_2.1.1
[5] grid_3.4.3 R6_2.2.2 git2r_0.20.0 httr_1.3.1
[9] curl_3.0 data.table_1.10.4-3 Matrix_1.2-12 devtools_1.13.4
[13] tools_3.4.3 yaml_2.1.16 compiler_3.4.3 memoise_1.1.0
[17] knitr_1.17

jwijffels commented 6 years ago

It does not make sense to add this in the function. Make sure x is in UTF-8 encoding, as the doc indicates. Closing.

mrchypark commented 6 years ago

https://mrchypark.github.io/udpipe_korean_error/

jwijffels commented 6 years ago

Yes, that's correct: you need to make sure x is in UTF-8 encoding, which is what the doc of udpipe_annotate indicates. So the second example is how you should do it. Let me show the output of your first example on my computer. If I type this into my console, it is already UTF-8, which is what udpipe_annotate asks me to give. If you have data in another encoding, you just need to convert it to UTF-8 before giving it to udpipe_annotate, as you showed. For this reason, incorporating the pull request would give errors on other computers where the default locale is something other than yours, as shown below.

> x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다." 
> Encoding(x) 
[1] "UTF-8" 
> iconv(x, to = "UTF-8") 
[1] "안녕하세요. 저는 박찬엽입니다. 한글ì\u009d˜ ì\u009d¸ì½”딩 문제를 재현하려고 합니다."
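A minimal sketch of the "convert to UTF-8 before calling udpipe_annotate" step, using base R only. `to_utf8` is a hypothetical helper name, not part of udpipe. It relies on `enc2utf8()`, which converts according to each string's declared encoding and so leaves strings already marked as UTF-8 untouched. `iconv()` without `from` instead assumes the bytes are in the native locale encoding, which appears to be what garbles the already-UTF-8 string in the transcript above. The Latin-1 test string is only an illustration chosen to be portable across platforms.

```r
# Hypothetical helper (not part of udpipe): return x as UTF-8.
# - from = NULL: trust the declared encoding (Encoding(x)); strings marked
#   "unknown" are treated as being in the native locale encoding.
# - from = "<name>": treat the bytes as being in that encoding, e.g. "CP949"
#   (which encoding names are available depends on the platform's iconv).
to_utf8 <- function(x, from = NULL) {
  stopifnot(is.character(x))
  if (is.null(from)) {
    return(enc2utf8(x))
  }
  iconv(x, from = from, to = "UTF-8")
}

# Illustration with a Latin-1 string built from raw bytes ("caf" + 0xE9).
x_latin1 <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
Encoding(x_latin1) <- "latin1"

x_utf8  <- to_utf8(x_latin1)   # converted using the declared encoding
x_again <- to_utf8(x_utf8)     # already UTF-8: returned unchanged

# Same bytes without a declared encoding, so the source is named explicitly.
x_explicit <- to_utf8(rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))),
                      from = "latin1")
```

With a model loaded through udpipe_load_model, the converted vector would then be passed as the x argument of udpipe_annotate.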

 

mrchypark commented 6 years ago

@jwijffels Then how about checking whether Encoding(x) != "UTF-8" and printing a warning message such as "Make sure Encoding(x) is UTF-8. If not, try x <- iconv(x, to = "UTF-8") first."?

mrchypark commented 6 years ago

@jwijffels Anyway, can you show me your sessionInfo()? I tried assigning the text on Windows 10, Ubuntu 16.04 and macOS 10.13.2, and on every OS Encoding(x) returns "unknown".

jwijffels commented 6 years ago

Checking for 'unknown' encoding is not a good solution: ASCII text always gives encoding 'unknown', so that would generate warnings for every call in all European languages, even on CRAN. If you want to reproduce this, my environment is Dutch_Netherlands.1252:

Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"

Change the locale as follows:

Sys.setlocale("LC_ALL", locale = "Korean_Korea.949")
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
localeToCharset()
[1] "CP949"
x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
Encoding(x)
[1] "unknown"
Sys.setlocale("LC_ALL", locale = "Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
localeToCharset()
[1] "ISO8859-1"
x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
Encoding(x)
[1] "UTF-8"

I think we should move this type of discussion to Issues, as the pull request will give errors on all European Windows machines.
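A small base-R sketch of why a check on Encoding(x) misfires for ASCII input. The alternative shown, `looks_like_utf8`, is a hypothetical helper built on `validUTF8()`; it is an assumption for illustration and was not proposed in this thread. It is only a heuristic: it tests whether the bytes form valid UTF-8, and some byte sequences from other encodings can pass by coincidence.

```r
# ASCII-only strings are never marked with an encoding, so the proposed
# Encoding(x) != "UTF-8" test would fire for perfectly valid input.
ascii <- "hello"
needs_warning_naive <- Encoding(ascii) != "UTF-8"   # TRUE for plain ASCII

# Hypothetical alternative (an assumption, not from the thread):
# check byte validity instead of the encoding mark.
looks_like_utf8 <- function(x) {
  stopifnot(is.character(x))
  all(validUTF8(x))
}

# Two bytes as a legacy double-byte encoding might produce them; they start
# with a UTF-8 continuation byte, so they are not valid UTF-8.
legacy_bytes <- rawToChar(as.raw(c(0xbe, 0xc8)))

ascii_ok  <- looks_like_utf8(ascii)          # ASCII is valid UTF-8
legacy_ok <- looks_like_utf8(legacy_bytes)   # invalid byte sequence
```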