bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

korean encoding issue #10

Closed mrchypark closed 6 years ago

mrchypark commented 6 years ago

https://github.com/bnosac/udpipe/pull/9

new start.

jwijffels commented 6 years ago

I don't know of a clear solution if you really want to incorporate the code you provided inside the function. But the docs of udpipe_annotate are pretty clear on the input needed: x should be a character vector in UTF-8 encoding. If x is not in UTF-8 encoding, you need to convert it to UTF-8 yourself. This can be done with iconv, as in iconv(x, from = "CP949", to = "UTF-8"), where the list of available encodings is given by iconvlist(). For example:

Encoding("Je n'aime pas ça")
[1] "latin1"
Encoding(iconv("Je n'aime pas ça", from = "latin1", to = "UTF-8"))
[1] "UTF-8"
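The result of typing a literal such as "ça" depends on the session locale, so here is a minimal sketch that builds the latin1 bytes by hand and checks the conversion at the byte level. The variable names are only illustrative; iconv, Encoding, validUTF8 and charToRaw are base R.

```r
# "ça" as latin1 bytes (0xE7 is c-cedilla), built from raw values so the
# example does not depend on the locale or on the encoding of this script.
x_latin1 <- rawToChar(as.raw(c(0xE7, 0x61)))
Encoding(x_latin1) <- "latin1"

# Convert with an explicit source encoding.
x_utf8 <- iconv(x_latin1, from = "latin1", to = "UTF-8")

Encoding(x_utf8)    # "UTF-8"
validUTF8(x_utf8)   # TRUE
charToRaw(x_utf8)   # c3 a7 61
```

Checking the raw bytes is a way to confirm what was actually produced, independently of how the console chooses to print the string.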

But the default encoding if you type in text in R depends on your locale, mine is as follows.

Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
localeToCharset()
[1] "ISO8859-1"

So for my case I would need to do

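library(udpipe)
# Download the French model and load it from the downloaded file
ud_model <- udpipe_download_model("french")
ud_model <- udpipe_load_model(ud_model$file_model)
# Convert the input from latin1 (my locale's encoding) to UTF-8 before annotating
x <- udpipe_annotate(ud_model, x = iconv("Je n'aime pas ça", from = "latin1", to = "UTF-8"))
as.data.frame(x)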
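If the conversion is needed in several places, it could be wrapped in a small helper that fails loudly when iconv cannot convert an element (iconv returns NA in that case). This is only a sketch: as_utf8 is a hypothetical name and is not part of udpipe, and the caller still has to know the source encoding.

```r
# Hypothetical helper (not part of udpipe): convert x from a known source
# encoding to UTF-8 and stop if any non-missing element cannot be converted.
as_utf8 <- function(x, from) {
  out <- iconv(x, from = from, to = "UTF-8")
  failed <- is.na(out) & !is.na(x)
  if (any(failed)) {
    stop("could not convert element(s) ",
         paste(which(failed), collapse = ", "),
         " from ", from, " to UTF-8")
  }
  out
}

# Possible use with the model loaded above (txt is a placeholder for your text):
# x <- udpipe_annotate(ud_model, x = as_utf8(txt, from = "latin1"))
```

Stopping on a failed conversion avoids passing NA silently to the annotator when the declared source encoding is wrong.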

Testing inside udpipe_annotate whether x is in UTF-8 runs into the complications shown in the console session below.

As a result, I'm reluctant to do any fixes inside the udpipe_annotate function. The user just needs to make sure his input is in UTF-8.

> ## ASCII is always Encoding unknown
> Sys.setlocale("LC_ALL", locale = "Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> x <- "I drink milk in the morning"
> Encoding(x)
[1] "unknown"
> Encoding(iconv(x, to = "UTF-8"))
[1] "unknown"
> Sys.setlocale("LC_ALL", locale = "Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
> Encoding(x)
[1] "UTF-8"
> out <- iconv(x, from = "UTF-8", to = "CP949")
> iconv(out, from = "CP949", to = "UTF-8")
[1] "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
> result <- iconv(out, to = "UTF-8")
> Sys.setlocale("LC_ALL", locale = "Korean_Korea.949")
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
> result
[1] "¾È³çÇϼ¼¿ä. Àú´Â ¹ÚÂù¿±ÀÔ´Ï´Ù. ÇѱÛÀÇ ÀÎÄÚµù ¹®Á¦¸¦ ÀçÇöÇÏ·Á°í ÇÕ´Ï´Ù."
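
The garbled result comes from calling iconv without from: the source encoding then defaults to the current locale, so the CP949 bytes are read as Windows-1252. A minimal sketch of the same effect that does not depend on the session locale, assuming the platform's iconv knows "CP949" (check with iconvlist()):

```r
# The two bytes of the first syllable of the Korean sentence above in CP949,
# built from raw values so the result does not depend on the locale.
cp949_bytes <- rawToChar(as.raw(c(0xBE, 0xC8)))

# Correct: the source encoding is stated explicitly.
right <- iconv(cp949_bytes, from = "CP949", to = "UTF-8")

# Wrong: the same bytes read as latin1. For these two bytes latin1 and
# Windows-1252 agree, so this mimics a Dutch_Netherlands.1252 session
# where `from` is left out.
wrong <- iconv(cp949_bytes, from = "latin1", to = "UTF-8")

charToRaw(right)   # ec 95 88     (U+C548, the Hangul syllable)
charToRaw(wrong)   # c2 be c3 88  (U+00BE U+00C8, the first two garbled characters)
```

Both calls succeed without any error, which is why the wrong one is hard to detect after the fact.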

dselivanov commented 6 years ago

Absolutely agree with @jwijffels

mrchypark commented 6 years ago

I agree that the user just needs to make sure his input is in UTF-8.

How about adding a warning message if Encoding(x) != "UTF-8"?

dselivanov commented 6 years ago

As mentioned above, ASCII text will have "unknown" encoding, which is != "UTF-8".

On 19 Jan 2018 at 13:39, "Chan-Yub Park" notifications@github.com wrote:

I agree that the user just needs to make sure his input is in UTF-8.

How about adding a warning message if Encoding(x) != "UTF-8"?


dselivanov commented 6 years ago

I mean it would be nice to have such a check/warning, but it seems it will be tricky to implement.
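One possible heuristic, sketched here only as an illustration and not as something udpipe does: base R's validUTF8() inspects the bytes rather than the Encoding() mark, so plain ASCII passes. The function name warn_if_not_utf8 is hypothetical. The check has a real limit: it cannot tell what encoding the text was meant to be in, and bytes from another encoding can occasionally form valid UTF-8 by coincidence.

```r
# Hypothetical check (not part of udpipe): warn about elements whose bytes
# are not valid UTF-8. Missing values are ignored.
warn_if_not_utf8 <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  if (any(bad)) {
    warning("element(s) ", paste(which(bad), collapse = ", "),
            " are not valid UTF-8; convert them with iconv() first")
  }
  invisible(!bad)
}

ascii  <- "I drink milk in the morning"
latin1 <- rawToChar(as.raw(c(0xE7, 0x61)))        # "ça" as latin1 bytes
utf8   <- rawToChar(as.raw(c(0xC3, 0xA7, 0x61)))  # "ça" as UTF-8 bytes

Encoding(ascii) != "UTF-8"         # TRUE: the Encoding() test flags plain ASCII
validUTF8(c(ascii, latin1, utf8))  # TRUE FALSE TRUE
```

Whether a byte-level heuristic like this is acceptable inside udpipe_annotate is a separate design question for the maintainers.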

mrchypark commented 6 years ago

OK, it was just my opinion. Thank you both for the support and the discussion. Is it OK to close the issue?

jwijffels commented 6 years ago

Completely agree that it would be nice to have such a check/warning, but due to the 2 elements I just enumerated, I don't know of any valid way to implement this.

jwijffels commented 6 years ago

Closing this. Feel free to re-open if needed.