gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

euc-kr and cp949 does not seem to be correct #466

Closed kwhkim closed 2 years ago

kwhkim commented 2 years ago

Hello, thank you for the nice package!

EUC-KR and CP949 are well-known Korean encoding schemes.

It is well known that CP949 can encode almost all Korean characters, whereas EUC-KR cannot encode some of them.

For example, '힣' is known not to be in the EUC-KR character set.

I tried the following, and the results do not seem to be correct.

> stri_conv('힣', to='EUC-KR')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv("힣", to = "EUC-KR") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv('힣', to='cp949')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv("힣", to = "cp949") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> iconv('힣', to='EUC-KR')
[1] "힣"
> iconv('힣', to='CP949')
[1] "힣"

Oddly, iconv and stri_conv do not agree.

So I wonder whether the authors of stringi or R core have gotten something confused...

I also wonder which reference you used...

gagolews commented 2 years ago

Which operating system are you on? If it's Windows, I suggest you check your current native encoding with stri_enc_get().

It might be that R on your Windows doesn't recognise Korean characters as UTF-8 properly..

Can you provide me with results of Sys.getlocale() and stri_info() on your machine?
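
A minimal sketch of the checks requested above, assuming the stringi package is installed. It writes the character as the escape "\ud7a3" rather than as a literal, so the result does not depend on how the console or script file is encoded; that choice is an assumption of this sketch, not something the thread prescribes.

```r
library(stringi)

# "\ud7a3" is the escape for '힣' (U+D7A3).
x <- "\ud7a3"

Encoding(x)                # encoding mark R attached to the string
charToRaw(x)               # bytes actually stored in the string
stri_enc_isutf8(x)         # TRUE if those bytes form valid UTF-8
stri_enc_get()             # native encoding as stringi sees it
Sys.getlocale("LC_CTYPE")  # native character type as R sees it
```

If charToRaw(x) does not show the UTF-8 bytes ed 9e a3, the input was not UTF-8 to begin with, and any conversion that assumes from = "UTF-8" will be working on the wrong bytes.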

kwhkim commented 2 years ago

Following your advice, I tried this:

> x <- iconv("힣", to='UTF-8')
> stri_conv(x, to='EUC-KR')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "EUC-KR") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, to='cp949')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "cp949") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> 
> iconv(x, to='EUC-KR')
[1] NA
> iconv(x, to='cp949')
[1] NA
> 
> 
> x <- '힣'
> stri_conv(x, to='EUC-KR')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "EUC-KR") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, to='cp949')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "cp949") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> 
> iconv(x, to='EUC-KR')
[1] "힣"
> iconv(x, to='cp949')
[1] "힣"

It looks odd..

Here's my session info:

> stri_enc_get()
[1] "KSC_5601"
> Sys.getlocale()
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
> stri_info()
$Unicode.version
[1] "13.0"

$ICU.version
[1] "69.1"

$Locale
$Locale$Language
[1] "ko"

$Locale$Country
[1] "KR"

$Locale$Variant
[1] ""

$Locale$Name
[1] "ko_KR"

$Charset.internal
[1] "UTF-8"  "UTF-16"

$Charset.native
$Charset.native$Name.friendly
[1] "KSC_5601"

$Charset.native$Name.ICU
[1] "windows-949-2000"

$Charset.native$Name.UTR22
[1] "windows-949-2000"

$Charset.native$Name.IBM
[1] NA

$Charset.native$Name.WINDOWS
[1] "windows-949"

$Charset.native$Name.JAVA
[1] "windows-949"

$Charset.native$Name.IANA
[1] NA

$Charset.native$Name.MIME
[1] "KSC_5601"

$Charset.native$ASCII.subset
[1] TRUE

$Charset.native$Unicode.1to1
[1] NA

$Charset.native$CharSize.8bit
[1] FALSE

$Charset.native$CharSize.min
[1] 1

$Charset.native$CharSize.max
[1] 2

$ICU.system
[1] FALSE

$ICU.UTF8
[1] FALSE

Warning message:
In stri_info() :
  Your native charset does not map to Unicode well. This may cause serious problems. Consider switching to UTF-8.

I use Windows, so my encoding is cp949, I think... Is there any way I can change it to UTF-8?

I don't know if it's possible to switch to UTF-8 on Windows. I can change the encoding used for saving files to UTF-8, though.

gagolews commented 2 years ago

The outputs of stri_conv with the destination encoding not being UTF-8 make sense, because it's not your platform's native encoding. This function is mostly useful when you'd wish to convert some text and then save it in a text file.

The support for UTF-8 as native encoding on Windows is still considered experimental, see https://github.com/r-windows/docs/blob/master/ucrt.md#readme
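
The remark above, that stri_conv to a non-native encoding is mostly useful for text destined for a file, could be sketched as follows. This assumes stringi is installed; the sample string ("한글", which EUC-KR can represent) and the temporary file are illustrative choices only.

```r
library(stringi)

x <- "\ud55c\uae00"  # "한글", stored as UTF-8

# Ask for raw bytes, so R never has to print text in an encoding
# it cannot mark on a character string.
bytes <- stri_conv(x, from = "UTF-8", to = "EUC-KR", to_raw = TRUE)[[1]]

# Write the bytes verbatim to a throwaway temporary file.
path <- tempfile(fileext = ".txt")
writeBin(bytes, path)

# Reading the bytes back with an explicit source encoding
# should recover the original text.
back <- stri_conv(readBin(path, "raw", n = file.size(path)),
                  from = "EUC-KR", to = "UTF-8")
back == x
```

Working with raw vectors keeps the converted bytes out of R's string printing, which is where the "\xaf\xfe"-style output in the transcripts comes from.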

kwhkim commented 2 years ago

My bad, one of the calls above should have had the from="UTF-8" argument, because UTF-8 is not the native encoding.

Here are the modifications I made.

> x <- iconv("힣", to='UTF-8')
> stri_conv(x, from='UTF-8', to='EUC-KR', to_raw = TRUE)
[[1]]
[1] af fe

Warning message:
In stri_conv(x, from = "UTF-8", to = "EUC-KR", to_raw = TRUE) :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, from='UTF-8', to='cp949', to_raw = TRUE)
[[1]]
[1] af fe

Warning message:
In stri_conv(x, from = "UTF-8", to = "cp949", to_raw = TRUE) :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> 
> iconv(x, from='UTF-8', to='EUC-KR', toRaw=TRUE)
[[1]]
[1] c6 52

> iconv(x, from='UTF-8', to='cp949', toRaw = TRUE)
[[1]]
[1] c6 52

> 
> stri_conv(x, from='UTF-8', to='EUC-KR')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, from = "UTF-8", to = "EUC-KR") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, from='UTF-8', to='cp949')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, from = "UTF-8", to = "cp949") :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> 
> iconv(x, from='UTF-8', to='EUC-KR')
[1] "힣"
> iconv(x, from='UTF-8', to='cp949')
[1] "힣"
> 
> x <- iconv("힣", to='UTF-8')
> iconv(x, from='UTF-8', to = 'UTF-32BE')
Error in iconv(x, from = "UTF-8", to = "UTF-32BE") : 
  embedded nul in string: '\0\0龍'
> stri_conv(x, from='UTF-8', to='UTF-32BE')
Error in stri_conv(x, from = "UTF-8", to = "UTF-32BE") : 
  embedded nul in string: ''
> stri_conv(x, from='UTF-8', to='EUC-KR', to_raw = TRUE)
[[1]]
[1] af fe

Warning message:
In stri_conv(x, from = "UTF-8", to = "EUC-KR", to_raw = TRUE) :
  the Unicode code point \U0000d7a3 cannot be converted to destination encoding

The result above shows explicitly what the problem is...

As far as I know, '힣' is not encodable in EUC-KR but is encodable in CP949.

Anyway,

I think my native encoding is cp949, which I confirmed with the chcp command.

I do not know where KSC_5601, the result of stri_enc_get(), comes from. Do you?

Here is what I tried.

> stri_enc_detect('어젯밤 나는 꿈을 꿨다. 하지만 가는 길에 돌아가서 오는 길에 돌아오다. 한글은 쓰기 쉽지만 한국어는 어렵다고들 한다지만 나는 괜찮다 왜냐하면 모국어라서')
[[1]]
       Encoding Language Confidence
1        EUC-KR       ko       1.00
2    ISO-8859-6       ar       0.26
3    IBM420_ltr       ar       0.26
4        KOI8-R       ru       0.18
5    ISO-8859-5       ru       0.12
6      UTF-16BE                0.10
7      UTF-16LE                0.10
8       GB18030       zh       0.10
9        EUC-JP       ja       0.10
10         Big5       zh       0.10
11 windows-1256       ar       0.09
12   IBM420_rtl       ar       0.09
13   ISO-8859-8       he       0.07
14   ISO-8859-1       fr       0.05
15 windows-1251       ru       0.02

kwhkim commented 2 years ago

As you can see from iconv('힣', to='UTF-32BE'),

iconv can convert UTF-8 to any encoding scheme, but an R character string cannot be marked with the encoding it then ends up in,

because a string can only be marked as UTF-8 or as the native encoding... So it prints as a jumble of mojibake or uninterpretable characters.

Anyway, could you let me know the source of the transcoding mappings? I might be able to look into what is wrong...
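
The point about encoding marks can be illustrated with base R alone. This is a sketch that assumes the platform's iconv supports the name "UTF-32BE". Asking iconv for raw output avoids the "embedded nul" error seen earlier, because the UTF-32BE bytes contain 0x00, which a character string cannot hold.

```r
x <- "\ud7a3"  # '힣' (U+D7A3), entered as an escape

# toRaw = TRUE returns a list of raw vectors instead of strings.
utf32 <- iconv(x, from = "UTF-8", to = "UTF-32BE", toRaw = TRUE)[[1]]
print(utf32)

# Encoding() reports the mark on each string. A string holding bytes
# in some other encoding has no mark of its own; "bytes" is the
# closest option, and it only says "do not interpret".
y <- x
Encoding(y) <- "bytes"
Encoding(c(x, y))
```

Since U+D7A3 fits in 16 bits, the four UTF-32BE bytes should be the code point padded with two leading zero bytes.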

gagolews commented 2 years ago

ICU doc: https://unicode-org.github.io/icu/userguide/conversion/converters.html

gagolews commented 2 years ago

Seems that character U+d7a3 is not included in EUC-KR and CP 949 in ICU - but then it's an issue related to ICU, not just stringi..

EUC-KR uses https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/ibm-970_P110_P110-2006_U2.ucm

cp949 uses https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/ibm-949_P110-1999.ucm

The encodings that feature the above characters are:

file:///pub/src/icu/data/mappings/ibm-1363_P11B-1998.ucm
16826: <UD7A3> \xC6\x52 |0
file:///pub/src/icu/data/mappings/ibm-1363_P110-1997.ucm
16950: <UD7A3> \xC6\x52 |0
file:///pub/src/icu/data/mappings/ibm-1364_P110-2007.ucm
17460: <UD7A3> \xD3\xBD |0
file:///pub/src/icu/data/mappings/icu-internal-compound-t.ucm
30911: <UD7A3> \xED\x9E\xA3 |0
file:///pub/src/icu/data/mappings/windows-949-2000.ucm
17182: <UD7A3> \xC6\x52 |0

So I guess you should be using stringi::stri_conv("힣", to='windows-949') in your case
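
One way to see which ICU converter each name resolves to is stri_enc_info(). The sketch below assumes stringi is installed and that the three aliases are known to the ICU build in use; according to the mapping files listed above, only the last one should contain U+D7A3.

```r
library(stringi)

# Which ICU converter does each alias resolve to?
aliases <- c("EUC-KR", "cp949", "windows-949")
vapply(aliases, function(a) stri_enc_info(a)$Name.ICU, character(1))

# Convert with the Windows code page named explicitly.
x <- "\ud7a3"  # '힣' (U+D7A3)
bytes <- stri_conv(x, from = "UTF-8", to = "windows-949", to_raw = TRUE)[[1]]
print(bytes)
```

If the windows-949 table applies, the printed bytes should match the \xC6\x52 entry quoted from windows-949-2000.ucm, which is also what iconv produced for cp949 earlier in the thread.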

kwhkim commented 2 years ago

Yes... I did not know 'CP949' could mean another encoding, IBM's code page 949.

It seems there ought to be some standard for naming encodings...

As far as I know, CP949 means Windows code page 949.

So cp949 is the Windows one for iconv, and IBM's for ICU... Hmm

I wonder what is going on with EUC-KR in iconv. There must be some other encoding named EUC-KR...

> iconv('힣', to='EUC-KR')
[1] "힣"
> iconv('힣', to='euc-kr')
[1] "힣"
> iconv('힣', to='euckr')
[1] NA

Seriously?

kwhkim commented 2 years ago

Since this is more a problem of encoding NAMES, I'll close this issue. I hope to find out how the encoding names are assigned and whether there is any standard.
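
For the iconv side of the naming question, base R's iconvlist() reports the names the platform's iconv claims to accept. This is only a sketch: the list differs between platforms, may omit aliases, and may be empty where it cannot be queried, so a missing name is a hint rather than proof.

```r
# Names accepted by this platform's iconv implementation.
known <- iconvlist()

# Anything that looks Korean-related.
grep("949|KR", known, ignore.case = TRUE, value = TRUE)

# Check specific spellings without attempting a conversion.
candidates <- c("EUC-KR", "euckr", "CP949")
setNames(toupper(candidates) %in% toupper(known), candidates)
```

Comparing this list with stringi::stri_enc_list() on the same machine shows where the two libraries use the same name for different mapping tables.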