Closed: kwhkim closed this issue 2 years ago
Which operating system are you on? If it's Windows, I suggest you check your current native encoding with stri_enc_get().
It might be that R on your Windows machine doesn't properly recognise Korean characters as UTF-8.
Can you provide the results of Sys.getlocale() and stri_info() on your machine?
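For anyone hitting a similar problem, the requested diagnostics can be collected in one go. This is only a minimal sketch: the list name `diag` and its field names are made up for illustration, and `l10n_info()` is a base R addition that was not requested above.

```r
library(stringi)

# Collect the encoding-related settings that matter for this kind of report.
# stri_info() may emit a warning if the native charset maps poorly to Unicode.
diag <- list(
  locale       = Sys.getlocale(),                       # R's locale string
  native_stri  = stri_enc_get(),                        # what stringi treats as native
  native_icu   = stri_info()$Charset.native$Name.ICU,   # the underlying ICU converter
  utf8_session = l10n_info()[["UTF-8"]]                 # TRUE if R itself runs in UTF-8
)
str(diag)
```

Comparing `native_stri` with `native_icu` shows both the friendly name and the actual converter that stringi uses for the native encoding.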
On your advice, I tried this
> x <- iconv("힣", to='UTF-8')
> stri_conv(x, to='EUC-KR')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "EUC-KR") :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, to='cp949')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "cp949") :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
>
> iconv(x, to='EUC-KR')
[1] NA
> iconv(x, to='cp949')
[1] NA
>
>
> x <- '힣'
> stri_conv(x, to='EUC-KR')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "EUC-KR") :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, to='cp949')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, to = "cp949") :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
>
> iconv(x, to='EUC-KR')
[1] "힣"
> iconv(x, to='cp949')
[1] "힣"
It looks funny.
Here's my session info:
> stri_enc_get()
[1] "KSC_5601"
> Sys.getlocale()
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
> stri_info()
$Unicode.version
[1] "13.0"
$ICU.version
[1] "69.1"
$Locale
$Locale$Language
[1] "ko"
$Locale$Country
[1] "KR"
$Locale$Variant
[1] ""
$Locale$Name
[1] "ko_KR"
$Charset.internal
[1] "UTF-8" "UTF-16"
$Charset.native
$Charset.native$Name.friendly
[1] "KSC_5601"
$Charset.native$Name.ICU
[1] "windows-949-2000"
$Charset.native$Name.UTR22
[1] "windows-949-2000"
$Charset.native$Name.IBM
[1] NA
$Charset.native$Name.WINDOWS
[1] "windows-949"
$Charset.native$Name.JAVA
[1] "windows-949"
$Charset.native$Name.IANA
[1] NA
$Charset.native$Name.MIME
[1] "KSC_5601"
$Charset.native$ASCII.subset
[1] TRUE
$Charset.native$Unicode.1to1
[1] NA
$Charset.native$CharSize.8bit
[1] FALSE
$Charset.native$CharSize.min
[1] 1
$Charset.native$CharSize.max
[1] 2
$ICU.system
[1] FALSE
$ICU.UTF8
[1] FALSE
Warning message:
In stri_info() :
Your native charset does not map to Unicode well. This may cause serious problems. Consider switching to UTF-8.
I use Windows, so I think my encoding is cp949. Is there any way I can change it to UTF-8?
I don't know whether it's possible to switch to UTF-8 on Windows. I can change the encoding used for saving files to UTF-8, though.
The outputs of stri_conv with the destination encoding not being UTF-8 make sense, because it's not your platform's native encoding. This function is mostly useful when you'd wish to convert some text and then save it in a text file.
The support for UTF-8 as native encoding on Windows is still considered experimental, see https://github.com/r-windows/docs/blob/master/ucrt.md#readme
My bad, one of the calls above should have had the from="UTF-8" argument, because UTF-8 is not my native encoding.
Here are the modifications I made:
> x <- iconv("힣", to='UTF-8')
> stri_conv(x, from='UTF-8', to='EUC-KR', to_raw = TRUE)
[[1]]
[1] af fe
Warning message:
In stri_conv(x, from = "UTF-8", to = "EUC-KR", to_raw = TRUE) :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, from='UTF-8', to='cp949', to_raw = TRUE)
[[1]]
[1] af fe
Warning message:
In stri_conv(x, from = "UTF-8", to = "cp949", to_raw = TRUE) :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
>
> iconv(x, from='UTF-8', to='EUC-KR', toRaw=TRUE)
[[1]]
[1] c6 52
> iconv(x, from='UTF-8', to='cp949', toRaw = TRUE)
[[1]]
[1] c6 52
>
> stri_conv(x, from='UTF-8', to='EUC-KR')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, from = "UTF-8", to = "EUC-KR") :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
> stri_conv(x, from='UTF-8', to='cp949')
[1] "\\xaf\\xfe"
Warning message:
In stri_conv(x, from = "UTF-8", to = "cp949") :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
>
> iconv(x, from='UTF-8', to='EUC-KR')
[1] "힣"
> iconv(x, from='UTF-8', to='cp949')
[1] "힣"
>
> x <- iconv("힣", to='UTF-8')
> iconv(x, from='UTF-8', to = 'UTF-32BE')
Error in iconv(x, from = "UTF-8", to = "UTF-32BE") :
embedded nul in string: '\0\0龍'
> stri_conv(x, from='UTF-8', to='UTF-32BE')
Error in stri_conv(x, from = "UTF-8", to = "UTF-32BE") :
embedded nul in string: ''
> stri_conv(x, from='UTF-8', to='EUC-KR', to_raw = TRUE)
[[1]]
[1] af fe
Warning message:
In stri_conv(x, from = "UTF-8", to = "EUC-KR", to_raw = TRUE) :
the Unicode code point \U0000d7a3 cannot be converted to destination encoding
The result above shows explicitly what the problem is...
As far as I know, '힣' cannot be encoded in EUC-KR but can be encoded in CP949.
Anyway, I think my native encoding is cp949, which I confirmed with chcp on the command line.
I do not know where KSC_5601, which is the result of stri_enc_get(), comes from. Do you?
Here is what I tried.
> stri_enc_detect('어젯밤 나는 꿈을 꿨다. 하지만 가는 길에 돌아가서 오는 길에 돌아오다. 한글은 쓰기 쉽지만 한국어는 어렵다고들 한다지만 나는 괜찮다 왜냐하면 모국어라서')
[[1]]
Encoding Language Confidence
1 EUC-KR ko 1.00
2 ISO-8859-6 ar 0.26
3 IBM420_ltr ar 0.26
4 KOI8-R ru 0.18
5 ISO-8859-5 ru 0.12
6 UTF-16BE 0.10
7 UTF-16LE 0.10
8 GB18030 zh 0.10
9 EUC-JP ja 0.10
10 Big5 zh 0.10
11 windows-1256 ar 0.09
12 IBM420_rtl ar 0.09
13 ISO-8859-8 he 0.07
14 ISO-8859-1 fr 0.05
15 windows-1251 ru 0.02
As you can see from iconv('힣', to='UTF-32BE'), iconv can convert UTF-8 to any encoding scheme; nevertheless, an R character string cannot be marked with the encoding it is actually in, because it can only be marked as UTF-8 or the native encoding. So it prints a jumble of mojibake or uninterpretable letters.
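One way around the "embedded nul" errors shown earlier is to ask for raw bytes instead of a character string. A small sketch, assuming the local iconv build supports UTF-32BE (it did in the session above); the variable names are only for illustration.

```r
library(stringi)

x <- "\ud7a3"   # '힣' written as an escape, so the string is UTF-8 in any locale
Encoding(x)     # strings can only be marked "UTF-8", "latin1", "bytes" or "unknown"

# Request raw bytes: no character string is built, so the NUL bytes are harmless.
utf32_iconv <- iconv(x, from = "UTF-8", to = "UTF-32BE", toRaw = TRUE)[[1]]
utf32_iconv     # U+D7A3 in UTF-32BE is the four bytes 00 00 d7 a3

# The stringi counterpart uses to_raw = TRUE.
utf32_stri <- stri_conv(x, from = "UTF-8", to = "UTF-32BE", to_raw = TRUE)[[1]]
utf32_stri
```

Raw vectors also make it easy to compare what iconv and stri_conv actually produce, independently of how the console tries to render the result.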
Anyway, could you let me know what the source of the encoding conversion mappings is? I might be able to look into what is wrong...
It seems that the character U+D7A3 is not included in ICU's EUC-KR and CP949 mappings, but then it's an issue related to ICU, not just stringi.
EUC-KR uses https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/ibm-970_P110_P110-2006_U2.ucm
cp949 uses https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/ibm-949_P110-1999.ucm
The encodings that feature the above character are:
file:///pub/src/icu/data/mappings/ibm-1363_P11B-1998.ucm
16826: <UD7A3> \xC6\x52 |0
file:///pub/src/icu/data/mappings/ibm-1363_P110-1997.ucm
16950: <UD7A3> \xC6\x52 |0
file:///pub/src/icu/data/mappings/ibm-1364_P110-2007.ucm
17460: <UD7A3> \xD3\xBD |0
file:///pub/src/icu/data/mappings/icu-internal-compound-t.ucm
30911: <UD7A3> \xED\x9E\xA3 |0
file:///pub/src/icu/data/mappings/windows-949-2000.ucm
17182: <UD7A3> \xC6\x52 |0
So I guess you should be using stringi::stri_conv("힣", to='windows-949') in your case.
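The suggestion can be checked by converting to raw bytes and back. This is a sketch rather than a definitive recipe: it passes from="UTF-8" explicitly (as in the corrected calls earlier in the thread) and writes the character as an escape so that the result does not depend on the session's native encoding.

```r
library(stringi)

x <- "\ud7a3"   # '힣' (U+D7A3), UTF-8 encoded

# Name the Windows code page explicitly and request raw bytes for inspection.
bytes <- stri_conv(x, from = "UTF-8", to = "windows-949", to_raw = TRUE)[[1]]
bytes           # the windows-949-2000 mapping listed above gives c6 52

# Round trip back to UTF-8 to confirm that nothing was lost.
back <- stri_conv(bytes, from = "windows-949", to = "UTF-8")
identical(back, x)
```

If the bytes match what iconv(x, from='UTF-8', to='cp949', toRaw=TRUE) returned earlier (c6 52), the two libraries agree once the same code page is named.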
Yes... I did not know that 'CP949' can mean a different encoding, IBM's code page 949.
It looks like there ought to be some standard for naming encodings...
As far as I know, CP949 means Windows code page 949.
So cp949 is the Windows code page for iconv, and IBM's for ICU... Hmm
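To see which ICU converter a given name resolves to, stri_enc_info() can be queried for each alias. A minimal sketch; the vector `enc_names` and the result name `resolved` are illustrative, and the converter names returned depend on the ICU version bundled with stringi.

```r
library(stringi)

# For each name, report the ICU converter that stringi resolves it to.
enc_names <- c("cp949", "windows-949", "EUC-KR")
resolved <- vapply(
  enc_names,
  function(e) stri_enc_info(e)$Name.ICU,
  character(1)
)
resolved
```

Checking the resolved name before converting avoids relying on an alias that means one thing to iconv and another to ICU.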
I wonder what's going on with EUC-KR in iconv. There must be some other encoding named EUC-KR...
> iconv('힣', to='EUC-KR')
[1] "힣"
> iconv('힣', to='euc-kr')
[1] "힣"
> iconv('힣', to='euckr')
[1] NA
Seriously?
Since this is more a problem of encoding NAMES, I'll close this issue. I hope to find out how the encoding names are set and whether there are any standards.
Hello, thank you for the nice package!
EUC-KR and CP949 are well-known Korean encoding schemes.
It is well known that CP949 can encode almost all Korean characters, but EUC-KR cannot encode some of them.
For example, '힣' is known not to be in the EUC-KR character set.
I tried the code below, and the results do not seem to be correct.
Funny that iconv and stri_conv do not match. So I am wondering whether the authors of stringi or R core have something confused...
I wonder what reference you used...