daroczig closed this issue 5 years ago
Would love to simplify my life with (a), but cannot really do that yet :S
Regarding the locale issue: good point, but I'm pretty sure the embedded Python uses the same locale as R, so I think that's fine. See e.g.:
$ LC_ALL=hu_HU R
> Sys.getlocale()
[1] "LC_CTYPE=hu_HU;LC_NUMERIC=C;LC_TIME=hu_HU;LC_COLLATE=hu_HU;LC_MONETARY=hu_HU;LC_MESSAGES=hu_HU;LC_PAPER=hu_HU;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU;LC_IDENTIFICATION=C"
> library(reticulate)
> l <- import('locale')
> l$getlocale()
[[1]]
[1] "hu_HU"
[[2]]
[1] "ISO8859-2"
> l$setlocale(l$LC_ALL, 'hu_HU.UTF-8')
[1] "hu_HU.UTF-8"
> Sys.getlocale()
[1] "hu_HU.UTF-8"
Makes sense. The data I'd been dealing with was Chinese URL query strings: a mix of a few different encodings in a single data set (depending on people's OS and browser). In my experience it's much easier to clean that type of data in Python 3 than to fight with base::iconv.
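To illustrate the kind of cleanup meant here, this is a minimal sketch (the candidate encoding list is an assumption, not from the original data set) that tries a few likely encodings and keeps the first one that decodes cleanly:

```python
# Hypothetical cleanup for query strings whose encoding varies per row:
# try a few candidate encodings and keep the first clean decode.
CANDIDATES = ["utf-8", "gb18030", "big5"]  # assumed ordering

def decode_mixed(raw: bytes) -> str:
    for enc in CANDIDATES:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes instead of raising.
    return raw.decode("utf-8", errors="replace")

print(decode_mixed("北京".encode("gb18030")))  # GB18030 bytes decode fine
```

This is heuristic, of course: a byte string that happens to be valid in an earlier candidate encoding will decode to mojibake rather than raise, so the ordering of the list matters.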
I would recommend a) dropping Python 2 support ASAP :) and b) passing the source encoding (which can be read from the locale) to Python's decode method, then re-encoding to UTF-8 so the text comes back from Python explicitly as UTF-8. If the bytes are in a different encoding than the one rawToChar assumes, the result will be "bad" / potentially corrupted. You can also register an error handler on the Python side (https://docs.python.org/3/library/stdtypes.html#bytearray.decode), and then you won't need to try/catch decoding errors in R.
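A minimal sketch of what (b) could look like on the Python side (the function name and the fallback to the locale's preferred encoding are assumptions, not reticulate's actual API):

```python
import locale

def decode_for_r(raw: bytes, src=None) -> str:
    """Decode raw bytes into text using the source encoding, never raising."""
    if src is None:
        # Read the source encoding from the locale, as suggested above.
        src = locale.getpreferredencoding(False)
    # errors="replace" is a built-in error handler: undecodable bytes
    # become U+FFFD instead of raising, so R needs no try/catch here.
    return raw.decode(src, errors="replace")

print(decode_for_r("北京".encode("gb18030"), "gb18030"))
```

Since Python 3 str is Unicode internally, returning the decoded str (which reticulate converts for R) makes the text explicitly UTF-8-clean by the time it reaches R, regardless of the source bytes.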