daroczig / botor

Reticulate wrapper on 'boto3' with convenient helper functions -- aka "boto fo(u)r R"
https://daroczig.github.io/botor

convert python3 bytes literals to R string #3

Closed daroczig closed 5 years ago

nfultz commented 5 years ago

I would recommend a) dropping python2 support asap :) and b) passing the source encoding scheme (which can be read from the locale) to the python decode method, then re-encoding to UTF-8 so that the text comes back from python explicitly as UTF-8. If the bytes in python are in a different locale than the one rawToChar defaults to, the result will be "bad" / potentially corrupted. You can register an error handler on the python side: https://docs.python.org/3/library/stdtypes.html#bytearray.decode and then you won't need to try/catch decoding errors in R.
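A minimal sketch of what (b) could look like on the Python side, assuming the source encoding is known (the example bytes and encoding names are illustrative, not code from this project):

```python
# Latin-2 (ISO-8859-2) bytes for the Hungarian phrase "Szép nap"
raw = b'Sz\xe9p nap'

# Decode with the known source encoding; the resulting str is Unicode,
# so reticulate hands it to R as UTF-8 text rather than raw bytes.
text = raw.decode('iso-8859-2')          # → 'Szép nap'

# If the source encoding is uncertain, a lossy error handler avoids
# having to try/catch decoding errors on the R side:
lossy = raw.decode('utf-8', errors='replace')   # bad bytes become U+FFFD
```

The `errors='replace'` handler is the one the linked docs describe; `'ignore'` or `'backslashreplace'` would work the same way, trading information loss for never raising.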

daroczig commented 5 years ago

Would love to simplify my life with (a), but cannot really do that yet :S

Regarding the locale issue: good point, but I'm pretty sure that the embedded python uses the same locale as R, so I think that's fine, see e.g.:

$ LC_ALL=hu_HU R

> Sys.getlocale()
[1] "LC_CTYPE=hu_HU;LC_NUMERIC=C;LC_TIME=hu_HU;LC_COLLATE=hu_HU;LC_MONETARY=hu_HU;LC_MESSAGES=hu_HU;LC_PAPER=hu_HU;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU;LC_IDENTIFICATION=C"

> library(reticulate)
> l <- import('locale')
> l$getlocale()
[[1]]
[1] "hu_HU"

[[2]]
[1] "ISO8859-2"

> l$setlocale(l$LC_ALL, 'hu_HU.UTF-8')
[1] "hu_HU.UTF-8"

> Sys.getlocale()
[1] "hu_HU.UTF-8"
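The mismatch risk being discussed can be shown in a few lines of Python: the same bytes round-trip cleanly under the locale's ISO8859-2 encoding but are invalid as UTF-8 (a hypothetical illustration, using a Hungarian test word):

```python
# Hungarian pangram fragment; ISO-8859-2 covers all of these letters
hu = 'árvíztűrő'
iso_bytes = hu.encode('iso-8859-2')

# Round-trip under the matching encoding is lossless:
assert iso_bytes.decode('iso-8859-2') == hu

# Decoding the same bytes as UTF-8 raises, because 0xE1 ('á') starts
# a multi-byte UTF-8 sequence that the following bytes don't complete:
try:
    iso_bytes.decode('utf-8')
except UnicodeDecodeError:
    print('not valid UTF-8')
```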
nfultz commented 5 years ago

Makes sense. The data I've been dealing with is Chinese URL query strings - a mix of a few different encodings in a single data set (depending on people's OS and browser) - and in my experience it's much easier to clean that kind of data in python3 than to fight with base::iconv.
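A sketch of that kind of cleanup, assuming percent-encoded values that may be UTF-8 or a legacy Chinese encoding (the `clean_query` helper and the fallback order are assumptions for illustration, not code from this project):

```python
from urllib.parse import unquote_to_bytes

def clean_query(value: str) -> str:
    """Percent-decode a query string value and guess its encoding."""
    raw = unquote_to_bytes(value)
    # Try strict UTF-8 first: gb18030 accepts almost any byte sequence,
    # so putting it first would silently produce mojibake for UTF-8 input.
    for enc in ('utf-8', 'gb18030'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode('utf-8', errors='replace')   # last resort

clean_query('%E4%B8%AD%E6%96%87')   # UTF-8 bytes for 中文
clean_query('%D6%D0%CE%C4')         # GBK/GB18030 bytes for 中文
```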