gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

Buggy "bytes" encoding when using stri_encode() on Windows (from ISO-8859-1 to ISO-8859-1) #384

Closed PolMine closed 4 years ago

PolMine commented 4 years ago

To ensure that my polmineR package is portable, it needs to process textual data with different encodings in Windows and *nix environments, i.e. with ISO-8859-1 and UTF-8 locales. The behavior of stringi I report here has caused me a few headaches, and it looks like a bug to me.

The original issue I encountered was that a conversion from "latin1" to "ISO-8859-1" produced broken output on a Windows server. Such a conversion is pointless, but it is a scenario that should work. More generally, the same effect occurs when converting from ISO-8859-1 to ISO-8859-1.

library(stringi)
y <- stri_encode("verrückt!", from = "ISO8859-1", to = "ISO8859-1")
Encoding(y) # is "bytes" and not "ISO8859-1"
y # is broken
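
To tell apart "the bytes were changed" from "only the encoding mark changed", one could compare the raw bytes before and after the conversion. This is only a sketch: the input is built from explicit ISO-8859-1 bytes (an assumption made here so that the result does not depend on the encoding of the script file or the console), and no particular output is claimed.

```r
library(stringi)

# "verrückt!" as explicit ISO-8859-1 bytes; 0xfc is the u-umlaut.
x <- rawToChar(as.raw(c(0x76, 0x65, 0x72, 0x72, 0xfc, 0x63, 0x6b, 0x74, 0x21)))

# Same no-op conversion as in the report above.
y <- stri_encode(x, from = "ISO-8859-1", to = "ISO-8859-1")

Encoding(y)    # the encoding mark attached to the result
charToRaw(x)   # bytes going in
charToRaw(y)   # bytes coming out, for comparison with the line above
```

If the two byte sequences agree, the string content survived and the difference lies in how R has marked the result, and therefore in how it prints it.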

While working on the reprex, I realized that conversion to ISO-8859-1 causes problems more generally on Windows. But usually you get a warning - not in this case. Note that I have looked into the issue on macOS too, but it really seems to be a Windows thing.

R 4.0.2 (Windows) stringi package version: 1.4.6

gagolews commented 4 years ago

This behaviour is a feature. It is explained in the manual.

See stri_enc_tonative() if your native encoding is not UTF-8 and you want the output strings to be marked as natively encoded (e.g., latin1).

Basically you should work with Unicode wherever possible.
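
A minimal sketch of that advice, assuming the input arrives as ISO-8859-1 bytes (constructed explicitly here so that the example does not depend on the session locale): convert to UTF-8 once at the boundary, work with the UTF-8 string, and call stri_enc_tonative() only where a natively encoded string is really needed.

```r
library(stringi)

# "verrückt!" as explicit ISO-8859-1 bytes; 0xfc is the u-umlaut.
x <- rawToChar(as.raw(c(0x76, 0x65, 0x72, 0x72, 0xfc, 0x63, 0x6b, 0x74, 0x21)))

# Convert to UTF-8 on input and keep working with the UTF-8 string.
u <- stri_encode(x, from = "ISO-8859-1", to = "UTF-8")
Encoding(u)      # encoding mark of the converted string
stri_length(u)   # number of code points, not bytes

# Only if a natively encoded string is required (e.g., in a latin1 locale):
n <- stri_enc_tonative(u)
Encoding(n)      # depends on the native encoding of the session
```

What stri_enc_tonative() returns depends on the session's native encoding, so the last two lines will behave differently on a UTF-8 system than on a Windows machine with a latin1 locale.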