Buggy "bytes" encoding when using stri_encode() on Windows (from ISO-8859-1 to ISO-8859-1)

For my polmineR package to be portable, it needs to process textual data with different encodings in Windows and *nix environments, i.e. with ISO-8859-1 and UTF-8 locales. The stringi behavior I report here has caused me a few headaches, and it looks like a bug to me.

The original issue I encountered was a broken conversion from "latin1" to "ISO-8859-1" on a Windows server. Such a conversion is pointless in practice, but it is a scenario that should work. More generally, the same effect occurs when converting from ISO-8859-1 to ISO-8859-1.

library(stringi)
y <- stri_encode("verrückt!", from = "ISO8859-1", to = "ISO8859-1")
Encoding(y) # is "bytes" and not "latin1" (the mark R uses for ISO-8859-1)
y # prints as a broken string
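To see what actually happens to the string, it can help to compare the raw bytes and the declared encoding before and after the call. The following is only a diagnostic sketch: the input is built from an explicit byte escape and marked as "latin1" so that it does not depend on the session locale, which is an assumption that differs from the reprex above. No particular output is claimed here, since the result depends on the platform.

```r
library(stringi)

# Build the test string from an explicit latin1 byte (0xfc = "ü") and mark it,
# so the input is identical on Windows, macOS and Linux.
x <- "verr\xfcckt!"
Encoding(x) <- "latin1"

y <- stri_encode(x, from = "ISO-8859-1", to = "ISO-8859-1")

# Declared encodings: Encoding() can only return "latin1", "UTF-8",
# "bytes" or "unknown".
Encoding(x)
Encoding(y)

# Raw bytes: if these are identical, only the encoding mark differs.
charToRaw(x)
charToRaw(y)
identical(charToRaw(x), charToRaw(y))
```

If the bytes turn out to be identical and only `Encoding(y)` differs, the problem would be in how the result is marked rather than in the conversion itself.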

While working on the reprex, I realized that conversion to ISO-8859-1 causes problems more generally on Windows. Usually you get a warning, but not in this case. Note that I have looked into the issue on macOS too, and it really seems to be specific to Windows.
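Until the behavior is clarified, one possible workaround is to let stringi convert to UTF-8 only and leave the final step to base R's iconv(), which marks its result as "latin1" when `to = "latin1"`. This is a sketch, not a confirmed fix for the Windows behavior: `to_latin1()` is a hypothetical helper, not part of stringi, and the input is again built from an explicit byte escape for portability.

```r
library(stringi)

# Hypothetical helper (not a stringi function): go through UTF-8 with
# stri_encode(), then let base iconv() produce the latin1-marked result.
to_latin1 <- function(x, from) {
  utf8 <- stri_encode(x, from = from, to = "UTF-8")
  iconv(utf8, from = "UTF-8", to = "latin1")
}

# Locale-independent latin1 input (0xfc = "ü").
x <- "verr\xfcckt!"
Encoding(x) <- "latin1"

y <- to_latin1(x, from = "ISO-8859-1")

Encoding(y)                             # declared encoding of the result
identical(charToRaw(x), charToRaw(y))   # bytes should survive the round trip
```

The design choice is simply to avoid asking stringi for an ISO-8859-1 target at all, since that is the step that appears to produce the "bytes" mark.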

R version: 4.0.2 (Windows)
stringi package version: 1.4.6

gagolews/stringi, issue #384