I've been sorting out some filepath encoding issues in vroom and eventually grew desperate enough to make this table.
Anyone interested in this repo might also find it useful. The OS / R version / locale combinations are simply what I can easily lay my hands on or what I've had to set up for the vroom work:
OS           | R version | locale                     | encoding of input | function        | encoding of output
-------------+-----------+----------------------------+-------------------+-----------------+-----------------------------
macOS        | 4.1.2     | en_CA.UTF-8                | UTF-8             | basename()      | "unknown" (but UTF-8 bytes)
macOS        | 4.1.2     | en_CA.UTF-8                | UTF-8             | normalizePath() | UTF-8
windows      | 4.2.0     | English_United States.utf8 | UTF-8             | basename()      | UTF-8
windows      | 4.2.0     | English_United States.utf8 | UTF-8             | normalizePath() | UTF-8
ubuntu 18.04 | 4.2.0     | C.UTF-8                    | UTF-8             | basename()      | "unknown" (but UTF-8 bytes)
ubuntu 18.04 | 4.2.0     | C.UTF-8                    | UTF-8             | normalizePath() | UTF-8
windows      | 4.1.2     | English_United States.1252 | UTF-8             | basename()      | UTF-8
windows      | 4.1.2     | English_United States.1252 | UTF-8             | normalizePath() | UTF-8
ubuntu 18.04 | 4.2.0     | en_US (this is ISO-8859-1) | UTF-8             | basename()      | "unknown" (but latin1 bytes)
ubuntu 18.04 | 4.2.0     | en_US (this is ISO-8859-1) | UTF-8             | normalizePath() | latin1
Things that jump out:
- basename() appears to re-encode to the native encoding on Unix, but then marks the string as having "unknown" encoding.
- basename() retains the UTF-8 bytes and encoding mark on Windows, even if UTF-8 is not the native encoding.
- normalizePath() re-encodes to the native encoding on Unix and also marks the encoding correctly.
- normalizePath() retains the UTF-8 bytes and encoding mark on Windows, even if UTF-8 is not the native encoding.
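For anyone who wants to check a platform that isn't in the table, here is a minimal sketch of how the two things being compared (the declared encoding mark and the actual bytes) can be inspected. It is not necessarily the code used to build the table: the file name `café.csv`, the use of `tempdir()`, and the helper `inspect()` are all assumptions made for illustration.

```r
# Hypothetical UTF-8 input path with one non-ASCII character (e-acute).
# The \u escape keeps this script ASCII; enc2utf8() makes sure the string
# is UTF-8 encoded and marked as such, whatever the session locale is.
path <- enc2utf8(file.path(tempdir(), "caf\u00e9.csv"))

# Create the file so that normalizePath() has something real to resolve.
# (Assumption: the locale can represent the name; otherwise this may warn.)
file.create(path)

# Illustrative helper: report the encoding mark alongside the raw bytes.
# e-acute is c3 a9 in UTF-8 and e9 in latin1, so the bytes show which
# encoding is really in use even when the mark says "unknown".
inspect <- function(x) {
  list(
    mark  = Encoding(x),
    bytes = charToRaw(x)
  )
}

Encoding(path)                                  # mark of the input
inspect(basename(path))                         # mark + bytes after basename()
inspect(normalizePath(path, mustWork = FALSE))  # mark + bytes after normalizePath()

# Context for interpreting the result: is the session locale UTF-8 or Latin-1?
l10n_info()
```

Comparing `mark` with `bytes` is what separates a row like `"unknown" (but UTF-8 bytes)` from one like `"unknown" (but latin1 bytes)`.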
Here's the code snippet I ran in various places: