A small empirical study of `basename()` and `normalizePath()`

I've been sorting out some filepath encoding issues in vroom and eventually grew desperate enough to make this table.

Anyone interested in this repo might also find this interesting. The OS / R version / locale combos reflect what I can easily lay my hands on and what I've had to set up for the vroom work:

             | R       |                            | encoding |                 | encoding
OS           | version | locale                     | of input | function        | of output
-------------+---------+----------------------------+----------+-----------------+-----------------------------
macOS        | 4.1.2   | en_CA.UTF-8                | UTF-8    | basename()      | "unknown" (but UTF-8 bytes)
macOS        | 4.1.2   | en_CA.UTF-8                | UTF-8    | normalizePath() | UTF-8

Windows      | 4.2.0   | English_United States.utf8 | UTF-8    | basename()      | UTF-8
Windows      | 4.2.0   | English_United States.utf8 | UTF-8    | normalizePath() | UTF-8

Ubuntu 18.04 | 4.2.0   | C.UTF-8                    | UTF-8    | basename()      | "unknown" (but UTF-8 bytes)
Ubuntu 18.04 | 4.2.0   | C.UTF-8                    | UTF-8    | normalizePath() | UTF-8

Windows      | 4.1.2   | English_United States.1252 | UTF-8    | basename()      | UTF-8
Windows      | 4.1.2   | English_United States.1252 | UTF-8    | normalizePath() | UTF-8

Ubuntu 18.04 | 4.2.0   | en_US (this is ISO-8859-1) | UTF-8    | basename()      | "unknown" (but latin1 bytes)
Ubuntu 18.04 | 4.2.0   | en_US (this is ISO-8859-1) | UTF-8    | normalizePath() | latin1

Things that jump out:

basename() appears to re-encode to the native encoding on Unix (macOS and Ubuntu here), but then marks the string as having "unknown" encoding.
basename() retains the UTF-8 bytes and the UTF-8 encoding mark on Windows, even when UTF-8 is not the native encoding.
normalizePath() re-encodes to the native encoding on Unix and also marks the encoding correctly.
normalizePath() retains the UTF-8 bytes and the UTF-8 encoding mark on Windows, even when UTF-8 is not the native encoding.
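
Given the observations above, one possible way to get a predictable result out of basename() is to push its output through enc2utf8(). This is only a sketch: `basename_utf8()` is a hypothetical helper name, not something from base R or vroom, and it relies on the documented behaviour that enc2utf8() treats strings marked "unknown" as being in the native encoding and leaves strings already marked UTF-8 alone.

```r
# Hypothetical helper (not part of base R or vroom): take the basename of a
# path and return it as UTF-8, whatever the platform did to the bytes.
basename_utf8 <- function(path) {
  out <- basename(path)
  # Strings marked "unknown" are assumed to be in the native encoding and are
  # converted; strings already marked "UTF-8" pass through unchanged.
  enc2utf8(out)
}

x <- basename_utf8("some/dir/b\u00e9.csv")
Encoding(x)
charToRaw(x)
```

This only helps if the native encoding can represent every character in the filename. In a locale that cannot, the re-encoding that basename() appears to do on Unix would already have lost information before enc2utf8() is called.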

Here's the code snippet I ran in various places:

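# Record the environment: R version, OS family, and locale details
R.version.string
.Platform$OS.type
Sys.getlocale()
l10n_info()

# A path with a non-ASCII character (e with acute accent), marked as UTF-8
filepath <- "b\u00e9.csv"
Encoding(filepath)
charToRaw(filepath)

# Declared encoding and raw bytes after basename()
Encoding(basename(filepath))
charToRaw(basename(filepath))

# Same checks after normalizePath(); the file does not need to exist
Encoding(normalizePath(filepath, mustWork = FALSE))
charToRaw(normalizePath(filepath, mustWork = FALSE))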
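
To make the repeated Encoding() / charToRaw() pairs easier to compare across machines, the checks could be wrapped in a small helper. This is a minimal sketch; `describe_encoding()` is a hypothetical name introduced here for illustration and is not part of base R.

```r
# Hypothetical helper: report the declared encoding and the raw bytes (as hex)
# for each element of a character vector.
describe_encoding <- function(x) {
  data.frame(
    mark = Encoding(x),
    bytes = vapply(
      x,
      function(s) paste(charToRaw(s), collapse = " "),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

filepath <- "b\u00e9.csv"
describe_encoding(c(
  filepath,
  basename(filepath),
  normalizePath(filepath, mustWork = FALSE)
))
```

Each row then corresponds to one cell of the "encoding of output" column in the table, with the bytes alongside the mark.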
