gaborcsardi / rencfaq

The R Encoding FAQ
Creative Commons Zero v1.0 Universal

A small empirical study of `basename()` and `normalizePath()` #6

Open jennybc opened 2 years ago

jennybc commented 2 years ago

I've been sorting out some filepath encoding issues in vroom and eventually grew desperate enough to make this table.

Anyone interested in this repo might also find it interesting. The OS / R version / locale combos are simply what I can easily lay my hands on or what I've had to set up for the vroom work:

            | R       |                           | encoding |                | encoding
OS          | version | locale                    | of input | function       | of output
------------+---------+---------------------------+----------+----------------+----------
macOS         4.1.2     en_CA.UTF-8                 UTF-8      basename()       "unknown" (but UTF-8 bytes)
macOS         4.1.2     en_CA.UTF-8                 UTF-8      normalizePath()  UTF-8

windows       4.2.0     English_United States.utf8  UTF-8      basename()       UTF-8
windows       4.2.0     English_United States.utf8  UTF-8      normalizePath()  UTF-8

ubuntu 18.04  4.2.0     C.UTF-8                     UTF-8      basename()       "unknown" (but UTF-8 bytes)
ubuntu 18.04  4.2.0     C.UTF-8                     UTF-8      normalizePath()  UTF-8

windows       4.1.2     English_United States.1252  UTF-8      basename()       UTF-8
windows       4.1.2     English_United States.1252  UTF-8      normalizePath()  UTF-8

ubuntu 18.04  4.2.0     en_US (this is ISO-8859-1)  UTF-8      basename()       "unknown" (but latin1 bytes)
ubuntu 18.04  4.2.0     en_US (this is ISO-8859-1)  UTF-8      normalizePath()  latin1
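
The rows where `basename()` returns "unknown" suggest one possible way to get a consistently marked result: pass the output through `enc2utf8()`, which treats unmarked strings as being in the native encoding and converts or re-marks them as UTF-8. The following is a minimal sketch under that assumption. `basename_utf8()` is a hypothetical name (it is not part of base R or vroom), and the approach has not been checked against every row of the table.

```r
# Hypothetical wrapper (not part of base R or vroom): basename() whose
# non-ASCII results are marked as UTF-8 regardless of the session locale.
basename_utf8 <- function(path) {
  # basename() may return a string marked "unknown" that holds bytes in the
  # native encoding; enc2utf8() converts from native (or declared) to UTF-8.
  enc2utf8(basename(path))
}

# ASCII-only names are unaffected and stay marked "unknown".
basename_utf8("some/dir/abc.csv")
Encoding(basename_utf8("some/dir/abc.csv"))
```

Pure ASCII strings are never marked, so `Encoding()` still reports "unknown" for them; the wrapper only matters for names with non-ASCII characters.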

Things that jump out:

Here's the code snippet I ran in various places:

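# Record the session: R version, OS family, and locale
R.version.string
.Platform$OS.type
Sys.getlocale()
l10n_info()

# Input: a path with a non-ASCII character; the \u escape makes it UTF-8
filepath <- "b\u00e9.csv"
Encoding(filepath)
charToRaw(filepath)

# Declared encoding and raw bytes of basename()'s output
Encoding(basename(filepath))
charToRaw(basename(filepath))

# Same for normalizePath(); mustWork = FALSE because the file need not exist
Encoding(normalizePath(filepath, mustWork = FALSE))
charToRaw(normalizePath(filepath, mustWork = FALSE))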
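
To make the comparison easier to repeat across machines, the snippet above could be wrapped in a small helper that returns one row per function, mirroring the columns of the table. This is only a sketch: `probe_encoding()` and `probe_all()` are hypothetical names introduced here, not functions from base R or vroom.

```r
# Hypothetical helper: apply `fun` to `path` and report the declared encoding
# of the input, the declared encoding of the output, and the output's bytes.
probe_encoding <- function(path, fun, ...) {
  out <- fun(path, ...)
  data.frame(
    input_encoding  = Encoding(path),
    output_encoding = Encoding(out),
    # as.character() on a raw vector gives two-digit hex, e.g. "c3" "a9"
    output_bytes    = paste(as.character(charToRaw(out)), collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Hypothetical convenience wrapper covering both functions studied here.
probe_all <- function(path) {
  rbind(
    basename      = probe_encoding(path, basename),
    normalizePath = probe_encoding(path, normalizePath, mustWork = FALSE)
  )
}

# try() keeps the script going in case the session's locale cannot
# represent the path at all.
filepath <- "b\u00e9.csv"
print(try(probe_all(filepath)))
```

Running this in each OS / R version / locale combination should give the two rows for that combination, with the raw bytes shown alongside the declared encoding so that cases like "unknown" (but UTF-8 bytes) are visible directly.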