RConsortium / wishlist

A wishlist of ideas from the ISC and community

Improve UTF-8 support on Windows #2

Open kevinushey opened 8 years ago

kevinushey commented 8 years ago

Currently, a number of R APIs mishandle UTF-8 strings on Windows. In many cases, this is because R attempts to round-trip UTF-8 strings through the system encoding, which fails whenever a string contains characters that the system encoding cannot represent. These problems appear most frequently when interacting with the file system, but also elsewhere. Some simple examples:


path.expand()

> path <- "鬼.R"
> path.expand(path)
[1] "<U+9B3C>.R"

list.files()

> setwd(tempdir())
> path <- "鬼.R"
> file.create(path)
[1] TRUE
> list.files()
[1] "?.R"

data.frame()

> df <- data.frame("鬼" = "鬼")
> df
  X.U.9B3C.
1  <U+9B3C>
> names(df)
[1] "X.U.9B3C."
> df[[1]]
[1] 鬼
Levels: <U+9B3C>
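
The round trip through the system encoding can be imitated on any platform with `iconv()`. This is only a rough sketch: `"latin1"` is an assumption standing in for whatever the Windows system code page happens to be, and the variable names are illustrative. It does not reproduce R's internal code paths; it only shows why a character such as 鬼 cannot survive a detour through a single-byte encoding.

```r
# "\u9b3c" is the escape for the character used in the examples above.
path <- "\u9b3c.R"
Encoding(path)  # strings built from \u escapes are marked "UTF-8"

# Imitate the round trip: UTF-8 -> single-byte encoding -> UTF-8.
# "latin1" is an assumed stand-in for the Windows system code page.
native    <- iconv(path,   from = "UTF-8",  to = "latin1")
roundtrip <- iconv(native, from = "latin1", to = "UTF-8")

# With the default sub = NA, iconv() is documented to return NA when a
# character cannot be converted, so the original string is lost here.
lost <- is.na(roundtrip) || !identical(roundtrip, path)

# An ASCII-only path survives the same detour unchanged.
ascii    <- "plain.R"
ascii_ok <- identical(
  iconv(iconv(ascii, from = "UTF-8", to = "latin1"),
        from = "latin1", to = "UTF-8"),
  ascii
)
```

The exact form of the damage (`NA`, `?`, or an escape such as `<U+9B3C>`) depends on the substitution strategy and the underlying iconv implementation, which is consistent with the different outputs shown above.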

The file system issues could potentially be handled by a separate R package that provides a new file system API for use within R, but it would be wonderful if they could be resolved in R itself. That said, one could imagine such a package offering a file system API that is more consistent, featureful, and opinionated in a number of ways.

Would it be possible for an R Consortium funded effort to spearhead this?

krlmlr commented 7 years ago

Relevant read: http://utf8everywhere.org/.