Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

`rbind` cannot handle different name encodings #5452

Open MEO265 opened 2 years ago

MEO265 commented 2 years ago

In the following case, rbind fails:

library(data.table)

x <- data.table(A = 1, B = 2, C = 3)
y <- copy(x)
setnames(x, c("Ä", "Ö", "Ü"))
setnames(y, iconv(c("Ä", "Ö", "Ü"), from = "UTF-8", to = "latin1"))
Encoding(names(x))
Encoding(names(y))
rbind(x,y)

Output:

[1] "UTF-8" "UTF-8" "UTF-8"
[1] "latin1" "latin1" "latin1"

Error in rbindlist(l, use.names, fill, idcol) (rbindbug.R#7): Column 1 ['Ä'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names.
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_4.1.2

ben-schwen commented 2 years ago

Well, you could always use the option use.names = FALSE, but I guess what you had in mind was that binding with use.names = TRUE should also work when the names have different encodings.

MEO265 commented 2 years ago

Exactly, that's what I meant. Otherwise you would always have to keep track of the column order.
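One way to combine the two points above is to match the columns with base R's match(), which treats differently marked strings as equal when they agree after translation, and then bind by position. This is only a sketch under assumptions: rbind_matched is a hypothetical helper, not part of data.table, and it only handles two tables with exactly the same set of column names.

```r
library(data.table)

# Hypothetical helper (not part of data.table): bind two tables by column
# name, matching the names with base match(), which compares strings after
# translating differently marked encodings.
rbind_matched <- function(x, y) {
  idx <- match(names(x), names(y))
  if (anyNA(idx) || length(idx) != ncol(y)) {
    stop("column names of 'x' and 'y' do not match")
  }
  # Reorder y's columns into x's order, then bind by position.
  rbind(x, y[, idx, with = FALSE], use.names = FALSE)
}

# Ä, Ö, Ü written as escapes so the script does not depend on the
# encoding of the source file.
utf8_names <- c("\u00c4", "\u00d6", "\u00dc")

x <- data.table(A = 1, B = 2, C = 3)
y <- data.table(C = 30, B = 20, A = 10)  # deliberately in a different order
setnames(x, utf8_names)
setnames(y, iconv(rev(utf8_names), from = "UTF-8", to = "latin1"))

res <- rbind_matched(x, y)
```

The result takes its column names from the first table, so the encoding marks of x are the ones that survive.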

dernst commented 2 months ago

Today we encountered a different way to trigger this bug, although I am unsure whether it is actually a bug in R itself. Assume we have two data.tables, each with a column name containing non-ASCII characters; one is initialised in the call to data.table, the other is renamed with setnames:

dt1 = data.table(Ähm = 1)
dt2 = data.table(Ähm = 1)
setnames(dt2, "Ähm", "Ähm")

The column names now carry different encoding marks (though technically they're both UTF-8):

> colnames(dt1) |> Encoding()
[1] "unknown"
> colnames(dt2) |> Encoding()
[1] "UTF-8"

We get the "expected" result from rbind:

> rbind(dt1, dt2)
Error in rbindlist(l, use.names, fill, idcol) :
  Column 1 ['Ähm'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names.

For data.frames we'd also get an "unknown" encoding when we initialise a DF the same way as dt1. Lists have the same problem:

> list(Ähm = 1) |> names() |> Encoding()
[1] "unknown"

Perhaps this is related to make.names:

> Encoding("Ähm")
[1] "UTF-8"
> make.names("Ähm") |> Encoding()
[1] "unknown"

As I understand ?Encoding, any character string that contains non-ASCII characters should have an encoding of UTF-8 (when running in a UTF-8 locale), while character strings that contain only ASCII characters should have encoding "unknown". (I don't understand what the benefit of marking a string as "unknown" rather than UTF-8 would be.)
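To make the difference between the encoding mark and the string content concrete, here is a small base R sketch. It uses a UTF-8/latin1 pair rather than the "unknown"/UTF-8 pair from above, on the assumption that the latter depends on the session locale and is harder to reproduce portably. The variable names are purely illustrative.

```r
a <- "\u00c4hm"                               # "Ähm", marked as UTF-8
b <- iconv(a, from = "UTF-8", to = "latin1")  # same text, marked as latin1

Encoding(a)
Encoding(b)

# The underlying bytes differ between the two encodings ...
same_bytes <- identical(charToRaw(a), charToRaw(b))

# ... but base R comparisons translate the strings before comparing them.
same_text <- a == b
found <- match(a, b)

# enc2utf8() gives both strings the same encoding mark.
marks <- Encoding(enc2utf8(c(a, b)))
```

Here same_bytes is FALSE, while same_text is TRUE and match() finds the string, which is the behaviour one would intuitively expect from name matching in rbind as well.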

The question is then whether data.table should provide a workaround for this "quirk" in R. IMO it's more of an R problem.