Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.58k stars 976 forks source link

tibble name_repair and .. in variable names #3321

Open philippechataignon opened 5 years ago

philippechataignon commented 5 years ago

I sometimes use readxl package to read Excel file. Since last version (1.2.0) I noticed a problem related to the version 2.0 of the package tibble. Now, Excel columns with no header name are renamed ..1, ..2 because tibble changes his algorithm for repairing variable names. See this article : .name_repair.

I always use data.table for my data.frames and my program broke after readxl upgrade with a data.table error. It seems there is a confusion with variables names ..1, ..2 and the .. data.table syntax used as with=F replacement.

Perhaps data.table can interpret ..1 as a variable name because it can't be a indirection. I never realize before but ..var is a valid variable name in R.

library(data.table)
library(readxl)
tmp <- tempfile()
download.file(url = "https://chataignon.com/files/read_xl.xlsx", destfile = tmp)
# essai de l'URL 'https://chataignon.com/files/read_xl.xlsx'
# Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 5316 bytes
# ==================================================
# downloaded 5316 bytes

dt <- setDT(read_xlsx(tmp))
# New names:                                                                                                                                                                                               
# * `` -> `..2`
# * `` -> `..3`
unlink(tmp)
print(dt)
#      id     ..2        ..3      text Nom très long à la Excel
# 1: 1001 1002001  0.9199906       sdf                       45
# 2: 1002 1004004  0.1672665  dfsdfsdf                       65
# 3: 1003 1006009 -0.7392416 sdffdsfsd                       78
# 4: 1004 1008016 -0.9660944 sdfsdfsdf                       98
# 5: 1005 1010025 -0.3047245  dfsdfsdf                       45
# 6: 1006 1012036  0.6368077     fsdfs                        2
# 7: 1007 1014049  0.9928619 sdfsdfsdf                       12
# 8: 1008 1016064  0.4360834  dfsdfsdf                       13
# 9: 1009 1018081 -0.5216282       dff                        4
dt[, ..2]
# Error in `[.data.table`(dt, , ..2) : 
#   Variable '2' is not found in calling scope. Looking in calling scope because you used the .. prefix.
dt[, `..2`]
# Error in `[.data.table`(dt, , ..2) : 
#   Variable '2' is not found in calling scope. Looking in calling scope because you used the .. prefix.
dt[, "..2"]
#        ..2
# 1: 1002001
# 2: 1004004
# 3: 1006009
# 4: 1008016
# 5: 1010025
# 6: 1012036
# 7: 1014049
# 8: 1016064
# 9: 1018081

# Output of sessionInfo()

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.0 readxl_1.2.0     

loaded via a namespace (and not attached):
 [1] compiler_3.5.2   tools_3.5.2      pillar_1.3.1     rstudioapi_0.9.0 tibble_2.0.1     crayon_1.3.4     Rcpp_1.0.0       cellranger_1.1.0 pkgconfig_2.0.2  rlang_0.3.
MichaelChirico commented 5 years ago

What a terrible default :man_facepalming:

hardly seems like a "repair"

Atrebas commented 5 years ago

fwiw, openxlsx works well (and is more tinyverse-friendly).

library(data.table)
library(openxlsx)
DT <- setDT(openxlsx::read.xlsx("https://chataignon.com/files/read_xl.xlsx"))
DT
#      id      X2         X3      text Nom.très.long.à.la.Excel
# 1: 1001 1002001  0.9199906       sdf                       45
# 2: 1002 1004004  0.1672665  dfsdfsdf                       65
# 3: 1003 1006009 -0.7392416 sdffdsfsd                       78
jangorecki commented 5 years ago

There is another example of ..colname use in https://github.com/mlr-org/mlr3/blob/adc6768e9817364a969797eba9157c4f2844bc45/R/DataBackendDataTable.R#L141 which uses such name for own internal processing, probably not exposed to user