Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Modify by reference fails with magrittr if `colnames()<-` was used before #4827

Open eliocamp opened 3 years ago

eliocamp commented 3 years ago

This is a weird one that I hadn't encountered before.

Normal data.table; modify by reference works with magrittr.

library(data.table)
library(magrittr)

data <- data.table(a = 1)

data %>% 
  .[, b := 2]

data$b
#> [1] 2

Now add a call to colnames()<-. Piping through magrittr no longer modifies data by reference, although assigning the result of the pipe to a new object still gives the expected table.

data <- data.table(a = 1)

colnames(data) <- "a"
data %>% 
  .[, b := 2]

data$b
#> NULL

data_2 <- data %>% 
  .[, b := 2]

data_2$b
#> [1] 2

However, using regular data.table syntax does modify by reference.

data[, b := 2]

data$b
#> [1] 2
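
For comparison, here is a minimal sketch of the same steps using setnames(), which data.table's own warning message suggests in place of names<-. The helper add_b() is hypothetical: it only stands in for the piped step, since, like the pipe, it calls `[` with the table bound to a name other than data. This shows the by-reference route only and is not a diagnosis of why colnames()<- behaves differently.

```r
library(data.table)

data <- data.table(a = 1)

# Rename by reference instead of using colnames(data) <- "a".
setnames(data, "a")

# Hypothetical helper standing in for `data %>% .[, b := 2]`.
add_b <- function(d) d[, b := 2]
add_b(data)

data$b
#> [1] 2
```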

Relatedly, when using attr()<-, data.table throws the expected warning. The warning says that the problem was detected and fixed but, in fact, it wasn't! :(

data <- data.table(a = 1)

attr(data, "a") <- "a"
data %>% 
  .[, b := 2]
#> Warning in `[.data.table`(., , `:=`(b, 2)): Invalid .internal.selfref detected
#> and fixed by taking a (shallow) copy of the data.table so that := can add this
#> new column by reference. At an earlier point, this data.table has been copied
#> by R (or was created manually using structure() or similar). Avoid names<- and
#> attr<- which in R currently (and oddly) may copy the whole data.table. Use set*
#> syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message
#> doesn't help, please report your use case to the data.table issue tracker so the
#> root cause can be fixed or this message improved.
data$b
#> NULL
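
The warning above points to setattr() as the by-reference alternative to attr<-. A minimal sketch of that route follows. As an assumption for illustration, the hypothetical helper add_b() again stands in for the piped step.

```r
library(data.table)

data <- data.table(a = 1)

# Set the attribute by reference instead of using attr(data, "a") <- "a".
setattr(data, "a", "a")

# Hypothetical helper standing in for `data %>% .[, b := 2]`.
add_b <- function(d) d[, b := 2]
add_b(data)

data$b
#> [1] 2
```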

Session info

``` r
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.3 (2020-10-10)
#>  os       elementary OS 5.1.7 Hera
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_GB:en
#>  collate  en_GB.UTF-8
#>  ctype    en_GB.UTF-8
#>  tz       America/Argentina/Buenos_Aires
#>  date     2020-12-02
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.3)
#>  callr         3.5.1      2020-10-13 [1] CRAN (R 4.0.3)
#>  cli           2.2.0      2020-11-20 [1] CRAN (R 4.0.3)
#>  crayon        1.3.4.9000 2020-11-11 [1] Github (r-lib/crayon@4bceba8)
#>  data.table  * 1.13.2     2020-10-19 [1] CRAN (R 4.0.3)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.2)
#>  devtools      2.3.2      2020-09-18 [1] CRAN (R 4.0.2)
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)
#>  ellipsis      0.3.1      2020-05-15 [1] RSPM (R 4.0.2)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.2)
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.2)
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.0.9003 2020-11-25 [1] Github (rstudio/htmltools@636b95e)
#>  knitr         1.30.2     2020-11-25 [1] Github (yihui/knitr@a00710b)
#>  magrittr    * 2.0.1      2020-11-17 [1] CRAN (R 4.0.3)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.2)
#>  pkgbuild      1.1.0      2020-07-13 [1] CRAN (R 4.0.2)
#>  pkgload       1.1.0      2020-05-29 [1] CRAN (R 4.0.2)
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.2)
#>  processx      3.4.4      2020-09-03 [1] CRAN (R 4.0.2)
#>  ps            1.4.0      2020-10-07 [1] CRAN (R 4.0.3)
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)
#>  remotes       2.2.0      2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.8.9002 2020-11-25 [1] Github (r-lib/rlang@b4e28cb)
#>  rmarkdown     2.5        2020-10-21 [1] CRAN (R 4.0.3)
#>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.3)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)
#>  testthat      3.0.0.9000 2020-11-23 [1] Github (r-lib/testthat@45a9c70)
#>  usethis       1.6.3      2020-09-17 [1] CRAN (R 4.0.2)
#>  withr         2.3.0      2020-09-22 [1] CRAN (R 4.0.3)
#>  xfun          0.19       2020-10-30 [1] CRAN (R 4.0.3)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/elio/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
```

ben-schwen commented 3 years ago

Although "normal" modify by reference works, chaining expressions seems to break this too...

library(data.table)
library(magrittr)

data = data.table(a = 1)
data %>% .[, b := 2][]

Error in `[.data.table`(., .[, `:=`(b, 2)], ) : When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

Using R's native pipe operator on R-devel seems to work fine:

library(data.table)
data = data.table(a = 1)
data |> \(d) d[, b := 2][]

> data  
   a b  
1: 1 2  

However, using R's native pipe with your example also seems to break the "intended" behavior: it does not throw an error, but it no longer modifies data by reference.

library(data.table)
data = data.table(a = 1)
colnames(data) <- "a"
data |> \(d) d[, b := 2]

> data
   a  
1: 1  
eliocamp commented 3 years ago

> Although "normal" modify by reference works, chaining expressions seems to break this too...

Yes, this is a known "limitation" of the %>% .[] syntax. Once you use it in a pipe, you cannot follow it with "native" data.table chaining.
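
One way around this limitation, sketched here under the assumption that magrittr's brace syntax is acceptable, is to wrap the chained expression in braces. magrittr then evaluates the whole expression with `.` bound to the left-hand side, instead of inserting `.` as the first argument of the outer `[` call, which is what the error message above shows happening.

```r
library(data.table)
library(magrittr)

data <- data.table(a = 1)

# Without braces, magrittr turns the right-hand side into
# `[`(., .[, b := 2], ), which data.table reads as a join and rejects.
# With braces, the chained expression is evaluated as written.
data %>% { .[, b := 2][] }

data$b
#> [1] 2
```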

> However using Rs native pipe together with your example seems to also break the "intended" behavior (although not throwing an error it does not change the reference anymore)

Thanks for testing that. Since the native pipe is only a syntax transformation, it shows that the issue is not due to some obscure magrittr shenanigans but to something on the data.table side.
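
The "syntax transformation" point can be checked directly: the parser rewrites a native pipe into an ordinary call before anything is evaluated. A small sketch, assuming R >= 4.1 (where `|>` is available); the function name f is a hypothetical placeholder and is never called.

```r
# quote() captures the parsed expression without evaluating it,
# so neither `data` nor `f` needs to exist here.
piped  <- quote(data |> f())
direct <- quote(f(data))

identical(piped, direct)
#> [1] TRUE
```

As far as I know, released versions of R require the right-hand side of `|>` to be a call, so the lambda form used earlier in this thread would need to be written as data |> (\(d) d[, b := 2])() there.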