Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.57k stars 974 forks source link

Unexpected behavior in `.BY` #5389

Closed cstepper closed 1 week ago

cstepper commented 2 years ago

Hi,

we are trying to extract the .BY information to a list object outside the current data.table. Unfortunately, we get unexpected results. All elements of the list of length(by groups) contain the same value - the .BY information from the last group. When using .GRP instead, the list is populated with different values - the correct group indices.

Is this expected behavior, and how can I get a list of correct by-group values be extracted?

Thanks!

library(data.table)

dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

byg = list()
grp = list()
dt[
  , m := {
    byg <<- append(byg, .BY)
    grp <<- append(grp, .GRP)
    mean(b)
  }
  , by = l
]

str(byg)
#> List of 5
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
str(grp)
#> List of 5
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
#>  $ : int 5

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13   knitr_1.39        magrittr_2.0.3    R.cache_0.15.0   
#>  [5] rlang_1.0.2       fastmap_1.1.0     fansi_1.0.3       stringr_1.4.0    
#>  [9] styler_1.7.0      highr_0.9         tools_4.2.0       xfun_0.31        
#> [13] R.oo_1.24.0       utf8_1.2.2        cli_3.3.0         withr_2.5.0      
#> [17] htmltools_0.5.2   ellipsis_0.3.2    yaml_2.3.5        digest_0.6.29    
#> [21] tibble_3.1.7      lifecycle_1.0.1   crayon_1.5.1      purrr_0.3.4      
#> [25] R.utils_2.11.0    vctrs_0.4.1       fs_1.5.2          glue_1.6.2       
#> [29] evaluate_0.15     rmarkdown_2.14    reprex_2.0.1      stringi_1.7.6    
#> [33] compiler_4.2.0    pillar_1.7.0      R.methodsS3_1.8.1 pkgconfig_2.0.3

Created on 2022-05-23 by the reprex package (v2.0.1)

ben-schwen commented 2 years ago

Seems more to be an issue of the collecting procedure, since printing the values works as expected.

library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

dt[
  , m := {
    cat("BY:", as.character(.BY), "\n")
    cat("GRP:", as.character(.GRP), "\n")
    mean(b)
  }
  , by = l
]
#> BY: a 
#> GRP: 1 
#> BY: b 
#> GRP: 2 
#> BY: c 
#> GRP: 3 
#> BY: d 
#> GRP: 4 
#> BY: e 
#> GRP: 5
ben-schwen commented 2 years ago

It seems that Rs lazy evaluation comes here into play. Forcing evaluation counters this.

library(data.table)

dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

byg = list()
grp = list()
dt[
  , m := {
    byg <<- append(byg, force(unlist(.BY)))
    grp <<- append(grp, .GRP)
    mean(b)
  }
  , by = l
]
str(byg)
#> List of 5
#>  $ l: chr "a"
#>  $ l: chr "b"
#>  $ l: chr "c"
#>  $ l: chr "d"
#>  $ l: chr "e"
str(grp)
#> List of 5
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
#>  $ : int 5
cstepper commented 2 years ago

Thanks for your comments @ben-schwen .

When using unlist(), forcing is not required. However, I haven't found a solution for getting back .BY without modifying it before (I do not want to unlist due to coercion). My feeling is, that when sending the .BY to a function using some src code (e.g. unlist, purrr::flatten), it works.

Is this expected? Any ideas for just returning the .BY lists?

library(data.table)

dt = data.table(
  l = sample(letters[1:2], 100, replace = TRUE)
  , n = sample(1L:2L, 100, replace = TRUE)
  , b = rnorm(100)
)

keys = c("l", "n")
setkeyv(dt, keys)

by_list = list()
by_unlist = list()
by_flatten = list()
dt[
  , m := {
    by_list <<- append(by_list, list(.BY))
    by_unlist <<- append(by_unlist, list(unlist(.BY)))
    by_flatten <<- append(by_flatten, list(purrr::flatten(.BY)))
    mean(b)
  }
  , by = keys
]

str(by_list)
#> List of 4
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
str(by_unlist)
#> List of 4
#>  $ : Named chr [1:2] "a" "1"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "a" "2"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "b" "1"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "b" "2"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
str(by_flatten)
#> List of 4
#>  $ :List of 2
#>   ..$ l: chr "a"
#>   ..$ n: int 1
#>  $ :List of 2
#>   ..$ l: chr "a"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 1
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2

Created on 2022-05-27 by the reprex package (v2.0.1)

cstepper commented 2 years ago

Please let me know, if you think this should better discussed on Stackoverflow. Thanks

ben-schwen commented 2 years ago

If you take a look at the source of dogropus https://github.com/Rdatatable/data.table/blob/e9a323de01a17af70d5316016606fa8d35b25023/src/dogroups.c#L91-L104 you can see that BY is always directly assigned and overwritten for each group, so with assigning it to a list, you always assign the same memory pointer.

This is also confirmed by the following.

library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

by_list = list()
# you should preallocate in R
# by_list = vector(mode="list", length(unique(dt$l)))

dt[
  , m := {
    by_list <<- append(by_list, copy(.BY))
    cat(address(.BY), "\n")
    mean(b)
  }
  , by = l
]
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930
by_list
#> $l
#> [1] "a"
#> 
#> $l
#> [1] "b"
#> 
#> $l
#> [1] "c"
#> 
#> $l
#> [1] "d"
#> 
#> $l
#> [1] "e"

You can achieve your wanted behavior with using copy(.BY). Similar info is in the help of ?.BY but it only mentions this behavior for .SD.

The working assignment without copying probably depends on whether R is internally copying or not. In previous R versions, it would always copy but R-core improved this a lot in the last years.

Anyway, we should clarify this in the documentation. So I guess here is the right place to discuss it.