Closed: cstepper closed this 1 week ago
This seems to be an issue with how the values are collected rather than with the values themselves, since printing them works as expected.
library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)
dt[
, m := {
cat("BY:", as.character(.BY), "\n")
cat("GRP:", as.character(.GRP), "\n")
mean(b)
}
, by = l
]
#> BY: a
#> GRP: 1
#> BY: b
#> GRP: 2
#> BY: c
#> GRP: 3
#> BY: d
#> GRP: 4
#> BY: e
#> GRP: 5
It seems that R's lazy evaluation comes into play here. Forcing evaluation counters this.
library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)
byg = list()
grp = list()
dt[
, m := {
byg <<- append(byg, force(unlist(.BY)))
grp <<- append(grp, .GRP)
mean(b)
}
, by = l
]
str(byg)
#> List of 5
#> $ l: chr "a"
#> $ l: chr "b"
#> $ l: chr "c"
#> $ l: chr "d"
#> $ l: chr "e"
str(grp)
#> List of 5
#> $ : int 1
#> $ : int 2
#> $ : int 3
#> $ : int 4
#> $ : int 5
Thanks for your comments @ben-schwen.
When using unlist(), forcing is not required.
However, I haven't found a solution for getting back .BY without modifying it first (I do not want to unlist because of coercion).
My feeling is that when passing .BY to a function that uses some src code (e.g. unlist, purrr::flatten), it works.
Is this expected? Any ideas for just returning the .BY lists?
library(data.table)
dt = data.table(
l = sample(letters[1:2], 100, replace = TRUE)
, n = sample(1L:2L, 100, replace = TRUE)
, b = rnorm(100)
)
keys = c("l", "n")
setkeyv(dt, keys)
by_list = list()
by_unlist = list()
by_flatten = list()
dt[
, m := {
by_list <<- append(by_list, list(.BY))
by_unlist <<- append(by_unlist, list(unlist(.BY)))
by_flatten <<- append(by_flatten, list(purrr::flatten(.BY)))
mean(b)
}
, by = keys
]
str(by_list)
#> List of 4
#> $ :List of 2
#> ..$ l: chr "b"
#> ..$ n: int 2
#> $ :List of 2
#> ..$ l: chr "b"
#> ..$ n: int 2
#> $ :List of 2
#> ..$ l: chr "b"
#> ..$ n: int 2
#> $ :List of 2
#> ..$ l: chr "b"
#> ..$ n: int 2
str(by_unlist)
#> List of 4
#> $ : Named chr [1:2] "a" "1"
#> ..- attr(*, "names")= chr [1:2] "l" "n"
#> $ : Named chr [1:2] "a" "2"
#> ..- attr(*, "names")= chr [1:2] "l" "n"
#> $ : Named chr [1:2] "b" "1"
#> ..- attr(*, "names")= chr [1:2] "l" "n"
#> $ : Named chr [1:2] "b" "2"
#> ..- attr(*, "names")= chr [1:2] "l" "n"
str(by_flatten)
#> List of 4
#> $ :List of 2
#> ..$ l: chr "a"
#> ..$ n: int 1
#> $ :List of 2
#> ..$ l: chr "a"
#> ..$ n: int 2
#> $ :List of 2
#> ..$ l: chr "b"
#> ..$ n: int 1
#> $ :List of 2
#> ..$ l: chr "b"
#> ..$ n: int 2
Created on 2022-05-27 by the reprex package (v2.0.1)
Please let me know if you think this would be better discussed on Stack Overflow. Thanks
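For the question of getting the by-group values back without coercion, one possible sketch sidesteps .BY entirely and reads the group values from the key columns instead. This is not an approach proposed in the thread, just an illustration; the names groups and by_rows are hypothetical, and the data is built deterministically here (instead of with sample()) so that all four groups are guaranteed to exist.

```r
library(data.table)

# Deterministic variant of the example data: 2 x 2 groups, 25 rows each
dt = data.table(
  l = rep(c("a", "b"), each = 50)
  , n = rep(1:2, times = 50)
  , b = rnorm(100)
)
keys = c("l", "n")
setkeyv(dt, keys)

# One row per group, with the original column types preserved
groups = unique(dt[, keys, with = FALSE])

# Turn each row into a named list, similar in shape to .BY
by_rows = lapply(seq_len(nrow(groups)), function(i) as.list(groups[i]))
str(by_rows)
```

Because no coercion to a common type takes place, n stays an integer and l stays a character in every element.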
If you take a look at the source of dogroups
https://github.com/Rdatatable/data.table/blob/e9a323de01a17af70d5316016606fa8d35b25023/src/dogroups.c#L91-L104
you can see that BY is always assigned directly and overwritten for each group, so when you store it in a list, you always store the same memory pointer.
This is also confirmed by the following.
library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)
by_list = list()
# you should preallocate in R
# by_list = vector(mode="list", length(unique(dt$l)))
dt[
, m := {
by_list <<- append(by_list, copy(.BY))
cat(address(.BY), "\n")
mean(b)
}
, by = l
]
#> 0x56343a667930
#> 0x56343a667930
#> 0x56343a667930
#> 0x56343a667930
#> 0x56343a667930
by_list
#> $l
#> [1] "a"
#>
#> $l
#> [1] "b"
#>
#> $l
#> [1] "c"
#>
#> $l
#> [1] "d"
#>
#> $l
#> [1] "e"
You can achieve the behavior you want by using copy(.BY). Similar information is in the help at ?.BY, but it only mentions this behavior for .SD.
Whether the assignment works without an explicit copy probably depends on whether R copies internally. Earlier R versions would always copy, but R-core has improved this a lot in recent years.
Anyway, we should clarify this in the documentation. So I guess here is the right place to discuss it.
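Combining the two suggestions above (preallocating the list and using copy(.BY)) with the two-key example from earlier in the thread could look like the following minimal sketch. The variable n_groups and the deterministic data construction are assumptions made for illustration; indexing the preallocated list by .GRP is one way to fill it, not something prescribed by data.table.

```r
library(data.table)

# Deterministic variant of the two-key example: 2 x 2 groups, 25 rows each
dt = data.table(
  l = rep(c("a", "b"), each = 50)
  , n = rep(1:2, times = 50)
  , b = rnorm(100)
)
keys = c("l", "n")
setkeyv(dt, keys)

# Preallocate one slot per group instead of growing the list with append()
n_groups = uniqueN(dt, by = keys)
by_list = vector(mode = "list", length = n_groups)

dt[
  , m := {
    # copy() detaches the value from the buffer that is reused for each group
    by_list[[.GRP]] <<- copy(.BY)
    mean(b)
  }
  , by = keys
]
str(by_list)
```

Each element of by_list is then a separate named list, so the column types of the by variables are kept and no unlist() coercion is involved.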
Hi,
we are trying to extract the .BY information to a list object outside the current data.table. Unfortunately, we get unexpected results: all elements of the list (one per by-group) contain the same value, namely the .BY information from the last group. When using .GRP instead, the list is populated with different values, the correct group indices. Is this expected behavior, and how can I extract a list of the correct by-group values?
Thanks!
Created on 2022-05-23 by the reprex package (v2.0.1)