Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

FR: preserve names of functions inside `data.table` when `by = ` #6219

Open iago-pssjd opened 4 months ago

iago-pssjd commented 4 months ago

See the following MRE:

table with by

> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9, w = c(rep(1, 5), rep(2, 4)))
> table(DT[, x, by = w])
   x
w   a b c
  1 2 3 0
  2 1 0 3
> DT[, table(x), by = w]
       w      V1
   <num> <table>
1:     1       2
2:     1       3
3:     2       1
4:     2       3
# table inside DT[...], we cannot see a,b,c
> table(DT[, x, by = w])
   x
w   a b c
  1 2 3 0
  2 1 0 3
# table outside DT[...], we can see a,b,c

table without by = ...

> DT[, table(x)]
x
a b c 
3 3 3 
# not a data.table, but
> DT[, .(table(x))]
        x     N
   <char> <int>
1:      a     3
2:      b     3
3:      c     3
# great! it shows a,b,c, however, with by = ...
> DT[, .(table(x)), by = w]
Error in `[.data.table`(DT, , .(table(x)), by = w): 
  All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.

Same (except for .(...)) with summary:

> summary(DT[, v])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       3       5       5       7       9 
> DT[, summary(v)]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       3       5       5       7       9 
> DT[, .(summary(v))]
Error in dimnames(x) <- dnx: 
  'dimnames' applied to non-array
> summary(DT[, v, by = w])
       w               v    
 Min.   :1.000   Min.   :1  
 1st Qu.:1.000   1st Qu.:3  
 Median :1.000   Median :5  
 Mean   :1.444   Mean   :5  
 3rd Qu.:2.000   3rd Qu.:7  
 Max.   :2.000   Max.   :9  
> DT[, summary(v), by = w]
        w               V1
    <num> <summaryDefault>
 1:     1             1.00
 2:     1             2.00
 3:     1             3.00
 4:     1             3.00
 5:     1             4.00
 6:     1             5.00
 7:     2             6.00
 8:     2             6.75
 9:     2             7.50
10:     2             7.50
11:     2             8.25
12:     2             9.00
> DT[, .(summary(v)), by = w]
        w               V1
    <num> <summaryDefault>
 1:     1             1.00
 2:     1             2.00
 3:     1             3.00
 4:     1             3.00
 5:     1             4.00
 6:     1             5.00
 7:     2             6.00
 8:     2             6.75
 9:     2             7.50
10:     2             7.50
11:     2             8.25
12:     2             9.00

Ideally

> DT[, table(x), by = w]
       w     x      V1
   <num> <char> <table>
1:     1     a       2
2:     1     b       3
3:     1     c       0
4:     2     a       1
5:     2     b       0
6:     2     c       3

(though rows 3 and 5 would perhaps only make sense if x is a factor), and similarly for DT[, .(summary(v)), by = w] with Min., ...

tdhock commented 4 months ago

The inconsistency between using table(x) with by and without by is strange, I agree. But in data.table I think the better way of building contingency tables is:

> DT[, .(count=.N), by=.(x,w)]
        x     w count
   <char> <num> <int>
1:      b     1     3
2:      a     1     2
3:      a     2     1
4:      c     2     3
> dcast(DT, x + w ~ ., length)
Key: <x, w>
        x     w     .
   <char> <num> <int>
1:      a     1     2
2:      a     2     1
3:      b     1     3
4:      c     2     3
> dcast(DT, w ~ x, length)
Key: <w>
       w     a     b     c
   <num> <int> <int> <int>
1:     1     2     3     0
2:     2     1     0     3

MichaelChirico commented 4 months ago

I find the example a bit confusing, what is DT[, x, by=w] meant to be? Why wouldn't we just do table(DT$x, DT$w) or DT[, table(w, x)]?

Perhaps you just want to coerce the table output to data.table?

DT[, as.data.table(table(x)), by=w]
#        w      x     N
#    <num> <char> <int>
# 1:     1      a     2
# 2:     1      b     3
# 3:     2      a     1
# 4:     2      c     3

To do this "automatically", we'd have to do something like, for each entry in the output of j, check if an as.data.table() or as.data.frame() method exists. My instinct is to let the user decide whether to do such coercion; regardless, it will almost surely be a breaking change to do this on the user's behalf starting now.
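One way to leave that choice with the user, sketched here for the summary() case from the original post, is to pull the names out explicitly inside j. The column names stat and value are made up for this sketch; data.table does not produce them on its own.

```r
library(data.table)

DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9,
                w = c(rep(1, 5), rep(2, 4)))

# Assumption: `stat` and `value` are illustrative column names.
res = DT[, {
  s = summary(v)                              # named summaryDefault vector
  .(stat = names(s), value = as.numeric(s))   # keep the names as a column
}, by = w]
res
```

This yields one row per group and statistic, with the labels (Min., 1st Qu., ...) in their own column instead of being dropped.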

As Toby points out, dcast() can also help filling out the "missing" entries here.
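If x is a factor, the explicit coercion shown above should also produce the zero-count rows from the "Ideally" output, because table() counts every factor level within each group. A minimal sketch, where DT2 is just a copy introduced for illustration:

```r
library(data.table)

DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9,
                w = c(rep(1, 5), rep(2, 4)))

# DT2 is a hypothetical copy with x converted to a factor (levels a, b, c).
DT2 = copy(DT)[, x := factor(x)]

# table() now reports all levels per group, including those with count 0.
res = DT2[, as.data.table(table(x)), by = w]
res
```

With a character x, levels absent from a group are simply unknown to table() inside that group, which is why those rows were missing before.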

tdhock commented 4 months ago

On second thought, I agree with Michael: the consistency issue is actually not that surprising. My mental model is that when there is no by then we just run the code in j and return the result, but when there is by then we do something different: run the code in j for each value of by columns, store each result in a list, then call rbindlist to obtain final result.

iago-pssjd commented 4 months ago

OK! Those objections are fine, and we can close this issue. But regarding the last comment by @tdhock:

"but when there is by then we do something different: run the code in j for each value of by columns, store each result in a list, then call rbindlist to obtain final result."

If that is the case, then the code in j is run for each group. Now, without by, data.table returns

> DT[, table(x)]
x
a b c 
3 3 3 
# not a data.table, but
> DT[, .(table(x))]
        x     N
   <char> <int>
1:      a     3
2:      b     3
3:      c     3
> as.data.table(DT[, table(x)])
        x     N
   <char> <int>
1:      a     3
2:      b     3
3:      c     3

so calling rbindlist should preserve column x, shouldn't it?

MichaelChirico commented 4 months ago

DT[, .(table(x))] is running as.data.table.list(), which in turn invokes as.data.table.table():

https://github.com/Rdatatable/data.table/blob/e4cc68c8004fdf5fce366f72d9665ea9c729e8ca/R/as.data.table.R#L148

In DT[, table(x)] , j is not a list so as.data.table.list() does not get invoked. So I would add a bit of nuance to Toby's mental model:

when there is no by and j is not a list then we just run the code in j and return the result, but when there is by or j is a list, then we do something different: run the code in j for each value of by columns, store each result in a list (if not already done), apply as.data.table to the list, then call rbindlist to obtain final result.
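The mental model above can be imitated at user level with split(), lapply() and rbindlist(). This is only a rough emulation under stated assumptions, not data.table's internal grouping code: the per-group coercion with as.data.table() is done explicitly here, and the names pieces and results are illustrative.

```r
library(data.table)

DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9,
                w = c(rep(1, 5), rep(2, 4)))

# Step 1: one piece per value of the by column (in order of first appearance).
pieces = split(DT, by = "w")

# Step 2: run the "j" code on each piece; the explicit as.data.table()
# is what turns the table's names into a real column x.
results = lapply(pieces, function(d) as.data.table(table(x = d$x)))

# Step 3: stack the per-group results; idcol restores the group label
# (as character here, since it comes from the list names).
res = rbindlist(results, idcol = "w")
res
```

The column x survives only because each per-group result is already a data.table before it reaches rbindlist(); a bare table object carries a, b, c as an attribute rather than as a column.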