Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.58k stars 977 forks source link

Allow for row-wise fun.aggregate in dcast #1911

Open chrisvoorpostel opened 7 years ago

chrisvoorpostel commented 7 years ago

It would be nice to be able to choose how fun.aggregates are propagated when more than one are used.

For example, if I have:

DT <- data.table(x = rep(c(1:3),4),y = rep(c(1:4),3), z = c(1:3))
dcast(DT, x ~ y, value.var = "z", fun.aggregate = list(length,sum))
   x z_length_1 z_length_2 z_length_3 z_length_4 z_sum_1 z_sum_2 z_sum_3 z_sum_4
1: 1          1          1          1          1       1       1       1       1
2: 2          1          1          1          1       2       2       2       2
3: 3          1          1          1          1       3       3       3       3

The equations are repeated by binding the results together column-wise. Instead, I'd like to do the equivalent of this, directly from the syntax of dcast:

DT <- DT[, .(len = length(z), sum_z = sum(z)), by = .(x, y)]
DT <- melt(DT, id.vars = c("x","y"), variable.name = "func_name")
dcast(DT, func_name + x ~ y)
   func_name x 1 2 3 4
1:       len 1 1 1 1 1
2:       len 2 1 1 1 1
3:       len 3 1 1 1 1
4:     sum_z 1 1 1 1 1
5:     sum_z 2 2 2 2 2
6:     sum_z 3 3 3 3 3

which gets fairly messy with many functions + columns.

This feels like a reasonable extension of the casting syntax. Maybe something like:

 dcast(DT, x ~ y + .FA, value.var = "z", fun.aggregate = list(length,sum)) # default
 dcast(DT, .FA + x ~ y, value.var = "z", fun.aggregate = list(length,sum))

to allow for control of where the measures are split.

franknarf1 commented 7 years ago

I guess it kind of breaks the rule that the LHS of LHS ~ RHS determines the rows in the result. I like the .FA idea, but then there's the question of how to do this if you have multiple value.var (which for me is more common than multiple functions). Maybe also have .VV, with .FA + .VV + x ~ y meaning fun.agg and value.var get separate cols; and maybe even allowing the combination rule to be set, paste0(.FA, .VV) + x ~ y or something.

chrisvoorpostel commented 7 years ago

I think it keeps the rule - you are just explicitly setting what you are seeing in the row (whether it is .FA or .VV). If anything the current functionality breaks the rule that RHS is the columns of the result (if that is a rule), because .VV and .FA are implicitly column-wise.

I also like the .VV idea, as it allows for things like .FA + x ~ y + .VV, or any order of, keeping the same syntax