Allow for row-wise fun.aggregate in dcast

chrisvoorpostel commented 7 years ago

It would be nice to be able to choose how fun.aggregates are propagated when more than one are used.

For example, if I have:

DT <- data.table(x = rep(c(1:3),4),y = rep(c(1:4),3), z = c(1:3))
dcast(DT, x ~ y, value.var = "z", fun.aggregate = list(length,sum))
   x z_length_1 z_length_2 z_length_3 z_length_4 z_sum_1 z_sum_2 z_sum_3 z_sum_4
1: 1          1          1          1          1       1       1       1       1
2: 2          1          1          1          1       2       2       2       2
3: 3          1          1          1          1       3       3       3       3

The equations are repeated by binding the results together column-wise. Instead, I'd like to do the equivalent of this, directly from the syntax of dcast:

DT <- DT[, .(len = length(z), sum_z = sum(z)), by = .(x, y)]
DT <- melt(DT, id.vars = c("x","y"), variable.name = "func_name")
dcast(DT, func_name + x ~ y)
   func_name x 1 2 3 4
1:       len 1 1 1 1 1
2:       len 2 1 1 1 1
3:       len 3 1 1 1 1
4:     sum_z 1 1 1 1 1
5:     sum_z 2 2 2 2 2
6:     sum_z 3 3 3 3 3

which gets fairly messy with many functions + columns.

This feels like a reasonable extension of the casting syntax. Maybe something like:

 dcast(DT, x ~ y + .FA, value.var = "z", fun.aggregate = list(length,sum)) # default
 dcast(DT, .FA + x ~ y, value.var = "z", fun.aggregate = list(length,sum))

to allow for control of where the measures are split.

franknarf1 commented 7 years ago

I guess it kind of breaks the rule that the LHS of LHS ~ RHS determines the rows in the result. I like the .FA idea, but then there's the question of how to do this if you have multiple value.var (which for me is more common than multiple functions). Maybe also have .VV, with .FA + .VV + x ~ y meaning fun.agg and value.var get separate cols; and maybe even allowing the combination rule to be set, paste0(.FA, .VV) + x ~ y or something.

chrisvoorpostel commented 7 years ago

I think it keeps the rule - you are just explicitly setting what you are seeing in the row (whether it is .FA or .VV). If anything the current functionality breaks the rule that RHS is the columns of the result (if that is a rule), because .VV and .FA are implicitly column-wise.

I also like the .VV idea, as it allows for things like .FA + x ~ y + .VV, or any order of, keeping the same syntax

Rdatatable / data.table

Allow for row-wise fun.aggregate in dcast #1911