Closed waynelapierre closed 4 years ago
Hi, yeah I can create a short vignette, but essentially there is not much to say beyond the fact that you can apply collapse functions to a data.table and always get a data.table back, e.g. `DT %>% fgroup_by(a, b, c) %>% fmean` or `collap(DT, ~ a + b + c, fmean, keep.col.order = FALSE)` will give you the same thing as `DT[, lapply(.SD, mean, na.rm = TRUE), keyby = c("a","b","c")]`. There is also a function `qDT` in collapse to quickly (column-wise) convert various objects to data.table; you can row-wise convert a matrix to data.table using `mrtl(mat, names = TRUE, return = "data.table")`, and you can convert a nested list of objects to data.table using `unlist2d(l, DT = TRUE)`.
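A minimal sketch of these three converters (the object names and values are made up for illustration):

```r
library(collapse)
library(data.table)

# Column-wise conversion of a data.frame to a data.table
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
DT1 <- qDT(df)

# Row-wise conversion of a matrix to a data.table
mat <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("v1", "v2", "v3")))
DT2 <- mrtl(mat, names = TRUE, return = "data.table")

# Flatten a nested list of data frames into one data.table,
# with an identifier column recording the nesting structure
l <- list(g1 = data.frame(x = 1:2), g2 = data.frame(x = 3:4))
DT3 <- unlist2d(l, DT = TRUE)
```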
What is important is not to do something like `DT[, lapply(.SD, fmean), keyby = c("a","b","c")]`. This will work but execute very slowly, because collapse statistical functions like `fmean` are S3 generic. It is important to understand that collapse is not based around applying functions to data by groups using some apply mechanism as in dplyr or data.table; the grouped computation is done internally in C++. data.table internally optimizes some functions like `mean` or `sum`, which collapse allows you to do explicitly and programmatically by offering an optimized (grouped) set of statistical functions.
So to harness the speed of collapse functions we need to use the internal grouping mechanism of these functions. What you can do is something like `DT[, fmean(.SD, list(a, b, c)), .SDcols = setdiff(names(DT), c("a","b","c"))]`; this gives you a fast computation, but you are of course missing the grouping columns `a`, `b` and `c` in the output. So it is best to use the `fgroup_by` mechanism or the `collap` function and just apply them to data.tables as to any other data.frame. For `:=` operations it is easier to use collapse functions inside data.table. Here are some equivalent operations:
```r
# Grouped sum: data.table, then the collapse equivalent
DT[, v1_sum := sum(v1, na.rm = TRUE), by = c("a","b","c")]
DT[, v1_sum := fsum(v1, list(a, b, c), TRA = "replace_fill")] # "replace_fill" overwrites missing values, "replace" keeps them

# Grouped demeaning
DT[, v1_demean := v1 - mean(v1, na.rm = TRUE), by = c("a","b","c")]
DT[, v1_demean := fwithin(v1, list(a, b, c))]

# Grouped lag ordered by a time variable t
DT[order(t), v1_lag2 := shift(v1, 2), by = c("a","b","c")]
DT[, v1_lag2 := flag(v1, 2, list(a, b, c), t)]
```
collapse also offers a substitute for `:=` through the function `settransform`, i.e. `settransform(DT, v1_demean = fwithin(v1, list(a, b, c)))`, which will work with any data.frame or list. But of course `:=` in data.table is nice and parsimonious and may be more memory efficient.
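A minimal sketch showing that `settransform` is not tied to data.table (the column names and values here are made up):

```r
library(collapse)

df <- data.frame(g = c("a", "a", "b"), v1 = c(1, 3, 5))

# Add a group-demeaned column to a plain data.frame in place,
# analogous to DT[, v1_demean := fwithin(v1, g)] on a data.table
settransform(df, v1_demean = fwithin(v1, g))
```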
That's about all there is to it. I personally use data.table a lot for joins, melting and recasting, and all the other stuff I cannot do with collapse (like applying regression models to data by groups, e.g. `DT[, qDT(lmtest::coeftest(lm(v1 ~ v2 + v3))), by = c("a","b","c")]`). For statistical computations (aggregation, transformations, weighted statistics, panel data, matrix computations), collapse offers more advanced possibilities and flexibility than data.table.
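The grouped-regression pattern above can be sketched in full on simulated data (this assumes the lmtest package is installed; the variable names are made up):

```r
library(data.table)
library(collapse)
set.seed(1)

DT <- data.table(a  = rep(c("g1", "g2"), each = 50),
                 v1 = rnorm(100), v2 = rnorm(100), v3 = rnorm(100))

# Fit one regression per group; qDT converts the coefficient matrix
# returned by coeftest() into a data.table so results stack by group
res <- DT[, qDT(lmtest::coeftest(lm(v1 ~ v2 + v3))), by = a]
```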
Could you add a tutorial about using collapse with data.table on CRAN?