Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Implement DT[, across(.SD, fun1, fun2, fun3), by=group] #4970

Open mattdowle opened 3 years ago

mattdowle commented 3 years ago

Inspired by dplyr::across and triggered by https://github.com/JuliaData/DataFrames.jl/pull/2725#discussion_r622424355

Instead of :

DT[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), by=group]

it could be

DT[, across(.SD, min, max), by=group]

I didn't find any related issues or PRs in a quick search. If there are any, and S.O. questions, please link them here.
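For reference, here is a quick sketch of how the existing idiom behaves on a made-up toy table (DT and group below are hypothetical); wrapping the unlist() in as.list() is one common way today to get a wide, one-row-per-group result with base-R-style names:

library(data.table)
DT <- data.table(group = c("g1", "g1", "g2"), a = 1:3, b = 4:6)

# verbatim idiom from above: long result, one row per summary value and group
DT[, unlist(lapply(.SD, function(x) c(max = max(x), min = min(x)))), by = group]

# common wide workaround: one row per group, columns a.max, a.min, b.max, b.min
DT[, as.list(unlist(lapply(.SD, function(x) c(max = max(x), min = min(x))))), by = group]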

avimallu commented 3 years ago

My 2¢: implementing across to work only with .SD, similarly to dplyr::across, would already be useful, but its utility increases further if across also accepts a character vector of column names or patterns(), not just .SD, so that it covers the selection flexibility dplyr::across allows. Example from here:

df %>%
  group_by(g1, g2) %>% 
  summarise(
    across(where(is.numeric), mean), 
    across(where(is.factor), nlevels),
    n = n(), 
  )
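As an aside, data.table already offers predicate- and pattern-based column selection through .SDcols, though only one selection per call; a minimal sketch on a made-up table:

library(data.table)
DT <- data.table(g1 = c("a", "a", "b"), g2 = c("x", "y", "x"),
                 v1 = 1:3, v2 = c(2.5, 3.5, 4.5), f = factor(c("u", "v", "u")))

# columns chosen by a predicate function (data.table >= 1.13.0)
DT[, lapply(.SD, mean), by = .(g1, g2), .SDcols = is.numeric]

# columns chosen by a regular expression
DT[, lapply(.SD, mean), by = .(g1, g2), .SDcols = patterns("^v")]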

Where is this useful? Creating different summary statistics (mean, median, unique counts, count of rows, % of any x over y) instead of doing a join after creating those columns, or chaining. Something that gives good flexibility (like a lapply and .SD currently do):

as.data.table(Lahman::Batting)[, .(
  across(patterns("^R$|(X.B)|HR"), .(sum, mean)),
  across(c("stint", "teamID"), .(last, uniqueN)),
  across(.SD, .(uniqueN, \(x) sum(x)/uniqueN(yearID)))
), by = playerID, .SDcols = c("R", "IBB", "SO")]

My understanding of baseball is a little rusty. What I was aiming for was to create a single table by player that gives me

  1. the sum and mean of runs, doubles, triples and home-runs, followed by
  2. the last value and unique count of stints and teams played for, followed by
  3. the number of runs, intentional walks and strikeouts per year

all in one shot. I often have to revert to summarizing my data in Excel or, if the data is too big, calculate all these independently in R and do a join at the end. In this specific case, we could even employ the proposed mergelist #4370 to create these efficiently, perhaps in parallel? It'll be amazing to have the ability to create this in R.

P.S. - I realize that .SD provides a list, while the others provide character vectors. I was aiming for more flexibility as opposed to using only .SD. It might also work if there was a way to split .SD into .SD_1, .SD_2 etc, but that would probably be a bit much.

myoung3 commented 3 years ago

Hi @mattdowle, I think what you're proposing here is closely related to what's being discussed in #1063, specifically the discussion around "colwise". #1063 groups together row operations and column operations into one issue, but they seem pretty separate to me (and I think rowwise operations would be better solved by implementing more functions like pmin/pmax as you suggested for psum in #3467).

I also like your suggestion of across (rather than colwise), and the proposed syntax since it will be familiar to dplyr users. It seems like you're suggesting the second argument be "..." to take an arbitrary number of functions, but I think it would be better if the second argument took a single function or a list like dplyr across (https://dplyr.tidyverse.org/reference/across.html).

Before we can implement across, I think we should solve #2311 by merging my PR #4883, since this addresses how columns are named in this situation.

Once #4883 is merged, we could just implement across so that it expands into several lapply(.SD, fun) calls concatenated together with c(). This will ensure GForce optimization is used without any additional work, e.g.:

x <- data.table(a=1:3,b=1:3)
x[, across(.SD, list(min=min, max=max))]

would just internally be expanded to

x[, c(min=lapply(.SD, min), max=lapply(.SD, max))]

and the resulting column names would be c("min.a", "min.b", "max.a", "max.b") which is consistent with base R and how naming is implemented in #4883.
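For reference, the base R c() naming that this expansion would lean on can be checked outside data.table with a plain list standing in for .SD (the object sd_like below is made up):

sd_like <- list(a = 1:3, b = 4:6)   # stand-in for .SD
names(c(min = lapply(sd_like, min), max = lapply(sd_like, max)))
# [1] "min.a" "min.b" "max.a" "max.b"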

An outstanding question is how we might name columns when functions are not explicitly named:

x[, across(.SD, list(min, max))]

Interactively it would be convenient for them to be c("min.a", "min.b", "max.a", "max.b") without explicitly tagging each function, but this might break down when the list of functions is specified programmatically, so perhaps c("F1.a", "F1.b", "F2.a", "F2.b") is more predictable.
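A small base R illustration of why name-from-deparse is fragile (not a proposed implementation, just the underlying issue): function names are recoverable when the functions are written literally in the call, but a list built elsewhere only exposes the variable name:

# functions written literally: their symbols are visible in the unevaluated call
sapply(as.list(quote(list(min, max)))[-1], deparse)
# [1] "min" "max"

# functions supplied via a pre-built list: only the variable name is visible
fl <- list(min, max)
sapply(as.list(quote(list(fl)))[-1], deparse)
# [1] "fl"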

myoung3 commented 3 years ago

Also see this discussion with Hadley (https://github.com/tidyverse/dtplyr/issues/173) on translating dplyr::across to data.table syntax for the dtplyr package.

Note that dplyr's across allows arbitrary specification of how the function name and the input column name are combined to name the output columns (both the order and the separator), but I'm not sure that's a road we want to go down (see the names argument here: https://dplyr.tidyverse.org/reference/across.html). Sticking with base R's naming behavior (e.g. c(A=list(a=3,2), B=list(a=1,b=2))) will be much easier to maintain, since naming in across will then rely directly on how naming works for x[, c(A=lapply(.SD, fun1), B=lapply(.SD, fun2))] (once #4883 is merged), without any additional magic code unique to across.
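For concreteness, the base R naming behavior referenced above resolves as follows:

names(c(A = list(a = 3, 2), B = list(a = 1, b = 2)))
# [1] "A.a" "A2"  "B.a" "B.b"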