Open mattdowle opened 3 years ago
My 2¢:
While having across
implemented similarly to dplyr::across
will have utility if implemented only with .SD
, I think its utility can be increased further if we could get across to work with a character, patterns
or .SD
to allow it to work the way dplyr::across
allows. Example from here:
df %>%
group_by(g1, g2) %>%
summarise(
across(where(is.numeric), mean),
across(where(is.factor), nlevels),
n = n(),
)
Where is this useful?
Creating different summary statistics (mean, median, unique counts, count of rows, % of any x
over y
) instead of doing a join after creating those columns, or chaining. Something that gives good flexibility (like a lapply
and .SD
currently do):
as.data.table(Lahman::Batting)[, .(
across(patterns("$R^|(X.B)|HR"), .(sum, mean)),
across(c("stint", "teamID"), .(last, uniqueN)),
across(.SD, .(uniqueN, \(x) sum(x)/uniqueN(yearID))),
playerID,
.SDcols = c("R", "IBB", "SO")]
My understanding of baseball is a little rusty. What I was aiming for was to create a single table by payer that gives me
all in one shot. I often have to revert to summarizing my data in Excel or, if the data is too big, calculate all these independently in R and do a join at the end. In this specific case, we could even employ the proposed mergelist
#4370 to create these efficiently, perhaps in parallel? It'll be amazing to have the ability to create this in R.
P.S. - I realize that .SD provides a list, while the others provide character vectors. I was aiming for more flexibility as opposed to using only .SD
. It might also work if there was a way to split .SD
into .SD_1
, .SD_2
etc, but that would probably be a bit much.
Hi @mattdowle, I think what you're proposing here is closely related to what's being discussed in #1063, specifically the discussion around "colwise". #1063 groups together row operations and column operations into one issue, but they seem pretty separate to me (and I think rowwise operations would be better solved by implementing more functions like pmin/pmax as you suggested for psum in #3467).
I also like your suggestion of across (rather than colwise), and the proposed syntax since it will be familiar to dplyr
users. It seems like you're suggesting the second argument be "..." to take an arbitrary number of functions, but I think it would be better if the second argument took a single function or a list like dplyr across (https://dplyr.tidyverse.org/reference/across.html).
Before we can implement across, I think we should solve #2311 by merging my PR #4883, since this addresses how columns are named in this situation.
Once #4883 is merged, we could just implement across so that it expands into several lapply(.SD,) calls concatenated together with c(). This will ensure GForce optimization is used without any additional work. E.g:
x <- data.table(a=1:3,b=1:3)
x[, across(.SD, list(min=min, max=max)]
would just internally be expanded to
x[, c(min=lapply(.SD, min), max=lapply(.SD, max))]
and the resulting column names would be c("min.a", "min.b", "max.a", "max.b")
which is consistent with base R and how naming is implemented in #4883.
An outstanding question is how we might name columns when functions are not explicitly named:
x[, across(.SD, list(min, max)]
Interactively it would be convenient for them to be c("min.a", "min.b", "max.a", "max.b")
without explicitly tagging each function, but this might break down when the list of functions is specified programmatically so perhapsc("F1.a", "F1.b", "F2.a", "F2.b")
is more predictable.
Also see this discussion with Hadley (https://github.com/tidyverse/dtplyr/issues/173) on translating dplyr::across to data.table syntax for the dtplyr package.
Note that the dplyr across allows arbitrary specification of how the function name and input column names are combined to determine how the output columns are named (specifying both order and the separator) but I'm not sure that's a road we want to go down (see the names argument here: https://dplyr.tidyverse.org/reference/across.html) . Sticking with base R's naming behavior (e.g. c(A=list(a=3,2), B=list(a=1,b=2))
) will be much easier to maintain since naming in across will just rely directly on how naming works for x[, c(A=lapply(.SD), B=lapply(.SD))]
(once #4883 is merged) without any additional magic code unique to across.
Inspired by
dplyr::across
and triggered by https://github.com/JuliaData/DataFrames.jl/pull/2725#discussion_r622424355Instead of :
it could be
I didn't find any related issues or PRs in a quick search. If there are any, and S.O. questions, please link them here.