arunsrinivasan opened this issue 9 years ago (status: Open)
I think it's a very important request.
I'm not convinced this is needed, especially not col-wise. `lapply` already works well; I don't think we need another wheel for that. For row-wise we already have `apply`, `rowStats` in the fUtilities package, the `row*` functions in the matrixStats package, etc. I don't think this needs to be reinvented in data.table.
@izahn nobody is reinventing anything here. It is a necessity. I'm not sure why you're getting worked up.

On `rowwise()`:

- `apply()` doesn't cut it because it converts the input object to a matrix first, and that's an absolute waste. We want to be able to do things quite efficiently, and definitely anything is more efficient than having to allocate memory just to create a matrix out of the same data!
- `rowStats()`: I'm not aware of this package; good to know. But if it works on matrices, then it's a no-go as well, because it comes back to the matrix-conversion point above. And even if it works on data.frames, the issue is that we won't be able to escape `eval()` from the C side by using functions from other packages. Evaluating a function for each row is costly, and we'd most certainly want to avoid it.

On `colwise()`: data.table has always tried to use base R functions whenever possible; that is the reason we have not implemented any such functions until now. But there have been feature requests / questions for functionality like `dplyr::summarise_each()`, and there is no equivalent for it in base R. No, `lapply()` and `mapply()` / `Map()` both don't cut it. Unless there is a way to apply multiple functions, each, to multiple columns using base R's apply family that looks as clean as a simple `lapply()`, there seems to be no reason not to implement this functionality.
Even here, avoiding the `eval()` cost is a priority. The GForce family of functions in data.table (inspired by dplyr's hybrid evaluation) and the hybrid-evaluation family of functions in dplyr do precisely that. Having our own versions also helps with parallelising at a later point (something Romain touched upon in his keynote speech at UseR'15). We'd like to extend this to more common functions so that the implementations are as efficient as possible. This is tied closely to the philosophy of data.table.
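To make the GForce point concrete, here is a small illustration with made-up toy data (the exact `verbose` wording varies across data.table versions):

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = c(1, 2, 5))

# For simple j expressions such as mean(x), data.table substitutes an internal
# grouped implementation (GForce) instead of eval()ing mean() once per group;
# verbose = TRUE reports the substitution in its output.
res <- dt[, .(m = mean(x)), by = g, verbose = TRUE]
res$m  # 1.5 5
```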
Out of curiosity, do you use the data.table package directly or through dplyr? If you do use it directly, I can't think of a reason why you'd not want this feature.. :-O
@arunsrinivasan I'm not worked up at all, sorry if I did something to give you that impression. As background (and because you asked about dplyr), I was a happy user of plyr, but when dplyr came around I was dismayed by the direction it took of completely re-implementing data manipulation in R. I much prefer the data.table approach of sticking with idioms that are applicable everywhere to the walled-garden approach of dplyr. I'm just trying to encourage data.table developers not to go down the dplyr road of creating suites of functions that only work well inside the workflow dictated by the package, and to encourage reuse of existing functions and idioms.
@izahn thanks for clarifying. I really did read your previous message differently. Sorry about that.
I agree with you entirely on (not forcing) the walled-garden approach. Unfortunately there's no clean equivalent for applying multiple functions to multiple columns each.. (or at least I'm not aware of one). For example:

```r
dt[, c(lapply(.SD, mean), lapply(.SD, sum)), by=z]
```

requires good knowledge of base R (which I don't think everyone cares for these days). And then there's the readability issue that people are quite worried about these days (as opposed to understandability).

I'll try asking on SO or R-help if there's a way using base R (or let me know if you can think of one). Otherwise, I'm not sure it's possible to avoid this, since users really require this functionality.
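For reference, that idiom does run as-is; a toy example with made-up data:

```r
library(data.table)

dt <- data.table(z = c("a", "a", "b"), v1 = 1:3, v2 = c(10, 20, 30))

# both lapply() calls are concatenated with c(); note the duplicated result
# names: v1 and v2 each appear once for the means and once for the sums
res <- dt[, c(lapply(.SD, mean), lapply(.SD, sum)), by = z]
names(res)  # "z" "v1" "v2" "v1" "v2"
```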
Thinking a bit more about this, I think a function like `each()` (from plyr) might do the job..

```r
dt[, lapply(.SD, each(sum, mean)), by=z]
```

for example. The extra arguments can go after `each` and will be passed to all the functions.

In this case, `rowwise()` reduces to:

```r
dt[, lapply(.SD, rowwise(...))] # Edit: hm.. this isn't quite right, really.
```

and perhaps these are easy to query-optimise internally.
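The idea can be sketched in a few lines of plain R. This is a hypothetical `each()`-style combinator written for illustration, not plyr's actual implementation:

```r
# hypothetical each()-style combinator: takes several functions and returns
# one function that applies them all and names the results after them
each <- function(...) {
  fns <- list(...)
  # label results with the deparsed function expressions (sum, mean, ...)
  names(fns) <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  function(x, ...) sapply(fns, function(f) f(x, ...))
}

each(sum, mean)(1:4)
#  sum mean
#   10  2.5
```

Extra arguments pass through to every function, e.g. `each(sum, mean)(c(1, NA), na.rm = TRUE)`.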
Thank you for following up @arunsrinivasan. I like the `each` idea. I'll follow up with some other ideas when I get back to a computer in the morning.
On Jul 7, 2015 5:06 PM, "Arun" notifications@github.com wrote:

> Thinking a bit more about this, I think a function like `each()` (from plyr) might do the job..
>
> `dt[, lapply(.SD, each(sum, mean)), by=z]`
>
> for example. The extra arguments can go after `each` and will be passed to all functions.
>
> In this case, `rowwise()` reduces to:
>
> `dt[, lapply(unlist(.SD), each(...)), by=1:nrow(dt)]`
>
> and perhaps these are easy to query-optimise internally.
I somewhat agree with @izahn. I think I would prefer if you could use `apply` for the syntax, e.g., `dt[, apply(.SD, 2, function(x) c(mean(x), sd(x))), by = z]` and `dt[, apply(.SD, 1, function(x) c(mean(x), sd(x))), by = z]`. `apply` would need to become a generic with a data.table method, which could then be optimized for specific functions. A slight syntax improvement would be `dt[, apply(.SD, 2, list(mean, sd)), by = z]`, which would deviate only slightly from base `apply`. I don't know how difficult this would be to implement, though.
@ecoRoland `apply(.SD, 1, function(x) ...)` is somewhat limiting, since it implies that all columns are converted to the same class (so you can't have `x[1]` be character and `x[-1]` numeric, for example). Even if that were left out of `apply.data.table` (so that the function could act on a list instead of an atomic vector), I feel that would be too big a departure from base R.
I'd rather see (more) optimization behind the scenes (as is already done for `mean`, etc.) and fewer new functions. I agree that a multiple-function version of `lapply` would be great, but `colwise` essentially *is* `lapply`, and `rowwise` could be triggered by `by=.I`.
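For illustration, a `by = .I`-style row-wise computation can already be written by grouping on a per-row index; in this sketch (toy data) `seq_len(nrow(dt))` stands in for `.I`:

```r
library(data.table)

dt <- data.table(v1 = 1:3, v2 = c(10L, 20L, 30L))

# one group per row: the grouping vector seq_len(nrow(dt)) makes j run row-wise
dt[, row_sum := sum(unlist(.SD)), by = seq_len(nrow(dt)), .SDcols = c("v1", "v2")]
dt$row_sum  # 11 22 33
```

Note this still `eval()`s `j` once per row, which is exactly the per-group overhead an optimized rowwise would remove.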
Dear Arun, these functions in data.table would definitely be a great asset. I used matrixStats to compute medians over a remote-sensing time series, but the conversion to a matrix is computationally demanding and memory-costly (especially if you consider parallelization). Moreover, matrixStats does not handle quantiles or interquartile ranges particularly well, so I had to use another library from the Bioconductor repository, called WGCNA. Its function `rowQuantileC` is quite fast and efficient (it can manage NA values), but again the conversion to a matrix is a pitfall. In my view, these row-wise functions should be programmed in C, as base R functions can't manage this efficiently (i.e. millions of rows and columns across multiple dimensions), which data.table can.
> `apply` would need to become a generic with a data.table method

This is a matter of substituting the `apply` call with our internal `rowwise`.

> `rowwise` could be triggered by `by=.I`

What if `j` does not use multiple columns in a single function call, like `j=.(v1=sum(v1), v2=mean(v2))`? I know it doesn't make sense for `by=.I`, but it is still a valid query that should not be optimized to rowwise, while this one should be: `j=.(v1_v2=sum(v1, v2))`.
In terms of API, the simplest but still usable option would be to add a new function, let's say `rowapply` (we could catch `apply(MARGIN=1)` and redirect), which would be well optimized for common functions. It would be tricky to make it work for an arbitrary R function, as we don't know whether that function accepts a vector or a list. In the first case, all values have to be coerced to the same type and copied into a new vector; in the latter case, they could eventually be referenced. But how can we know whether a function expects a list or a vector? `lapply` doesn't have to deal with different data types.
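The vector-vs-list dilemma can be sketched with two hypothetical helpers (not a proposed data.table API; `.mapply` is base R and iterates over the columns in parallel):

```r
library(data.table)

# vector flavour: each row is coerced to a common type via c() before FUN sees it
rowapply_vec <- function(DT, FUN) unlist(.mapply(function(...) FUN(c(...)), DT, NULL))

# list flavour: each column keeps its own type; FUN receives the row as a list
rowapply_list <- function(DT, FUN) .mapply(function(...) FUN(list(...)), DT, NULL)

dt <- data.table(v1 = 1:3, v2 = c(10, 20, 30))
rowapply_vec(dt, sum)  # 11 22 33
```

The list flavour is what would let a mixed-type row (say, character plus numeric) reach `FUN` without coercion.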
How should rowwise/rowapply work when no `by` is specified? Is

```r
dt[, rowapply(.SD, sum), .SDcols=v1:v2]
```

the same as `dt[, v1+v2]`?
IMO, most rowwise questions could be better answered by `melt` followed by grouping.
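A small sketch of that melt-then-group pattern, on made-up data:

```r
library(data.table)

dt <- data.table(id = c("a", "b"), v1 = c(1, 3), v2 = c(10, 30))

# reshape to long form: one row per (id, variable) pair, values unified in 'value'
long <- melt(dt, id.vars = "id")

# "row-wise" statistics become ordinary grouped aggregations
long[, .(row_sum = sum(value), row_mean = mean(value)), by = id]
```

No matrix conversion is involved, and the grouped `j` can benefit from the usual internal optimizations.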
Hey @jangorecki, how easy would it be to just wrap this into the `roll` functionalities? With window size 0, maybe?
Another example for colwise from SO: https://stackoverflow.com/questions/57386580/what-is-the-equivalent-of-mutate-at-dplyr-in-data-table. The OP wants to apply multiple functions to a set of columns and have the results appear in a particular order with a particular naming convention.
One suggestion from a user: it would be super convenient if one could apply one function to a group of variables and another function to another group, similar to this Stata syntax:

```
collapse (mean) var1 var2 (sum) var3 var4, by(group)
```

My suggestion would be to allow the `.SDcols` argument to take a list, similar to the `measure.vars` argument in `melt.data.table`. Maybe this could also work with the `patterns` function. I.e., like this:

```r
D[, lapply(.SD, colwise(mean, sum)), .SDcols = .(patterns("x"), patterns("y"))]
```
Your proposed syntax diverges from base R syntax too much, IMO. Wouldn't this do?

```r
x_cols = grep(...)
y_cols = grep(...)
D[, c(lapply(.SD[, x_cols], mean), lapply(.SD[, y_cols], sum)), .SDcols = c(x_cols, y_cols)]
```

No new magic needed inside DT, and magic usually comes at the cost of consistency.
Yes, but I thought subsetting columns in `.SD` is supposed to be avoided, and this might be cumbersome for more than 2 functions. However, I don't feel qualified to comment on the consistency issue :0
For more functions it can be done with a helper function. If you are worried about the overhead of subsetting `.SD`, you can do it like this:

```r
D[, {
  sd = unclass(.SD)
  c(lapply(sd[x_cols], mean), lapply(sd[y_cols], sum))
}, .SDcols = c(x_cols, y_cols)]
```
Does it bother anyone else that the `lapply(.SD)` optimization substitution eats the named arguments of `c`? This should be a simple, optimized, "base-R" way to get column names that are distinguishable in the presence of an arbitrary number of functions, but the names go nowhere:

```r
library(data.table)
mtcarsdt <- as.data.table(mtcars)
mtcarsdt[, c(mean=lapply(.SD, mean), sum=lapply(.SD, sum)), by="cyl", .SDcols=3:5]
#>    cyl     disp        hp     drat   disp   hp  drat
#> 1:   6 183.3143 122.28571 3.585714 1283.2  856 25.10
#> 2:   4 105.1364  82.63636 4.070909 1156.5  909 44.78
#> 3:   8 353.1000 209.21429 3.229286 4943.4 2929 45.21

# how it should work, according to base R's named concatenation of named lists
c(A=list(a=1:3, b=1:3), B=list(a=1:3, b=1:3))
#> $A.a
#> [1] 1 2 3
#>
#> $A.b
#> [1] 1 2 3
#>
#> $B.a
#> [1] 1 2 3
#>
#> $B.b
#> [1] 1 2 3
```

Created on 2021-01-30 by the reprex package (v0.3.0)
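One workaround today (just a sketch, and note it gives up the internal `lapply(.SD)` optimization because of the braces) is to construct the names yourself inside `j`:

```r
library(data.table)

mtcarsdt <- as.data.table(mtcars)

# build mean.* / sum.* names explicitly; the { ... } body is evaluated as-is,
# so the lapply(.SD) substitution (which drops the names) never runs
res <- mtcarsdt[, {
  m <- lapply(.SD, mean)
  s <- lapply(.SD, sum)
  setNames(c(m, s), c(paste0("mean.", names(m)), paste0("sum.", names(s))))
}, by = cyl, .SDcols = 3:5]
names(res)
#> [1] "cyl" "mean.disp" "mean.hp" "mean.drat" "sum.disp" "sum.hp" "sum.drat"
```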
@myoung3 I think that issue https://github.com/Rdatatable/data.table/issues/2311 covers it (?)
The idea is to implement all possible functions in C to directly operate on lists. Implementations should cover both row- and column-wise operations on lists/data.tables. Among other things, this will avoid repeated `lapply` calls, which make it tedious to aggregate using multiple functions.