Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Set up rowwise() and colwise() functions for use in .SD #1063

Open arunsrinivasan opened 9 years ago

arunsrinivasan commented 9 years ago

The idea is to implement all possible functions in C to directly operate on lists. To begin with, maybe:

Implementations should be for both row- and column-wise for lists/data.tables.

This'll enable us to:

DT[, rowwise(.SD, <functions>)]
DT[, colwise(.SD, <functions>)]

This will:
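For concreteness, the two proposed calls could be emulated with today's idioms roughly like this (rowwise() and colwise() themselves do not exist; this is only a sketch of the intended semantics):

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))

# colwise(.SD, sum): apply a function down each column
colwise_sums <- DT[, lapply(.SD, sum)]   # a = 6, b = 15

# rowwise(.SD, sum): apply a function across each row.
# Reduce() avoids the matrix coercion that apply(.SD, 1, sum) incurs,
# but only works for functions with a binary accumulator like `+`
rowwise_sums <- DT[, Reduce(`+`, .SD)]   # 5, 7, 9
```

A general rowwise() would have to evaluate arbitrary functions per row, which is exactly the part that wants a C implementation.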

skanskan commented 9 years ago

I think it's a very important request.

izahn commented 9 years ago

I'm not convinced this is needed, especially not col-wise. lapply already works well, I don't think we need another wheel for that. For row-wise we already have apply, rowStats in the fUtilities package, row* functions in the matrixStats package, etc. etc. I don't think this needs to be reinvented in data.table.

arunsrinivasan commented 9 years ago

@izahn nobody is reinventing anything here. It is a necessity. I'm not sure why you're getting worked up.

  1. rowwise() - apply() doesn't cut it because it converts the input object to a matrix first. And that's an absolute waste. We want to be able to do things quite efficiently. And definitely anything is more efficient than having to allocate memory to just create a matrix out of the same data!
  2. rowStats() - I'm not aware of this package. Good to know. But if it works on matrices, then it's a no-go as well. Because it comes back to (1). And even if it works on data.frames, then the issue is that we won't be able to escape eval() from C-side by using those functions from other packages. Evaluating functions for each row is costly. And we'd most certainly want to avoid it.
  3. colwise() - data.table has always tried to use base R functions whenever possible. That is the reason why we have not had any such functions implemented until now. But there have been feature requests / questions for functionality like dplyr::summarise_each(). There's no equivalent for this in base R. No, lapply() and mapply() / Map() both don't cut it. Unless there is a way to apply multiple functions, each, to multiple columns using base R's apply family, that looks as clean as a simple lapply(), there seems to be no reason to not implement this functionality.

    Even here, avoiding eval() cost is a priority. The GForce family of functions in data.table (inspired by dplyr's hybrid evaluation) or the hybrid evaluation family of functions in dplyr do precisely that. And having own versions also helps parallelising at a later point. (Something that Romain touched upon in his keynote speech at UseR'15). We'd like to extend it to more common functions so that the implementations are as efficient as possible. This is tied closely to the philosophy of data.table.

Out of curiosity, do you use the data.table package directly or through dplyr? If you use it directly, I can't think of a reason why you'd not want this feature.. :-O
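Point (1) is easy to demonstrate: apply() coerces a data.frame to a matrix before calling FUN, so mixed-type rows collapse to their common type (here, character):

```r
df <- data.frame(id = c("x", "y"), v1 = c(1, 2), v2 = c(3, 4))

# apply() runs as.matrix() on df first, so every row arrives in FUN
# as a character vector -- the numeric columns are silently converted
row_classes <- apply(df, 1, function(row) class(row))

# operating on the underlying list avoids the coercion and the copy
col_sums <- sapply(df[-1], sum)   # v1 = 3, v2 = 7
```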

izahn commented 9 years ago

@arunsrinivasan I'm not worked up at all, sorry if I did something to give you that impression. As background (and because you asked about dplyr): I was a happy user of plyr, but when dplyr came around I was dismayed by the direction it took of completely re-implementing data manipulation in R. I much prefer the data.table approach of sticking with idioms that are applicable everywhere over the walled-garden approach of dplyr. I'm just trying to encourage the data.table developers not to go down the dplyr road of creating suites of functions that only work well inside the workflow dictated by the package, and to encourage reuse of existing functions and idioms.

arunsrinivasan commented 9 years ago

@izahn thanks for clarifying. I really read your previous message differently. Sorry about that.

I agree with you entirely on (not forcing) the walled garden approach. Unfortunately there's not a clean equivalent for applying multiple functions to multiple columns each.. (or I'm at least not aware of it). For example:

dt[, c(lapply(.SD, mean), lapply(.SD, sum)), by=z]

requires good knowledge of base R (which I don't think everyone cares for these days). And then there's the readability issue that people are quite worried about these days (as opposed to understandability).

I'll try asking on SO or R-help if there's a way using base-R (or let me know if you can think of it). Otherwise, I'm not sure if it's possible to avoid it, since users really require this functionality.
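For reference, the idiom in question, made runnable with illustrative data, plus a setNames() variant that at least disambiguates the result columns:

```r
library(data.table)

dt <- data.table(z = c("a", "a", "b"), x = 1:3, y = 4:6)

# one mean and one sum per column, per group; column names repeat (x, y, x, y)
res1 <- dt[, c(lapply(.SD, mean), lapply(.SD, sum)), by = z]

# naming the results by hand keeps the output readable
res2 <- dt[, c(setNames(lapply(.SD, mean), paste0(names(.SD), "_mean")),
               setNames(lapply(.SD, sum),  paste0(names(.SD), "_sum"))),
           by = z]
```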

arunsrinivasan commented 9 years ago

Thinking a bit more about this, I think a function like each() (from plyr) might do the job..

dt[, lapply(.SD, colwise(sum, mean)), by=z]

for example. The extra arguments can go after each and that'll be passed to all functions.

In this case, the rowwise() reduces to:

dt[, lapply(.SD, rowwise(...))] # Edit: hm.. this isn't quite right, really.

and perhaps these are easy to query optimise internally.
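A minimal sketch of such an each()-style combinator in plain R (illustrative only; the real implementation would live in C and avoid the per-call eval() overhead discussed above):

```r
# each(f, g, ...) returns a function that applies every supplied
# function to its input and returns the results as a named list,
# using the argument expressions as names
each <- function(...) {
  fns <- list(...)
  names(fns) <- vapply(as.list(substitute(list(...)))[-1], deparse, "")
  function(x) lapply(fns, function(f) f(x))
}

each(sum, mean)(1:4)   # list(sum = 10, mean = 2.5)
```

Used as lapply(.SD, each(sum, mean)), this yields a nested list per column; flattening that into result columns is part of what the internal optimization would have to handle.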

izahn commented 9 years ago

Thank you for following up @arunsrinivasan. I like the each idea. I'll follow up with some other ideas when I get back to a computer in the morning.

> Thinking a bit more about this, I think, function like each() (from plyr) might do the job..
>
> dt[, lapply(.SD, each(sum, mean)), by=z]
>
> for example. The extra arguments can go after each and that'll be passed to all functions.
>
> In this case, the rowwise() reduces to:
>
> dt[, lapply(unlist(.SD), each(...)), by=1:nrow(dt)]
>
> and perhaps these are easy to query optimise internally.


ecoRoland commented 9 years ago

I somewhat agree with @izahn. I think I would prefer it if you could use apply syntax, e.g., dt[, apply(.SD, 2, function(x) c(mean(x), sd(x))), by = z] and dt[, apply(.SD, 1, function(x) c(mean(x), sd(x))), by = z]. apply would need to become a generic with a data.table method, which could then be optimized for specific functions. A slight syntax improvement would be dt[, apply(.SD, 2, list(mean, sd)), by = z], which would deviate only slightly from base `apply`. I don't know how difficult this would be to implement, though.

franknarf1 commented 9 years ago

@ecoRoland apply(.SD,1,function(x) ...) is somewhat limiting, since it implies that all columns are converted to the same class (so you can't have x[1] be character and x[-1] numeric, for example). Even if that were left out of apply.data.table (so that the function could act on a list instead of an atomic vector), I feel like that would be too big a departure from base R.

I'd rather see (more) optimization behind the scenes (as is already done for mean, etc.) and fewer new functions. I agree that a multiple-function version of lapply would be great, but

FSantosCodes commented 6 years ago

Dear Arun, definitely these functions in data.table would be a great asset. I used matrixStats to compute medians over a remote-sensing time series, but the conversion to matrix is computationally demanding and memory costly (especially if you consider parallelization). Moreover, matrixStats does not handle quantiles or interquartile ranges very well, so I had to use another library from the Bioconductor repository called WGCNA. Its function 'rowQuantileC' is quite fast and efficient (it can manage NA values), but again the conversion to matrix is a pitfall. In my view, these row-wise functions should be programmed in C, as base R functions can't manage this efficiently (i.e. millions of rows and columns times multiple dimensions), which data.table can.

MichaelChirico commented 6 years ago

update when added:

jangorecki commented 5 years ago

> apply would need to become a generic with a data.table method

This is a matter of substituting the apply call with our internal rowwise.

rowwise could be triggered by by=.I.

What if j does not use multiple columns in a single function call, like j=.(v1=sum(v1), v2=mean(v2))? I know it doesn't make sense for by=.I, but it is still a valid query that should not be optimized to rowwise, while j=.(v1_v2=sum(v1, v2)) should be.

In terms of API, the simplest but still usable option would be to add a new function, let's say rowapply (we could catch apply(MARGIN=1) and redirect), which would be well optimized for common functions. It would be tricky to make it work for an arbitrary R function, as we don't know whether that function accepts a vector or a list. In the first case all values have to be coerced to the same type and copied into a new vector; in the latter case the columns could simply be referenced. But how can we know whether a function expects a list or a vector? lapply doesn't have to deal with different data types.

How should rowwise/rowapply work when no by is specified? E.g., should

dt[, rowapply(.SD, sum), .SDcols=v1:v2]

be equivalent to

dt[, v1+v2]

?

IMO most rowwise questions could be better answered by melt followed by grouping.
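The melt-then-group alternative, spelled out with illustrative data:

```r
library(data.table)

dt <- data.table(id = 1:3, v1 = c(1, 2, 3), v2 = c(4, 5, 6))

# reshape to long, then group by the row identifier: row-wise stats
# without any matrix coercion, using only existing data.table verbs
res <- melt(dt, id.vars = "id")[, .(row_sum  = sum(value),
                                    row_mean = mean(value)), by = id]
```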

MichaelChirico commented 5 years ago

Hey @jangorecki, how easy would it be to just wrap this into the roll functionality? With window size 0, maybe?

franknarf1 commented 5 years ago

Another example for colwise from SO: https://stackoverflow.com/questions/57386580/what-is-the-equivalent-of-mutate-at-dplyr-in-data-table The OP wants to apply multiple functions to a set of columns and have the results appear in a particular order with a particular naming convention.
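For that kind of mutate_at-style request, the usual idiom today is := with .SDcols (the data, column names, and centering function below are illustrative):

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)
cols <- c("x", "y")

# add a group-centered version of each selected column,
# with a systematic naming convention
dt[, paste0(cols, "_ctr") := lapply(.SD, function(v) v - mean(v)),
   .SDcols = cols, by = g]
```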

matthiaskaeding commented 3 years ago

One suggestion from a user: it would be super convenient if one could apply one function to a group of variables and another function to another group, similar to this Stata syntax:

collapse (mean) var1 var2 (sum) var3 var4, by(group)

My suggestion would be to allow the .SDcols argument to take a list, similar to the measure.vars argument in melt.data.table. Maybe this could also work with the patterns function.

I.e., like this:

D[, lapply(.SD, colwise(mean, sum)), .SDcols = .(patterns("x"), patterns("y"))]

jangorecki commented 3 years ago

Your proposed syntax diverges too much from base R syntax, IMO. Wouldn't this do?

x_cols = grep(...)
y_cols = grep(...)
D[, c(lapply(.SD[,x_cols], mean), lapply(.SD[,y_cols], sum)), .SDcols = c(x_cols, y_cols)]

No new magic needed inside DT, and magic usually comes at the cost of consistency.
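A runnable version of that pattern with illustrative column names; mget() looks the columns up directly in j, sidestepping the .SD-subsetting question entirely:

```r
library(data.table)

D <- data.table(g = c(1, 1, 2), x1 = 1:3, x2 = 4:6, y1 = 7:9)

x_cols <- grep("^x", names(D), value = TRUE)
y_cols <- grep("^y", names(D), value = TRUE)

# mean over the x-columns, sum over the y-columns, per group
res <- D[, c(lapply(mget(x_cols), mean),
             lapply(mget(y_cols), sum)),
         by = g]
```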

matthiaskaeding commented 3 years ago

Yes, but I thought subsetting columns in .SD was supposed to be avoided, and this might be cumbersome for more than 2 functions.

However, I don't feel qualified to comment on the consistency issue :0

jangorecki commented 3 years ago

For more functions it can be done with a helper function. If you are worried about the overhead of subsetting .SD, you can do it like this:

D[, {
  sd = unclass(.SD)
  c(lapply(sd[x_cols], mean), lapply(sd[y_cols], sum))
}, .SDcols = c(x_cols, y_cols)]
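The helper alluded to above could look something like this sketch (group_apply is a hypothetical name; spec maps a function name to the columns it should be applied to):

```r
# apply a different function to each group of columns of a list/.SD;
# spec is e.g. list(mean = c("x1", "x2"), sum = "y1")
group_apply <- function(sd, spec) {
  sd <- unclass(sd)  # plain list: cheap subsetting, no data.table overhead
  res <- lapply(names(spec), function(fn)
    lapply(sd[spec[[fn]]], match.fun(fn)))
  # flatten one level; names are the column names (so a column listed
  # under two functions would need a prefix to stay unique)
  unlist(res, recursive = FALSE)
}

# usage inside j, e.g.:
# D[, group_apply(.SD, list(mean = x_cols, sum = y_cols)),
#   .SDcols = c(x_cols, y_cols)]
```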

myoung3 commented 3 years ago

Does it bother anyone else that the following lapply(.SD) optimization substitution eats the named arguments of c? This should be a simple, optimized, "base-R" way to get column names that are distinguishable in the presence of an arbitrary number of functions, but the names go nowhere:

library(data.table)
mtcarsdt <- as.data.table(mtcars)
mtcarsdt[, c(mean=lapply(.SD,mean),sum=lapply(.SD,sum)), by="cyl",.SDcols=3:5]
#>    cyl     disp        hp     drat   disp   hp  drat
#> 1:   6 183.3143 122.28571 3.585714 1283.2  856 25.10
#> 2:   4 105.1364  82.63636 4.070909 1156.5  909 44.78
#> 3:   8 353.1000 209.21429 3.229286 4943.4 2929 45.21

#how it should work according to base R named concatenation of named lists
c(A=list(a=1:3,b=1:3),B=list(a=1:3,b=1:3))
#> $A.a
#> [1] 1 2 3
#> 
#> $A.b
#> [1] 1 2 3
#> 
#> $B.a
#> [1] 1 2 3
#> 
#> $B.b
#> [1] 1 2 3

Created on 2021-01-30 by the reprex package (v0.3.0)
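Until the naming issue is addressed, wrapping the results in setNames() keeps the names, at the possible cost of falling outside the optimized lapply(.SD, fun) form:

```r
library(data.table)

mtcarsdt <- as.data.table(mtcars)

# name the results explicitly; this is no longer the recognized
# c(lapply(.SD, fun), ...) shape, so it may bypass GForce optimization
res <- mtcarsdt[, c(setNames(lapply(.SD, mean), paste0("mean.", names(.SD))),
                    setNames(lapply(.SD, sum),  paste0("sum.",  names(.SD)))),
                by = "cyl", .SDcols = 3:5]
```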

franknarf1 commented 3 years ago

@myoung3 I think that issue https://github.com/Rdatatable/data.table/issues/2311 covers it (?)