Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 985 forks source link

x[i, j, by=.] where `i` is a data.table and `j` contains expressions from `i` should/could work as well #733

Open arunsrinivasan opened 10 years ago

arunsrinivasan commented 10 years ago

Referring to this SO post from @juancentro:

dta <- data.table(idx=1:3, vala=4:6, fidx = c("a", "a", "b"), key="idx")
dtb <- data.table(idx=c(1,4), valb=c(10,11), key="idx")

# (1)
dta[dtb, sum(vala), by=fidx]
#    fidx V1
#1:    a  4
#2:   NA NA

# (2)
dta[dtb, sum(valb), by=fidx]
# Error in `[.data.table`(dta, dtb, sum(valb), by = fidx) : 
#   object 'valb' not found

It would be nice to have case (2) returning the right result as well.

That is, as @juancentro and @eantonya mentioned:

dta[dtb, sum(valb), by=fidx] == dta[dtb][, sum(valb), by=fidx]

would be great!

Thanks to @juancentro for reporting on SO.

juancentro commented 10 years ago

@arunsrinivasan I think the title is misleading, it should be "x[i, j, by=.] where i is a data.table and j contains expressions from i should/could work as wel". That is, change the last x to i, because you can already use x variables in the j expression

arunsrinivasan commented 10 years ago

@juancentro oops, right! Corrected now. Thanks again.

aryoda commented 9 years ago

Do I understand it right? You want to be able to access columns of the data.table passed as i parameter in the j parameter while you use a "group by"?

aryoda commented 9 years ago

Any plans to implement this feature since it seems to be quite natural?

Compare it to SQL's "select sum( i.valb) from x inner join i on x.idx = i.idx group by x.fidx"!

juancentro commented 9 years ago

This is already doable. If the names from x and i don't clash, you can just use the column names. Otherwise, prefix with i.. Please note that 1.9.4 and 1.9.6 introduced changes related to this

On Fri, Nov 6, 2015, 20:11 aryoda notifications@github.com wrote:

Any plans to implement this feature since it seems to be quite natural?

Compare it to SQL's "select sum( i.valb) from x inner join i on x.idx = i.idx group by x.fidx"!

— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154572285 .

aryoda commented 9 years ago

Which column name clash causes this error message (none of the used column names is ambigious)?

# (2)
dta[dtb, sum(valb), by=fidx]
# Error in `[.data.table`(dta, dtb, sum(valb), by = fidx) : 
#   object 'valb' not found

How could a corrected version of this code could look like using data.table 1.9.6++?

juancentro commented 9 years ago

In which version are you trying? This was tested in 1.9.2 I think

On Sat, Nov 7, 2015, 06:23 aryoda notifications@github.com wrote:

Which column name clash causes this error message (none of the used column names is ambigious):

(2)dta[dtb, sum(valb), by=fidx]# Error in [.data.table(dta, dtb, sum(valb), by = fidx) : # object 'valb' not found

— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154668301 .

aryoda commented 9 years ago

I tested this with data.table 1.9.4 using R3.2.2 running in Ubuntu 64 bit, then updated to data.table 1.9.6 and got the same error message (object 'valb' not found).

# sessionInfo() excerpt:
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
...
other attached packages:
[1] data.table_1.9.6
franknarf1 commented 9 years ago

I see the same thing as aryoda on (some iteration of) 1.9.7. by=.EACHI works fine, but nothing else (not even by="idx", the merging column).

juancentro commented 9 years ago

Now I remember. This is expected. Before the introduction of EACHI, the same behavior was generated using a data.table in the i-argument (by-without-by). Now you have to be explicit about the grouping (using EACHI)

On Sat, Nov 7, 2015, 11:27 franknarf1 notifications@github.com wrote:

I see the same thing as aryoda on (some iteration of) 1.9.7. by=.EACHI works fine, but nothing else (not even by="idx" (the merging column).

— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154709881 .

aryoda commented 9 years ago

I understand the purpose of .EACHI but it is difficult to explain what I want (and why). Let's try it:

The help function explains the purpose of .EACHI quite good:

"When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups in 'DT' that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.".

The application of .EACHI requires a n:1 [sic! it should be: 1:n, see comment of franknarf below] relationship between x/dta/DT and the data.table in i otherwise grouping makes no sense because it works on a row-by-row base then.

What I want to do is: I want to join/merge x/dta/DT and i (data tables dta and dtb) into one new data.table and x/dta/DT and i have a 0..1:1 relationship (I want to filter by i/dtb and "enrich" the remaining rows of x/dta/DT with the columns of i/dtb and group them by a colum of i/dtb).

And I want to do this in an memory efficient way (not materializing the complete join and then do the grouping as it would happen with pipelining:

> dta[dtb][, sum(valb), by=fidx]
   fidx V1
1:    a 10
2:   NA 11

So I would expect it a natural behaviour of the data.table that this syntacical variant does the same:

dta[dtb, sum(valb), by=fidx]

And: Yes, this is possibly a feature request, not a bug (as the existing label to this "issue" already implies)

juancentro commented 9 years ago

Ok, then you dont need to use by or EACHI, simply do dta[dtb, cola := colb]

On Sat, Nov 7, 2015, 13:08 aryoda notifications@github.com wrote:

I understand the purpose of .EACHI but it is hard explain what I want (and why). Let's try it:

The help function explains the purpose of .EACHI quite good:

"When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups in 'DT' that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.".

The application of .EACHI requires a n:1 relationship between x/dta/DT and the data.table in i otherwise grouping makes no sense because it works on a row-by-row base then.

What I want to do is: I want to join/merge x/dta/DT and i (data tables dta and dtb) into one new data.table and x/dta/DT and i have a 0..1:1 relationship (I want to filter by i/dtb and "enrich" the remaining rows of x/dta/DT with the columns of i/dtb).

And I want to do this in an memory efficient way (not materializing the complete join and then do the grouping as it would happen with pipelining:

dta[dtb][, sum(valb), by=fidx] fidx V11: a 102: NA 11

— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154719023 .

aryoda commented 9 years ago

OK, I forgot one important thing (I have edited my comment above):

I want to filter by i/dtb and "enrich" the remaining rows of x/dta/DT with the columns of i/dtb ... and group them by one or more colums of i/dtb

franknarf1 commented 9 years ago

@aryoda I don't follow what you're saying about n:1 being mandatory for the x:i when using .EACHI:

x1 = data.table(a=1:2)
y1 = data.table(a=rep(1:3,each=2), b=rnorm(6))

x1[y1, bmin := min(b), on="a", by=.EACHI]

That is 1:n, right? Anyway, I agree about

And I want to do this in an memory efficient way (not materializing the complete join and then do the grouping as it would happen with pipelining:

and think it's why this issue is still open.

aryoda commented 9 years ago

@franknarf1:

You are right, .EACHI is for 1:n, not n:1 (it does not work like SQL). Thx for pointing this out!

franknarf1 commented 9 years ago

Oh, sorry, I meant that as a counterexample to it being only n:1. It can be n:1:

x2 = data.table(a=rep(1:2,each=2))
y2 = data.table(a=1:2, b=4)

x2[y2, d := b, on="a", by=.EACHI]

works.

I don't think there is some 1:n or n:1 restriction.

aryoda commented 9 years ago

@franknarf1

Sorry, now I am totally confused (perhaps already off-topic ;-):

x1 = data.table(a=1:2)
y1 = data.table(a=rep(1:3,each=2), b=rnorm(6))

produces

> x1[y1, min(b), on="a", by=.EACHI]
   a          V1
1: 1  0.89748590
2: 1  0.54633354
3: 2 -0.30981258
4: 2 -0.07317263
5: 3 -1.37440167
6: 3  0.17736975
> y1[x1, min(b), on="a", by=.EACHI]
   a         V1
1: 1  0.5463335
2: 2 -0.3098126

That's why I thought it does NOT work like SQL (where it doesn't matter if I join x1 with y1 or y1 with x1 - the result of the expressions is always the same).

jangorecki commented 9 years ago

It does matter in SQL if you use outer join, if it is left or right outer. data.table uses right outer join by default to be consistent with R subsetting. You can use sqlfiddle to easily produce SQL examples that may help in that discussion. BTW. I don't understand how the n:1 refers to SQL.

franknarf1 commented 9 years ago

Getting sidetracked further. Ok, I guess we were using different terminology and can leave it at that.

Just to explain my reading: I read n:1 as many-to-one, as defined here: http://mathworld.wolfram.com/Many-to-One.html And 1:n is the opposite. I'm not very SQL literate, but this is also the sense in which it is used in Stata's merge command (which I have to use on a daily basis, so it drives my thinking).

aryoda commented 9 years ago

@jangorecki:

data.table uses right outer join by default to be consistent with R subsetting

Thx, that is exactly the (shortest possible :-) explanation

arunsrinivasan commented 8 years ago

Update the SO Q linked in https://github.com/Rdatatable/data.table/issues/1519 once implemented.

sch56 commented 8 years ago

Definitely keen to see this. Have just been pointed to this request by @Frank after posting on stackoverflow http://stackoverflow.com/questions/36974145/r-data-table-group-by-attribute-of-i-table?noredirect=1#comment61505368_36974145.

franknarf1 commented 8 years ago

SO q to update/answer after this feature is implemented: http://stackoverflow.com/questions/37465117/r-data-table-performing-a-by-aggregation-on-a-joined-column

bramvisser commented 8 years ago

This request would also solve my SO question here (added because it might provide clarity): http://stackoverflow.com/questions/38824705/using-on-and-by-to-compute-a-new-variable-from-two-data-tables

jsams commented 7 years ago

Wrote up this FR, then thought of a search term that might work better, and came across this issue. I think I'm requesting the same thing as everyone else here, but in case not, here's my writeup so it can be split out to a new issue (seems a waste to delete at this point):

Akrun and I were going back and forth on SO on this, and it seems like this is something that I assumed was supported but does not appear to be and requires some slightly awkward workarounds

https://stackoverflow.com/questions/47150744/data-table-aggregate-join-and-assign-by-reference/

To summarize that discussion here, consider the two datasets,

set.seed(24)
dta = data.table(i=1:4, x=rnorm(4))
dts = data.table(i=rep(1:3, each=3), z=runif(9))

where I want to aggregate dts and assign to dta.

I had thought

dta[dts, sum_z := sum(i.z), by=.(i), on=.(i)]

would work, but it doesn't. I'm actually really quite surprised by this, as it seems to fit very naturally with my conception of the syntax and how data.table works. Though perhaps my mental model of what is happening is not quite right (not surprising, given that I have never tried learning the internals). Note that using by=.EACHI does not produce the correct result. Instead, some kind of nested expression has to be used

dta[dts[, .(z=sum(z)), keyby=i], sum_z1 := i.z, on=.(i)]
dta[dts[dta, .(z=sum(z)), by=.EACHI, on=.(i)], sum_z2 := i.z, on=.(i)]
dta[, sum_z3 := dts[.SD, .(z=sum(z)), by=.(i), on=.(i)]$z]
  1. this seems the most natural, but if dts contains many observations not in dta, may be inefficient
  2. this takes care of unnecessary computations when dts has join observations not in dta, but requires two joins (though maybe its very efficient if everything ends up sorted appropriately)
  3. this one only requires the one join and filters on dta, but seems harder to read and is quite fragile. e.g. if I initialize dta with i out of order, this just doesn't work, even after re-ordering.
> dta = data.table(i=c(4,1,3,2), x=rnorm(4))
> dta[, sum_z3 := dts[.SD, .(z=sum(z)), by=.(i), on=.(i)]$z]
Error in set(i, j = lc, value = newval) : 
  .SD is locked. Updating .SD by reference using := or set are reserved for future use. Use := in j directly. Or use copy(.SD) as a (slow) last resort, until shallow() is exported.
> setkey(dta, i)
> dta[, sum_z3 := dts[.SD, .(z=sum(z)), by=.(i), on=.(i)]$z]
Error in set(i, j = lc, value = newval) : 
  .SD is locked. Updating .SD by reference using := or set are reserved for future use. Use := in j directly. Or use copy(.SD) as a (slow) last resort, until shallow() is exported.

when trying to figure out what was happening and before I played with the workarounds, I think I read somewhere that it is assumed that the i in the x[i, j, by] syntax is assumed to be the smaller table and the join is computed based on that. That's a strong (and imo weird) assumption that I think is trickling in to why this doesn't Just Work. I guess it is motivated by the notion of the i being a subset operator, but when you start thinking about things in terms of joins and aggregations, that analogy falls apart.

Finally, while it is natural to join and aggregate on the same column, it's not too hard to think of situations where you might want to join on one set of columns (say, city and state), and then aggregate on another (say, county).

(If you couldn't tell, this is a feature request to support the syntax proposed above)

Don't think it's necessary, but in case you were wondering in what version those .SD errors were occurring:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.2
mattdowle commented 7 years ago

Raised here as well : https://stackoverflow.com/questions/47319250/r-data-table-j-is-valid-only-when-there-is-no-by

sritchie73 commented 6 years ago

@mattdowle flagging here that this is something that will come up in the Data Camp course on joins with data.table. It's non-critical / non-blocking, but would be nice to to have implemented.

Kamgang-B commented 2 months ago

I have just missed this feature. I also found a recent question about this problem at the end of this SO question. It would be nice to have this feature implemented.

jangorecki commented 2 months ago

Please use upvotes to keep this long standing issue more readable