Open arunsrinivasan opened 10 years ago
@arunsrinivasan I think the title is misleading, it should be "x[i, j, by=.] where i
is a data.table and j
contains expressions from i
should/could work as wel".
That is, change the last x
to i
, because you can already use x variables in the j expression
@juancentro oops, right! Corrected now. Thanks again.
Do I understand it right? You want to be able to access columns of the data.table passed as i parameter in the j parameter while you use a "group by"?
Any plans to implement this feature since it seems to be quite natural?
Compare it to SQL's "select sum( i.valb) from x inner join i on x.idx = i.idx group by x.fidx"!
This is already doable. If the names from x and i don't clash, you can just
use the column names. Otherwise, prefix with i.
On Fri, Nov 6, 2015, 20:11 aryoda notifications@github.com wrote:
Any plans to implement this feature since it seems to be quite natural?
Compare it to SQL's "select sum( i.valb) from x inner join i on x.idx = i.idx group by x.fidx"!
— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154572285 .
Which column name clash causes this error message (none of the used column names is ambigious)?
# (2)
dta[dtb, sum(valb), by=fidx]
# Error in `[.data.table`(dta, dtb, sum(valb), by = fidx) :
# object 'valb' not found
How could a corrected version of this code could look like using data.table 1.9.6++?
In which version are you trying? This was tested in 1.9.2 I think
On Sat, Nov 7, 2015, 06:23 aryoda notifications@github.com wrote:
Which column name clash causes this error message (none of the used column names is ambigious):
(2)dta[dtb, sum(valb), by=fidx]# Error in
[.data.table
(dta, dtb, sum(valb), by = fidx) : # object 'valb' not found— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154668301 .
I tested this with data.table 1.9.4 using R3.2.2 running in Ubuntu 64 bit, then updated to data.table 1.9.6 and got the same error message (object 'valb' not found).
# sessionInfo() excerpt:
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
...
other attached packages:
[1] data.table_1.9.6
I see the same thing as aryoda on (some iteration of) 1.9.7. by=.EACHI
works fine, but nothing else (not even by="idx"
, the merging column).
Now I remember. This is expected. Before the introduction of EACHI, the same behavior was generated using a data.table in the i-argument (by-without-by). Now you have to be explicit about the grouping (using EACHI)
On Sat, Nov 7, 2015, 11:27 franknarf1 notifications@github.com wrote:
I see the same thing as aryoda on (some iteration of) 1.9.7. by=.EACHI works fine, but nothing else (not even by="idx" (the merging column).
— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154709881 .
I understand the purpose of .EACHI but it is difficult to explain what I want (and why). Let's try it:
The help function explains the purpose of .EACHI quite good:
"When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups in 'DT' that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.".
The application of .EACHI requires a n:1 [sic! it should be: 1:n, see comment of franknarf below] relationship between x/dta/DT and the data.table in i otherwise grouping makes no sense because it works on a row-by-row base then.
What I want to do is: I want to join/merge x/dta/DT and i (data tables dta and dtb) into one new data.table and x/dta/DT and i have a 0..1:1 relationship (I want to filter by i/dtb and "enrich" the remaining rows of x/dta/DT with the columns of i/dtb and group them by a colum of i/dtb).
And I want to do this in an memory efficient way (not materializing the complete join and then do the grouping as it would happen with pipelining:
> dta[dtb][, sum(valb), by=fidx]
fidx V1
1: a 10
2: NA 11
So I would expect it a natural behaviour of the data.table that this syntacical variant does the same:
dta[dtb, sum(valb), by=fidx]
And: Yes, this is possibly a feature request, not a bug (as the existing label to this "issue" already implies)
Ok, then you dont need to use by or EACHI, simply do dta[dtb, cola := colb]
On Sat, Nov 7, 2015, 13:08 aryoda notifications@github.com wrote:
I understand the purpose of .EACHI but it is hard explain what I want (and why). Let's try it:
The help function explains the purpose of .EACHI quite good:
"When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups in 'DT' that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.".
The application of .EACHI requires a n:1 relationship between x/dta/DT and the data.table in i otherwise grouping makes no sense because it works on a row-by-row base then.
What I want to do is: I want to join/merge x/dta/DT and i (data tables dta and dtb) into one new data.table and x/dta/DT and i have a 0..1:1 relationship (I want to filter by i/dtb and "enrich" the remaining rows of x/dta/DT with the columns of i/dtb).
And I want to do this in an memory efficient way (not materializing the complete join and then do the grouping as it would happen with pipelining:
dta[dtb][, sum(valb), by=fidx] fidx V11: a 102: NA 11
— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/733#issuecomment-154719023 .
OK, I forgot one important thing (I have edited my comment above):
I want to filter by i/dtb and "enrich" the remaining rows of x/dta/DT with the columns of i/dtb ... and group them by one or more colums of i/dtb
@aryoda I don't follow what you're saying about n:1 being mandatory for the x:i when using .EACHI
:
x1 = data.table(a=1:2)
y1 = data.table(a=rep(1:3,each=2), b=rnorm(6))
x1[y1, bmin := min(b), on="a", by=.EACHI]
That is 1:n, right? Anyway, I agree about
And I want to do this in an memory efficient way (not materializing the complete join and then do the grouping as it would happen with pipelining:
and think it's why this issue is still open.
@franknarf1:
You are right, .EACHI is for 1:n, not n:1 (it does not work like SQL). Thx for pointing this out!
Oh, sorry, I meant that as a counterexample to it being only n:1. It can be n:1:
x2 = data.table(a=rep(1:2,each=2))
y2 = data.table(a=1:2, b=4)
x2[y2, d := b, on="a", by=.EACHI]
works.
I don't think there is some 1:n or n:1 restriction.
@franknarf1
Sorry, now I am totally confused (perhaps already off-topic ;-):
x1 = data.table(a=1:2)
y1 = data.table(a=rep(1:3,each=2), b=rnorm(6))
produces
> x1[y1, min(b), on="a", by=.EACHI]
a V1
1: 1 0.89748590
2: 1 0.54633354
3: 2 -0.30981258
4: 2 -0.07317263
5: 3 -1.37440167
6: 3 0.17736975
> y1[x1, min(b), on="a", by=.EACHI]
a V1
1: 1 0.5463335
2: 2 -0.3098126
That's why I thought it does NOT work like SQL (where it doesn't matter if I join x1 with y1 or y1 with x1 - the result of the expressions is always the same).
It does matter in SQL if you use outer join, if it is left or right outer. data.table uses right outer join by default to be consistent with R subsetting. You can use sqlfiddle to easily produce SQL examples that may help in that discussion. BTW. I don't understand how the n:1
refers to SQL.
Getting sidetracked further. Ok, I guess we were using different terminology and can leave it at that.
Just to explain my reading: I read n:1 as many-to-one, as defined here: http://mathworld.wolfram.com/Many-to-One.html And 1:n is the opposite. I'm not very SQL literate, but this is also the sense in which it is used in Stata's merge command (which I have to use on a daily basis, so it drives my thinking).
@jangorecki:
data.table uses right outer join by default to be consistent with R subsetting
Thx, that is exactly the (shortest possible :-) explanation
Update the SO Q linked in https://github.com/Rdatatable/data.table/issues/1519 once implemented.
Definitely keen to see this. Have just been pointed to this request by @Frank after posting on stackoverflow http://stackoverflow.com/questions/36974145/r-data-table-group-by-attribute-of-i-table?noredirect=1#comment61505368_36974145.
SO q to update/answer after this feature is implemented: http://stackoverflow.com/questions/37465117/r-data-table-performing-a-by-aggregation-on-a-joined-column
This request would also solve my SO question here (added because it might provide clarity): http://stackoverflow.com/questions/38824705/using-on-and-by-to-compute-a-new-variable-from-two-data-tables
Wrote up this FR, then thought of a search term that might work better, and came across this issue. I think I'm requesting the same thing as everyone else here, but in case not, here's my writeup so it can be split out to a new issue (seems a waste to delete at this point):
Akrun and I were going back and forth on SO on this, and it seems like this is something that I assumed was supported but does not appear to be and requires some slightly awkward workarounds
https://stackoverflow.com/questions/47150744/data-table-aggregate-join-and-assign-by-reference/
To summarize that discussion here, consider the two datasets,
set.seed(24)
dta = data.table(i=1:4, x=rnorm(4))
dts = data.table(i=rep(1:3, each=3), z=runif(9))
where I want to aggregate dts
and assign to dta
.
I had thought
dta[dts, sum_z := sum(i.z), by=.(i), on=.(i)]
would work, but it doesn't. I'm actually really quite surprised by this, as it seems to fit very naturally with my conception of the syntax and how data.table works. Though perhaps my mental model of what is happening is not quite right (not surprising, given that I have never tried learning the internals). Note that using by=.EACHI
does not produce the correct result. Instead, some kind of nested expression has to be used
dta[dts[, .(z=sum(z)), keyby=i], sum_z1 := i.z, on=.(i)]
dta[dts[dta, .(z=sum(z)), by=.EACHI, on=.(i)], sum_z2 := i.z, on=.(i)]
dta[, sum_z3 := dts[.SD, .(z=sum(z)), by=.(i), on=.(i)]$z]
dts
contains many observations not in dta
, may be inefficientdts
has join observations not in dta
, but requires two joins (though maybe its very efficient if everything ends up sorted appropriately)dta
, but seems harder to read and is quite fragile. e.g. if I initialize dta
with i
out of order, this just doesn't work, even after re-ordering.> dta = data.table(i=c(4,1,3,2), x=rnorm(4))
> dta[, sum_z3 := dts[.SD, .(z=sum(z)), by=.(i), on=.(i)]$z]
Error in set(i, j = lc, value = newval) :
.SD is locked. Updating .SD by reference using := or set are reserved for future use. Use := in j directly. Or use copy(.SD) as a (slow) last resort, until shallow() is exported.
> setkey(dta, i)
> dta[, sum_z3 := dts[.SD, .(z=sum(z)), by=.(i), on=.(i)]$z]
Error in set(i, j = lc, value = newval) :
.SD is locked. Updating .SD by reference using := or set are reserved for future use. Use := in j directly. Or use copy(.SD) as a (slow) last resort, until shallow() is exported.
when trying to figure out what was happening and before I played with the workarounds, I think I read somewhere that it is assumed that the i
in the x[i, j, by]
syntax is assumed to be the smaller table and the join is computed based on that. That's a strong (and imo weird) assumption that I think is trickling in to why this doesn't Just Work. I guess it is motivated by the notion of the i
being a subset operator, but when you start thinking about things in terms of joins and aggregations, that analogy falls apart.
Finally, while it is natural to join and aggregate on the same column, it's not too hard to think of situations where you might want to join on one set of columns (say, city and state), and then aggregate on another (say, county).
(If you couldn't tell, this is a feature request to support the syntax proposed above)
Don't think it's necessary, but in case you were wondering in what version those .SD errors were occurring:
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.2
@mattdowle flagging here that this is something that will come up in the Data Camp course on joins with data.table. It's non-critical / non-blocking, but would be nice to to have implemented.
I have just missed this feature. I also found a recent question about this problem at the end of this SO question. It would be nice to have this feature implemented.
Please use upvotes to keep this long standing issue more readable
Referring to this SO post from @juancentro:
It would be nice to have case (2) returning the right result as well.
That is, as @juancentro and @eantonya mentioned:
would be great!
Thanks to @juancentro for reporting on SO.