Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 986 forks source link

Add syntax for "subsetting join" #2158

Open franknarf1 opened 7 years ago

franknarf1 commented 7 years ago

With X[!Y], we can subset to where X has rows not matching Y, but there is no analogue for subsetting to where X does match Y. A X[Y, nomatch=0] join will sort the result according to Y and recognize dupe rows in Y, so I need to do something like X[sort(unique(X[Y, nomatch=0, which=TRUE]))] instead (unless I'm forgetting some other way).

library(data.table)
X = data.table(id = c(3L, 1L, 2L, 1L, 1L), g = c("A", "A", "B", "B", "A"), v = (1:5)*10)
Y = data.table(id = c(1L, 1:3), g = "A")

X[Y, on=names(Y), nomatch=0] 
# gives id = 1 before 3, contrary to X's ordering
# gives id = 1 twice, reflecting Y, but the goal is to subset X

#    id g  v
# 1:  1 A 20
# 2:  1 A 50
# 3:  1 A 20
# 4:  1 A 50
# 5:  3 A 10

X[ sort(unique(X[Y, on=names(Y), nomatch=0, which=TRUE])) ]
# desired result

#    id g  v
# 1:  3 A 10
# 2:  1 A 20
# 3:  1 A 50

X[!Y, on=names(Y)]
# analogous much simpler code for not join

#    id g  v
# 1:  2 B 30
# 2:  1 B 40

So I'm looking for new syntax to make this less awkward, maybe something like

X[Y, on=names(Y), subset.join = TRUE]
# or
X[subset.join(Y), on=names(Y)]
arunsrinivasan commented 7 years ago

Are you sure you're not looking for:

Y[X, on=names(Y), nomatch=0L, mult="first"]
franknarf1 commented 7 years ago

@arunsrinivasan Ah, yeah, that does work for my original example. However, if I want to subset X2 that has different join-column names from Y, when I try to subset rows of it in this way...

X2 = setnames(copy(X), "g", "g2")
Y[X2, on=.(id, g = g2), nomatch=0, mult="first"]

   id g  v
1:  3 A 10
2:  1 A 20
3:  1 A 50

The goal was to subset X2, but now I have the wrong column names.

franknarf1 commented 7 years ago

Possible dupe of #915 , though I don't like the suggested syntax there.


Qs from SO to edit:

https://stackoverflow.com/questions/47062424/great-than-sign-and-quosure-producing-error

Could do...

temp_df[ .(as.Date("2011-12-31"), as.Date("2011-01-01")), 
  on=paste0(c(startdate, enddate), c("<=", ">="), c("V1", "V2")), subset.join = TRUE]

Mais um https://stackoverflow.com/q/50249882/

Another, I think https://stackoverflow.com/q/56999284