Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Allow a character string in `which` argument of `data.table:::[.data.table` #6496

Open Kamgang-B opened 2 months ago

Kamgang-B commented 2 months ago

This is a feature request.

I find the `which` argument confusing and counterintuitive when joining and returning row numbers of `i` with `x[i, which=NA, ...]`.

A join and the `which` argument can interact in four different ways, as shown below:

x = data.table(a=1:3, x=c(NA, 10, NA))
i = data.table(a=2:5, y=c(20, 10, 20, 30))

x
       a     x
   <int> <num>
1:     1    NA
2:     2    10
3:     3    NA

i
       a     y
   <int> <num>
1:     2    20
2:     3    10
3:     4    20
4:     5    30

x[i, on="a", which=TRUE]     # (a): ok
[1]  2  3 NA NA
x[!i, on="a", which=TRUE]    # (b): ok
[1] 1
x[i, on="a", which=NA]       # (c): counterintuitive
[1] 3 4
x[!i, on="a", which=NA]      # (d): counterintuitive
[1] 1 2

- (a): row numbers of x that i matches to.
- (b): row numbers of x that no i matches to.
- (c): row numbers of i that have no match to x. Because i is not prefixed with !, this is counterintuitive.
- (d): row numbers of i that have a match to x. The ! suggests that the rows without a match are of interest, while the result is actually the opposite.

I propose allowing a character string in `which`, with four possible values (other suggestions are very welcome): c("xmatch", "xnomatch", "imatch", "inomatch"), corresponding to scenarios (a), (b), (d), and (c), respectively. These values would work as follows:

x[i, on="a", which="xmatch"]     # row numbers of x that i matches to
x[i, on="a", which="xnomatch"]   # row numbers of x that no i matches to
x[i, on="a", which="imatch"]     # row numbers of i that have a match to x
x[i, on="a", which="inomatch"]   # row numbers of i that have no match to x

The character string would therefore determine both the type of join (that is, whether i would otherwise need the ! prefix) and which data.table's row numbers are returned.

With this feature, `data.table:::[.data.table` would behave like the helper below:

fm = function(x, i, on, which){
  switch(which,
     xmatch = x[i, on=on, which=TRUE],
     xnomatch = x[!i, on=on, which=TRUE],
     inomatch = x[i, on=on, which=NA],
     imatch = x[!i, on=on, which=NA])
}

fm(x, i, on="a", which="xmatch")
[1]  2  3 NA NA
fm(x, i, on="a", which="xnomatch")
[1] 1
fm(x, i, on="a", which="imatch")
[1] 1 2
fm(x, i, on="a", which="inomatch")
[1] 3 4
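
As a minimal sketch of how the proposed values could be validated, the helper above can be combined with base R's `match.arg()`, so that an unsupported string fails early instead of silently returning NULL from `switch()`. The name `fm2` is hypothetical and is not part of data.table; the mapping is the same as in `fm`.

```r
library(data.table)

# Hypothetical wrapper (not a data.table API): same mapping as fm(),
# but `which` is checked against the four proposed values.
fm2 = function(x, i, on, which=c("xmatch", "xnomatch", "imatch", "inomatch")) {
  which = match.arg(which)  # errors on any value outside the four choices
  switch(which,
    xmatch   = x[i,  on=on, which=TRUE],
    xnomatch = x[!i, on=on, which=TRUE],
    inomatch = x[i,  on=on, which=NA],
    imatch   = x[!i, on=on, which=NA])
}

x = data.table(a=1:3, x=c(NA, 10, NA))
i = data.table(a=2:5, y=c(20, 10, 20, 30))

fm2(x, i, on="a", which="inomatch")

# An unsupported value is rejected by match.arg()
bad = tryCatch(fm2(x, i, on="a", which="ymatch"), error=function(e) "error")
```

Note that `match.arg()` also accepts unambiguous partial matches such as "xm", which may or may not be desirable for a real implementation.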

jangorecki commented 1 month ago

Is there anything wrong with swapping places of x and i?

x[i,...]
i[x,...]
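
For illustration, here is a sketch of what the swap could look like on the example data from the original post. It assumes the join keys are unique in both tables; with duplicate keys the inner join may return repeated row numbers, in the order of x, so the result might need `sort(unique(...))` to line up with `which=NA`.

```r
library(data.table)

x = data.table(a=1:3, x=c(NA, 10, NA))
i = data.table(a=2:5, y=c(20, 10, 20, 30))

# Row numbers of i that have no match in x ("inomatch"):
# anti-join with i on the left-hand side
inomatch = i[!x, on="a", which=TRUE]
inomatch
# [1] 3 4

# Row numbers of i that have a match in x ("imatch"):
# inner join with i on the left-hand side, dropping unmatched rows of x
imatch = i[x, on="a", which=TRUE, nomatch=NULL]
imatch
# [1] 1 2
```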

AngelFelizR commented 1 month ago

I don't see much benefit in adding a feature that is easy to build with the tools we already have.