Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.61k stars 984 forks source link

Should .SD include a column used to create an ad-hoc 'by'? #6388

Open MichaelChirico opened 2 months ago

MichaelChirico commented 2 months ago
DT = data.table(a = rep(letters, 2), b = 1)
DT[1:20, a := paste0(0, a)]
DT[15:26, a := paste0(a, 1)]
DT[, if (.N == 2L) .SD[1], by = .(trim0 = gsub("^0", "", a))]
#      trim0     b
#     <char> <num>
#  1:      a     1
#  2:      b     1
#  3:      c     1
#  4:      d     1
#  5:      e     1
#  6:      f     1
#  7:      g     1
#  8:      h     1
#  9:      i     1
# 10:      j     1
# 11:      k     1
# 12:      l     1
# 13:      m     1
# 14:      n     1

I might have expected .SD to include a here, since a is not in by -- it' just used in by to generate a new ad hoc column. Such mappings need not be invertible, as is the case here, so we can't be sure a could be recovered from trim0, i.e., the current behavior generates some loss of information.

In this simple table, the workaround is pretty easy (.SD[1] --> .(a = a[1], b = b[1]), but this will quickly get ugly.

If we changed behavior so that a is in .SD, getting the old behavior back (if so desired) would mean doing .SDcols = !"a", i.e., the workaround for that default would be a lot easier/more "canonical".

Then again I'm not sure if it's worth introducing a potentially breaking change for this, or if it's a bug, or if anyone else even agrees about this behavior being strange.

Anirban166 commented 2 months ago

Preventing the loss of information sounds reasonable but like you mentioned this change could potentially break existing code that relies on .SD excluding all by columns. Since this behavior isn't necessarily a bug and more of a design decision in improving intuitiveness, I think whether to change it or not would depend on things like maintaining backward compatibility. (So my two cents here would be to have an argument-controlled option to conditionally include such columns in .SD after it's tested to not be a breaking change)

tdhock commented 2 months ago

i never noticed this. I do use the ad-hoc by, but in that case I don't want the original variable. (it's a feature, not a bug)

tdhock commented 2 months ago

I couldn't resist to try it myself with two variables

> data.table(x=1:2, y=3:4, z=5:6)[, .SD, by=.(xy=paste(x,y))]
       xy     z
   <char> <int>
1:    1 3     5
2:    2 4     6

I believe this is also a feature not a bug.

TysonStanley commented 2 months ago

I think I agree, since we can fairly easily add it to by to keep it:

data.table(x=1:2, y=3:4, z=5:6)[, .SD, by=.(xy=paste(x,y), x)]
MichaelChirico commented 2 months ago

we can fairly easily add it to by to keep it

That changes the computation, though -- in my case, .(trim0 = foo(a)), trim0 has fewer unique values than a

in that case I don't want the original variable.

I'm not sure what's more common, but certainly I filed this bug because I wanted the original variable.

Some more context: I had filepaths: DT=data.table(filename = ...), I wanted to return the full path for any path that is the only one in its (top-level) directory:

DT[, if (.N == 1) filename, by=.(tld = sub("/.*", "", filename))]

That works well, but I also wanted some other metadata associated with each filename, hence wanting .SD.

TysonStanley commented 2 months ago

Good point, when you are simplifying the variable, adding the original to by would be problematic. And your additional context helps. Something like you are requesting would help in that case reduce the amount you'd have to do in j

MichaelChirico commented 2 months ago

Right. It seems to me much easier (and readable) to say "I don't want filename in .SD" by doing .SDcols = !"filename" than it is to recover filename in my use case (c(list(filename = filename), .SD) -- always hate that).

TysonStanley commented 2 months ago

Yeah, done that a ton of times and hated it every time. This is obviously more about by behavior but the more feature rich .SDcols continues to get, the better.

As for this one, it might be too much of an edge case for an argument (not sure) and changing the default is likely to break a decent amount of code. Would be open to others opinions/data needs.

rikivillalba commented 2 months ago

.SDcols = patterns("") is a way to retrieve all columns, even those in by

tdhock commented 2 months ago

?.SD clearly documents the existing behavior

        • ‘.SD’ is a data.table containing the *S*ubset of ‘x’'s
          *D*ata for each group, excluding any columns used in ‘by’ (or
          ‘keyby’).
jangorecki commented 1 month ago

It is good idea to work about this inconsistency when columns are used in by in an expression, but IMO that would be heavily affecting existing code, therefore adding 2.0 milestone.