Open MichaelChirico opened 2 months ago
Preventing the loss of information sounds reasonable but like you mentioned this change could potentially break existing code that relies on .SD
excluding all by
columns. Since this behavior isn't necessarily a bug and more of a design decision in improving intuitiveness, I think whether to change it or not would depend on things like maintaining backward compatibility. (So my two cents here would be to have an argument-controlled option to conditionally include such columns in .SD
after it's tested to not be a breaking change)
i never noticed this. I do use the ad-hoc by, but in that case I don't want the original variable. (it's a feature, not a bug)
I couldn't resist to try it myself with two variables
> data.table(x=1:2, y=3:4, z=5:6)[, .SD, by=.(xy=paste(x,y))]
xy z
<char> <int>
1: 1 3 5
2: 2 4 6
I believe this is also a feature not a bug.
I think I agree, since we can fairly easily add it to by to keep it:
data.table(x=1:2, y=3:4, z=5:6)[, .SD, by=.(xy=paste(x,y), x)]
we can fairly easily add it to by to keep it
That changes the computation, though -- in my case, .(trim0 = foo(a))
, trim0
has fewer unique values than a
in that case I don't want the original variable.
I'm not sure what's more common, but certainly I filed this bug because I wanted the original variable.
Some more context: I had filepaths: DT=data.table(filename = ...)
, I wanted to return the full path for any path that is the only one in its (top-level) directory:
DT[, if (.N == 1) filename, by=.(tld = sub("/.*", "", filename))]
That works well, but I also wanted some other metadata associated with each filename
, hence wanting .SD
.
Good point, when you are simplifying the variable, adding the original to by
would be problematic. And your additional context helps. Something like you are requesting would help in that case reduce the amount you'd have to do in j
Right. It seems to me much easier (and readable) to say "I don't want filename
in .SD
" by doing .SDcols = !"filename"
than it is to recover filename
in my use case (c(list(filename = filename), .SD)
-- always hate that).
Yeah, done that a ton of times and hated it every time. This is obviously more about by
behavior but the more feature rich .SDcols
continues to get, the better.
As for this one, it might be too much of an edge case for an argument (not sure) and changing the default is likely to break a decent amount of code. Would be open to others opinions/data needs.
.SDcols = patterns("")
is a way to retrieve all columns, even those in by
?.SD clearly documents the existing behavior
• ‘.SD’ is a data.table containing the *S*ubset of ‘x’'s
*D*ata for each group, excluding any columns used in ‘by’ (or
‘keyby’).
It is good idea to work about this inconsistency when columns are used in by
in an expression, but IMO that would be heavily affecting existing code, therefore adding 2.0 milestone.
I might have expected
.SD
to includea
here, sincea
is not inby
-- it' just used inby
to generate a new ad hoc column. Such mappings need not be invertible, as is the case here, so we can't be surea
could be recovered fromtrim0
, i.e., the current behavior generates some loss of information.In this simple table, the workaround is pretty easy (
.SD[1]
-->.(a = a[1], b = b[1])
, but this will quickly get ugly.If we changed behavior so that
a
is in.SD
, getting the old behavior back (if so desired) would mean doing.SDcols = !"a"
, i.e., the workaround for that default would be a lot easier/more "canonical".Then again I'm not sure if it's worth introducing a potentially breaking change for this, or if it's a bug, or if anyone else even agrees about this behavior being strange.