Closed geneorama closed 1 month ago
.SD and .SDcol will do the job.
R> head(dt)
                   ymd  london  pairs berlin
1: 1900-01-01 12:00:00 0.62158 0.8151 0.2893
2: 1900-01-02 12:00:00 0.09772 0.7228 0.5576
3: 1900-01-03 12:00:00 0.65804 0.8039 0.9895
4: 1900-01-04 12:00:00 0.87387 0.2731 0.1960
5: 1900-01-05 12:00:00 0.75414 0.4138 0.4678
6: 1900-01-06 12:00:00 0.60392 0.6056 0.2084
R> lm(london ~ ., dt[, .SD, .SDcol = -key(dt)])
Call:
lm(formula = london ~ ., data = dt[, .SD, .SDcol = -key(dt)])
Coefficients:
(Intercept)        pairs       berlin
    0.50158      0.00231     -0.00230
enjoy native data.table :)
yi-tang
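For anyone reading later, here is a minimal self-contained sketch of the idiom above. The table and its column names are made up for illustration and are not the original dt. It uses the documented argument name .SDcols (the .SDcol spelling in the transcript appears to rely on R's partial argument matching) and the ! form of negation, which data.table accepts alongside -.

```r
library(data.table)

# Illustrative data only; not the dt from the transcript above.
set.seed(42)
dt <- data.table(ymd    = as.Date("1900-01-01") + 0:5,
                 london = runif(6),
                 paris  = runif(6),
                 berlin = runif(6),
                 key    = "ymd")

# Keep every column except the key column(s).
vals <- dt[, .SD, .SDcols = !key(dt)]
names(vals)  # "london" "paris" "berlin"

# The key column no longer leaks into the formula's right-hand side.
fit <- lm(london ~ ., data = vals)
names(coef(fit))  # "(Intercept)" "paris" "berlin"
```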
I just saw your response, and thank you! However, I still think it would be useful (simpler and easier to read) to have a function that returns a data.table without the key, similar to the coredata function in the package zoo.
You make a great argument to simply rely on the native functionality, and this is probably a question of design. I think the coredata (or whatever) function would be nice, but I can see the other side here too.
BUT, I would much prefer dt[, .SD, .SDcol = -key(dt)] over dt[ , -which(colnames(dt) %in% key(dt)), with=FALSE], so thanks for that! I'll definitely use that over the original (but I would still personally prefer coredata(dt) or even keyless(dt)).
-Gene
Gene, I've marked as FR, but at the moment, I don't see a reason "for". It seems reasonable to me to write your own function, as it's a very special case of a subset operation. Are there other compelling cases where you need this?
@geneorama What you suggest is a simple wrapper:
keyless <- function(x) x[, .SD, .SDcol = -key(x)]
I understand there are cases where it is useful, but data.table is still more focused on providing a broad and efficient framework for manipulating tabular data than on direct functions for something as basic as the above. If you strongly believe it should be included in master, you can try a PR :+1:
After six months I seem to be the only one who thinks this is a good idea, so I'll just stick with a custom function
It doesn't seem like a bad idea to me. No objection to adding it. Not sure of the best name. Would we sometimes need to select only the key columns as well, and what would that function be called? key() is already used, so maybe keycolumns() and valuecolumns(), or keydata() and valuedata()? Hm.
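To make the naming question concrete, here is a rough sketch of what such a pair might look like. keycolumns() and valuecolumns() are hypothetical names taken from the comment above; neither exists in data.table. The handling of unkeyed tables (empty result for keys, everything for values) is an assumption, not an agreed design.

```r
library(data.table)

# Hypothetical helper: return only the key columns of x.
keycolumns <- function(x) {
  stopifnot(is.data.table(x))
  if (!haskey(x)) return(data.table())  # assumption: no key means nothing to return
  key_cols <- key(x)
  x[, .SD, .SDcols = key_cols]
}

# Hypothetical helper: return every column that is not part of the key.
valuecolumns <- function(x) {
  stopifnot(is.data.table(x))
  value_cols <- setdiff(names(x), key(x))  # key(x) is NULL for unkeyed tables
  if (!length(value_cols)) return(data.table())
  x[, .SD, .SDcols = value_cols]
}

dt <- data.table(id = c("b", "a", "c"), grp = c("u", "v", "u"), x = 1:3,
                 key = c("id", "grp"))
names(keycolumns(dt))    # "id" "grp"
names(valuecolumns(dt))  # "x"
```

Using setdiff() rather than -key(x) sidesteps the question of what negating a NULL key should mean for an unkeyed table.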
I was going to explain how I thought it was a bad idea... but I changed my mind again, and the example I worked out turned out to validate my original suggestion.
I think it could be confused with .SDcols, but it could be pretty useful otherwise.
This is an example of a pretty typical workflow for me;
EDIT: Also, I called it dekey... but I don't love that name either. You wouldn't want a devalue function, right? The zoo library uses coredata, which I don't like but can't beat.
library(data.table)
set.seed(1)
data_full <- data.table(mykey = letters,
                        group = c(rep("train", 10), rep("test", 16)),
                        x1 = rnorm(26), x2 = rnorm(26), x3 = rnorm(26), x4 = rnorm(26),
                        x5 = rnorm(26), x6 = rnorm(26), x7 = rnorm(26), x8 = rnorm(26),
                        y = sample(c(0, 1), 26, replace = TRUE),
                        key = c("mykey", "group"))
dekey <- function(x) x[, .SD, .SDcol = -key(x)]
## Regress on some different column subsets
## Perhaps create copies of the subsets for future plotting and analysis
d1 <- data_full[ , list(x2,x4,x6,x8,y), keyby=list(mykey, group)]
d2 <- data_full[ , list(x1,x3,x5,y), keyby=list(mykey, group)]
glm1 <- glm(y ~ ., data = dekey(d1[group=="test"]), family = "binomial")
glm2 <- glm(y ~ ., data = dekey(d2[group=="test"]), family = "binomial")
## To create a data.table of predictions the keys have to be added back,
## and we're relying on the data being in the same order
pred1 <- data.table(data_full[ , list(mykey, group)],
                    yhat = predict(glm1, data_full),
                    key = c("mykey", "group"))
pred2 <- data.table(data_full[ , list(mykey, group)],
                    yhat = predict(glm2, data_full),
                    key = c("mykey", "group"))
## Merge in predictions as needed
data_full[pred1]
data_full[pred2]
## Merge in predictions as needed e.g. for plotting
library(ggplot2)
ggplot(data_full[pred1]) + aes(x2, yhat, colour = group) + geom_point(size=9)
ggplot(data_full[pred2]) + aes(x2, yhat, colour = group) + geom_point(size=9)
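On the "relying on the data being in the same order" caveat in the workflow above: one possible alternative, sketched here under simplified assumptions (a smaller made-up table, two predictors, the names dat, fit and yhat are illustrative), is to add the predictions by reference with := so each prediction stays on its own row and no re-keying or merge is needed.

```r
library(data.table)

# Smaller illustrative table; not the data_full from the workflow above.
set.seed(1)
dat <- data.table(id    = letters,
                  group = c(rep("train", 10), rep("test", 16)),
                  x1    = rnorm(26),
                  x2    = rnorm(26),
                  y     = sample(c(0, 1), 26, replace = TRUE),
                  key   = c("id", "group"))

# Fit on the non-key columns of the "test" rows.
value_cols <- setdiff(names(dat), key(dat))
fit <- glm(y ~ ., data = dat[group == "test", .SD, .SDcols = value_cols],
           family = "binomial")

# Add predictions as a new column by reference; the key is left untouched.
dat[, yhat := predict(fit, newdata = dat)]
key(dat)  # "id" "group"
```

This trades the separate pred tables for an extra column on the original table, which may or may not suit a workflow that keeps several model outputs apart.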
How about getDT? Would it be a good idea to have one function with the following arguments -
x: The data.table.
i: Rows to be subset, NULL by default, indicating all rows.
j: Can be a character vector of column names, an integer vector of column positions, or one of "key" or "value".
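A rough sketch of how the proposed getDT might behave, for discussion only. getDT is a hypothetical function from the comment above, not part of data.table; the default for j (all columns) and the error on an unkeyed table are assumptions. A column literally named "key" or "value" would be ambiguous under this design.

```r
library(data.table)

# Hypothetical helper following the proposed arguments x, i, j.
getDT <- function(x, i = NULL, j = names(x)) {
  stopifnot(is.data.table(x))
  # Resolve the "key" / "value" shortcuts into column names.
  if (identical(j, "key")) {
    if (!haskey(x)) stop("x has no key")  # assumption: treat as an error
    sel_cols <- key(x)
  } else if (identical(j, "value")) {
    sel_cols <- setdiff(names(x), key(x))
  } else if (is.numeric(j)) {
    sel_cols <- names(x)[j]
  } else {
    sel_cols <- j
  }
  out <- x[, sel_cols, with = FALSE]
  # NULL means all rows; otherwise subset by the supplied row index.
  if (!is.null(i)) {
    sel_rows <- i
    out <- out[sel_rows]
  }
  out
}

dt <- data.table(id = c("a", "b", "c"), x = 1:3, y = 4:6, key = "id")
names(getDT(dt, j = "value"))  # "x" "y"
names(getDT(dt, j = "key"))    # "id"
nrow(getDT(dt, i = 1:2))       # 2
```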
Quick follow-up on this issue: does anyone have suggestions on how best to close it?
I opened the issue to see what people thought, and a decade later I think it's safe to close the polls.
It would be nice to be able to return a data.table on the fly without the key.
This would be useful for things like regressions where you might want to keep a key, but you don't want to include it in the regression. Of course I could use my own function, but I would prefer to use something standard.
Example function
Perhaps there's something more elegant / obvious?
Example usage:
I know that others have mentioned this, but I couldn't find an existing issue.
Thank you