Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Return a data.table without the key #981

Closed geneorama closed 1 month ago

geneorama commented 9 years ago

It would be nice to be able to return a data.table on the fly without the key.

This would be useful for things like regressions where you might want to keep a key, but you don't want to include it in the regression. Of course I could use my own function, but I would prefer to use something standard.

Example function

Perhaps there's something more elegant / obvious?

keyless <- function(x){
    x[ , -which(colnames(x) %in% key(x)), with=FALSE]
}

Example usage:

library(data.table)
## Example using the rock data, with an additional column ID which 
## in a real example may be used to join different data sets.
dt <- data.table(id=paste0("rock", sprintf("%02d", 1:48)), rock)
setkey(dt, id)

## View the structure:
str(dt)

# Classes ‘data.table’ and 'data.frame':  48 obs. of  5 variables:
#  $ id   : chr  "rock01" "rock02" "rock03" "rock04" ...
#  $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
#  $ peri : num  2792 3893 3931 3869 3949 ...
#  $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
#  $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
#  - attr(*, ".internal.selfref")=<externalptr> 
#  - attr(*, "sorted")= chr "id"

## Define "keyless"
keyless <- function(x){
    x[ , -which(colnames(x) %in% key(x)), with=FALSE]
}
## Do a regression
## Obviously we want to exclude the column of identifiers, so we use keyless
lm(area~., keyless(dt))

# Call:
# lm(formula = area ~ ., data = keyless(dt))
# 
# Coefficients:
# (Intercept)         peri        shape         perm  
#    -407.069        2.193     2992.314        2.549  

I know that others have mentioned this, but I couldn't find an existing issue.

Thank you

yitang commented 9 years ago

.SD and .SDcols will do the job.

R> head(dt)
                   ymd  london  pairs berlin
1: 1900-01-01 12:00:00 0.62158 0.8151 0.2893
2: 1900-01-02 12:00:00 0.09772 0.7228 0.5576
3: 1900-01-03 12:00:00 0.65804 0.8039 0.9895
4: 1900-01-04 12:00:00 0.87387 0.2731 0.1960
5: 1900-01-05 12:00:00 0.75414 0.4138 0.4678
6: 1900-01-06 12:00:00 0.60392 0.6056 0.2084

R> lm(london ~ ., dt[, .SD, .SDcol = -key(dt)])

Call:
lm(formula = london ~ ., data = dt[, .SD, .SDcol = -key(dt)])

Coefficients:
(Intercept)        pairs       berlin  
    0.50158      0.00231     -0.00230  

enjoy native data.table :)
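
The table in the comment above is not reproducible from the thread, so here is a minimal sketch of the same idea on the rock data from the opening post. The argument is spelled `.SDcols` here; the `.SDcol` spelling used above presumably relies on R's partial argument matching. The names `no_key`, `same`, and `fit` are illustrative only.

```r
library(data.table)

# Same setup as the opening post: the built-in rock data plus a keyed id column.
dt <- data.table(id = paste0("rock", sprintf("%02d", 1:48)), rock)
setkey(dt, id)

# Drop the key columns by negating them in .SDcols.
no_key <- dt[, .SD, .SDcols = -key(dt)]
names(no_key)

# The key columns can also be excluded by name with setdiff().
same <- dt[, .SD, .SDcols = setdiff(names(dt), key(dt))]
identical(names(no_key), names(same))

# The identifier column no longer enters the regression.
fit <- lm(area ~ ., data = no_key)
names(coef(fit))
```

The `setdiff()` form is a little longer, but it reads as an explicit list of the columns to keep rather than a negation.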

geneorama commented 9 years ago

@yitang

I just saw your response, and thank you! However, I still think it would be useful (simpler and easier to read) to have a function that returns a data.table without the key, similar to the coredata function in the package zoo.

You make a great argument to simply rely on the native functionality, and this is probably a question of design. I think the coredata (or whatever) function would be nice, but I can see the other side here too.

BUT, I would much prefer dt[, .SD, .SDcol = -key(dt)] over dt[ , -which(colnames(dt) %in% key(dt)), with=FALSE], so thanks for that! I'll definitely use that over the original (but I would still personally prefer coredata(dt) or even keyless(dt)).

-Gene

arunsrinivasan commented 9 years ago

Gene, I've marked as FR, but at the moment, I don't see a reason "for". It seems reasonable to me to write your own function, as it's a very special case of a subset operation. Are there other compelling cases where you need this?

jangorecki commented 9 years ago

@geneorama What you suggest is a simple wrapper:

keyless <- function(x) x[, .SD, .SDcol = -key(x)]

I understand there are cases where it is useful, but data.table is still more focused on providing a wide and efficient framework for manipulating tabular data than on direct functions for something as basic as the above. If you strongly believe it should be included in master, you can try a PR :+1:

geneorama commented 9 years ago

After six months I seem to be the only one who thinks this is a good idea, so I'll just stick with a custom function.

mattdowle commented 9 years ago

It doesn't seem like a bad idea to me. No objection to adding it. Not sure the best name. Would we need to select the key columns only sometimes as well - what would that function be called? key() already used so maybe keycolumns() and valuecolumns(), or keydata() and valuedata()? Hm.
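
A rough sketch of what such a pair of helpers might look like, purely for illustration. `keycolumns()` and `valuecolumns()` are hypothetical names taken from the comment above and are not part of data.table; the example table and its column names are made up as well.

```r
library(data.table)

# Hypothetical helpers: these names were only floated in this thread and
# are not part of data.table.
keycolumns <- function(x) {
  stopifnot(is.data.table(x), haskey(x))
  cols <- key(x)
  x[, .SD, .SDcols = cols]
}

valuecolumns <- function(x) {
  stopifnot(is.data.table(x))
  cols <- setdiff(names(x), key(x))  # every column when x has no key
  x[, .SD, .SDcols = cols]
}

# Made-up example table keyed on two columns.
dt <- data.table(id = c("b", "a", "c"), grp = c(1L, 1L, 2L),
                 v1 = c(0.5, 1.5, 2.5), v2 = c(10, 20, 30),
                 key = c("id", "grp"))

names(keycolumns(dt))
names(valuecolumns(dt))
```

Computing the column names first and passing them through `.SDcols` keeps both helpers symmetric, and `valuecolumns()` falls back to all columns when the table has no key.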

geneorama commented 9 years ago

I was going to explain how I thought it was a bad idea... but I re-changed (?) my mind, and the example I worked out turned out to validate my original suggestion.

I think it could be confusing with .SDcols but it could be pretty useful otherwise.

This is an example of a pretty typical workflow for me:

EDIT: Also, I called it dekey... but I don't love that name either. You wouldn't want a devalue function, right? The zoo library uses coredata, which I don't like but can't beat.

library(data.table)
set.seed(1)
data_full <- data.table(mykey = letters,
                        group = c(rep("train", 10), rep("test", 16)),
                        x1 = rnorm(26), x2 = rnorm(26), x3 = rnorm(26), x4 = rnorm(26), 
                        x5 = rnorm(26), x6 = rnorm(26), x7 = rnorm(26), x8 = rnorm(26), 
                        y = sample(c(0,1), 26, replace=TRUE), 
                        key = c("mykey", "group"))
dekey <- function(x) x[, .SD, .SDcol = -key(x)]

## Regress on some different column subsets
## Perhaps create copies of the subsets for future plotting and analysis 
d1 <- data_full[ , list(x2,x4,x6,x8,y), keyby=list(mykey, group)]
d2 <- data_full[ , list(x1,x3,x5,y), keyby=list(mykey, group)]

glm1 <- glm(y ~ ., data = dekey(d1[group=="test"]), family = "binomial")
glm2 <- glm(y ~ ., data = dekey(d2[group=="test"]), family = "binomial")

## To create a data.table of predictions the keys have to be added back,
## and we're relying on the data being in the same order
pred1 <- data.table(data_full[ , list(mykey, group)],
                    yhat = predict(glm1, data_full),
                    key = c("mykey", "group"))
pred2 <- data.table(data_full[ , list(mykey, group)],
                    yhat = predict(glm2, data_full),
                    key = c("mykey", "group"))

## Merge in predictions as needed
data_full[pred1]
data_full[pred2]
## Merge in predictions as needed e.g. for plotting
library(ggplot2)
ggplot(data_full[pred1]) + aes(x2, yhat, colour = group) + geom_point(size=9)
ggplot(data_full[pred2]) + aes(x2, yhat, colour = group) + geom_point(size=9)
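
The workflow above rebuilds a keyed table of predictions and notes that it relies on row order. One possible alternative, sketched here under the assumption that predictions are wanted for every row of the same table, is to add them by reference with `:=`, so that no keys need to be re-attached or joined. The names `dat`, `fit`, and `yhat` are illustrative, and the table is a cut-down version of the one above.

```r
library(data.table)
set.seed(1)

# Cut-down version of the example table: two predictors instead of eight.
dat <- data.table(mykey = letters,
                  group = c(rep("train", 10), rep("test", 16)),
                  x1 = rnorm(26), x2 = rnorm(26),
                  y = sample(c(0, 1), 26, replace = TRUE),
                  key = c("mykey", "group"))

# Naming the predictors in the formula means the key columns never need
# to be dropped from the data passed to glm().
fit <- glm(y ~ x1 + x2, data = dat[group == "test"], family = "binomial")

# Add predictions as a new column by reference; each value stays on the
# row it was computed from, and the key is left in place.
dat[, yhat := predict(fit, newdata = .SD, type = "response")]

key(dat)
head(dat)
```

This trades the `y ~ .` shorthand for an explicit formula, which may be less convenient when there are many predictors.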

raneameya commented 5 years ago

How about getDT? Would it be a good idea to have one function with the following arguments -

joshhwuu commented 1 month ago

Quick follow-up on this issue: does anyone have suggestions on how best to close it?

geneorama commented 1 month ago

I opened the issue to see what people thought, and a decade later I think it's safe to close the polls.