Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Ability to have a "Surv" class data column in a data.table object #1204

Open dataPulverizer opened 9 years ago

dataPulverizer commented 9 years ago
## Feature request: ability to have a "Surv" class data column in a data.table object
## Currently using data.table version 1.9.4 on R version 3.2.0

library(data.table)
library(survival)

## Lung dataset from the {survival} package
lung1 <- lung

## Create a "Surv" class column
lung1$Surv <- with(lung, Surv(time, status))

head(lung1)
#  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss  Surv
#1    3  306      2  74   1       1       90       100     1175      NA  306 
#2    3  455      2  68   1       0       90        90     1225      15  455 
#3    3 1010      1  56   1       0       90        90       NA      15 1010+
#4    5  210      2  57   1       1       90        60     1150      11  210 
#5    1  883      2  60   1       0      100        90       NA       0  883 
#6   12 1022      1  74   1       1       50        80      513       0 1022+

## Convert data frame to data table
lung1 <- data.table(lung1)

## Error from the "Surv" column
head(lung1)
#Error in `[.Surv`(x, , 2) : 
#  invalid to set the class to matrix unless the dimension attribute is of length 2 (was 0)

Thank you

tdhock commented 7 years ago

I just downloaded data.table from GitHub. I can confirm that this is still an issue. Here is my session info.

thocking@silene:~/R$ R --vanilla

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(survival)
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-02-16 18:03:14 UTC
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> data.table(surv=Surv(1, 5, type="interval2"))
Error in `[.Surv`(x, , 3) : 
  invalid to set the class to matrix unless the dimension attribute is of length 2 (was 0)
> devtools::session_info()
Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_CA:en                    
 collate  en_CA.UTF-8                 
 tz       <NA>                        
 date     2017-02-16                  

Packages -----------------------------------------------------------------------
 package    * version     date       source                                
 data.table * 1.10.5      2017-02-16 Github (Rdatatable/data.table@9fadbcd)
 devtools     1.12.0.9000 2016-08-12 Github (hadley/devtools@565ac15)      
 digest       0.6.10      2016-08-02 CRAN (R 3.2.3)                        
 lattice      0.20-34     2016-09-06 CRAN (R 3.3.2)                        
 Matrix       1.2-7.1     2016-09-01 CRAN (R 3.3.2)                        
 memoise      1.0.0       2016-01-29 CRAN (R 3.2.3)                        
 survival   * 2.40-1      2016-10-30 CRAN (R 3.3.2)                        
 withr        1.0.2       2016-06-20 CRAN (R 3.2.3)                        
> 

pham000 commented 3 months ago

Thanks for all the work on data.table -- the package has been indispensable.

However, this issue forces me to transition out of data.table whenever survival data is involved. Any possible updates on a fix? Thanks!

# with data.table 1.15.0

> dt = data.table(survival::myeloma)
> dt[, surv := survival::Surv(futime, death)]
Error in `[.data.table`(dt, , `:=`(surv, survival::Surv(futime, death))) : 
  Supplied 7764 items to be assigned to 3882 items of column 'surv'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
In addition: Warning message:
In `[.data.table`(dt, , `:=`(surv, survival::Surv(futime, death))) :
  2 column matrix RHS of := will be treated as one vector

tdhock commented 3 months ago

confirming this is still an issue on master. I also get a new error (node stack overflow) when running my old example with type="interval2" on Windows:

> library(data.table)
data.table 1.15.99 IN DEVELOPMENT built 2024-08-07 15:42:29 UTC using 3 threads (see ?getDTthreads).  Latest news: r-datatable.com
> data.table(surv=Surv(1, 5, type="interval2"))
Error in as.data.frame.model.matrix(x, ...) : node stack overflow
> dt = data.table(survival::myeloma)
> dt[, surv := survival::Surv(futime, death)]
Error in `[.data.table`(dt, , `:=`(surv, survival::Surv(futime, death))) : 
  Supplied 7764 items to be assigned to 3882 items of column 'surv'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
In addition: Warning message:
In `[.data.table`(dt, , `:=`(surv, survival::Surv(futime, death))) :
  2 column matrix RHS of := will be treated as one vector

tdhock commented 3 months ago

the underlying issue is that a Surv object is stored as either an Nx3 numeric matrix

> str(Surv(1, 5, type="interval2"))
 'Surv' num [1, 1:3] [1, 5]
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "time1" "time2" "status"
 - attr(*, "type")= chr "interval"

or an Nx2 numeric matrix

> str(Surv(1:2, 0:1))
 'Surv' num [1:2, 1:2] 1+ 2 
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "time" "status"
 - attr(*, "type")= chr "right"

this issue is tagged "non-atomic column" but actually Surv is atomic:

> is.atomic(Surv(1, 5, type="interval2"))
[1] TRUE

it just has a custom length method:

> survival:::length.Surv
function (x) 
nrow(x)
<bytecode: 0xb3211b8>
<environment: namespace:survival>
> length(Surv(1, 5, type="interval2"))
[1] 1
> length(as.matrix(Surv(1, 5, type="interval2")))
[1] 3

data.table sees the length of the underlying matrix (3) rather than the Surv length (1), which I believe is a bug. However, it is not clear to me from the ?data.table docs whether this should be supported. The only mention of length I see is:

     ...: Just as ‘...’ in data.frame. Usual recycling rules are
          applied to vectors of different lengths to create a list of
          equal length vectors.
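
To make the length mismatch concrete, here is a minimal sketch, assuming only that the survival package is installed. It compares the length reported by the Surv method with the number of cells in the underlying matrix; the variable name s is just for illustration.

```r
library(survival)

# Two right-censored observations, stored internally as a 2 x 2 numeric matrix
s <- Surv(1:2, 0:1)

# length() dispatches to survival's length method for Surv: one per observation
length(s)            # 2

# Dropping the class exposes the plain matrix, whose length counts every cell
length(unclass(s))   # 4
```

This factor of two matches the "Supplied 7764 items to be assigned to 3882 items" message reported earlier in the thread.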

pham000 commented 3 months ago

Thanks for looking back into it!

tdhock commented 3 months ago

on Linux, my stack overflow becomes a "too close to the limit" error, which looks like this:

> data.table(x=1:2,y=Surv(5:6,7:8,type="interval2"))
Error: C stack usage  9521904 is too close to the limit
> traceback()
1621: mode(expr)
1620: mode(expr) %in% c("call", "expression", "(", "function")
1619: deparse(substitute(x))
1618: as.data.frame.model.matrix(x, ...)
1617: as.data.frame.Surv(x, ...)
1616: as.data.frame(x, ...)
1615: as.data.table(as.data.frame(x, ...), ...)
1614: as.data.table.default(xi, keep.rownames = keep.rownames)
1613: as.data.table(xi, keep.rownames = keep.rownames)
...
8: as.data.table(xi, keep.rownames = keep.rownames)
7: as.data.table.list(as.list(x), keep.rownames = keep.rownames, 
       ...)
6: as.data.table.data.frame(as.data.frame(x, ...), ...)
5: as.data.table(as.data.frame(x, ...), ...)
4: as.data.table.default(xi, keep.rownames = keep.rownames)
3: as.data.table(xi, keep.rownames = keep.rownames)
2: as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names, 
       .named = nd$.named)
1: data.table(x = 1:2, y = Surv(5:6, 7:8, type = "interval2"))

tdhock commented 3 months ago

this is an inefficient work-around/hack, but you can use a list column with the current code (each list element is a Surv object holding one observation).

> (myeloma_dt <- data.table(myeloma)[, list_of_Surv := split(Surv(futime,death),.I)][])
         id  year entry futime death list_of_Surv
      <int> <int> <int>  <int> <int>       <list>
   1:     1    57     0   1431     1         1431
   2:     2    61     0    686     1          686
   3:     3    53     0   6270     1         6270
   4:     4    66     0    365     1          365
   5:     5    67     0   1340     1         1340
  ---                                            
3878:  3910    95    40     42     0          42+
3879:  3911    95   347    348     0         348+
3880:  3912    96    28     31     0          31+
3881:  3913    95   221    223     0         223+
3882:  3914    94   497    498     0         498+

Then you would have to use do.call with c to get a regular Surv back, as below.

> str(do.call(c, myeloma_dt$list_of_Surv))
 'Surv' num [1:3882, 1:2] 1431  686 6270  365 1340 1567 3797    5  121  822 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "time" "status"
 - attr(*, "type")= chr "right"
> myeloma_dt[c(1,2,3880), do.call(c, list_of_Surv)]
[1] 1431   686    31+
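
The round trip above can be wrapped in two small helpers. This is only a sketch: surv_to_list and list_to_surv are hypothetical names, not part of data.table or survival, and the sketch assumes a survival version that provides a c() method for Surv objects (as the do.call(c, ...) call above relies on).

```r
library(data.table)
library(survival)

# Hypothetical helpers, not provided by either package.
# Split a Surv object into a list with one single-observation Surv per element.
surv_to_list <- function(s) lapply(seq_len(length(s)), function(i) s[i])
# Combine a list of Surv objects back into one Surv object via c().
list_to_surv <- function(l) do.call(c, l)

# Small made-up example data: three observations, the second one censored
dt <- data.table(time = c(5, 8, 12), status = c(1, 0, 1))

# Store the Surv values as a list column, one element per row
dt[, surv_list := surv_to_list(Surv(time, status))]

# Recover a regular Surv object when a modelling function needs one
s <- list_to_surv(dt$surv_list)
```

As with the original workaround, this is inefficient for large tables because every row holds its own small Surv object.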

tdhock commented 3 months ago

data.table supports other kinds of custom columns (bit64, nanotime, xts), so it seems like in principle Surv could be supported too.

pham000 commented 3 months ago

Thanks -- I think the workaround will work for me for now! I'll have to digest what it's doing a bit... Thanks again for the help/advice.