Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Convert from `data.table` to `data.frame`/`matrix` helper functions #5382

Open dereckmezquita opened 2 years ago

dereckmezquita commented 2 years ago

Could I offer the following helper functions? They would cut down on the verbosity of code that uses data.table.

I'll be using this data as an example dataset:

dt = data.table::as.data.table(iris)
dt[, sample := paste(Species, seq_len(.N), sep = " ")]

Setting a matrix by reference

I often find myself working with data.frame and matrix objects. We currently have a setDF function but no "setMatrix"/"setMT" equivalent.

setMT = function(x, rownames = NULL) {
    return(as.matrix(setDF(x = x, rownames = rownames)))
}

setMT(dt, rownames = dt$sample)

to.X family of functions that move a column to rownames

Here I propose a family of functions that convert to a given class (data.frame or matrix) while moving one of the columns to the rownames of the result.

This is useful because I work with data.frames a lot when interacting with base R and other packages. Since data.table doesn't allow rownames, I have to keep this information as a column and then move it like this:

data.table::setDF(dt, rownames = dt$sample)

dt$sample = NULL

I propose to simplify this to a single function call which could move the column to the rownames of the resulting object.

Convert to a data.frame

to.data.frame = function(x, id.col = NULL, drop.id.col = TRUE, ...) {
    # work on a copy so the caller's data.table is left untouched
    ans <- data.table::copy(x)

    if(!is.null(id.col)) {
        if(!id.col %in% colnames(ans)) {
            stop(sprintf('Column "%s" not found.', id.col), call. = FALSE)
        }

        # convert in place, using the id column's values as rownames
        data.table::setDF(ans, rownames = ans[[id.col]])

        if(drop.id.col) {
            ans[[id.col]] = NULL
        }
    } else {
        data.table::setDF(ans)
    }

    return(ans)
}

Thus converting to a data.frame with rownames is simplified to:

to.data.frame(dt, id.col = "sample")

Convert to a matrix

to.matrix = function(x, id.col = NULL, drop.id.col = TRUE, ...) {
    return(as.matrix(to.data.frame(x, id.col = id.col, drop.id.col = drop.id.col)))
}

to.matrix(dt, id.col = "sample")
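
For comparison, here is a minimal sketch of the same conversion using data.table's existing as.matrix method. It assumes the installed data.table's as.matrix method supports the rownames argument, and dt_num is a made-up numeric table used only for illustration (the iris example would give a character matrix because of the Species column).

```r
library(data.table)

# Hypothetical small numeric table so the resulting matrix stays numeric.
dt_num <- data.table(sample = c("s1", "s2", "s3"), x = c(1, 2, 3), y = c(4, 5, 6))

# rownames names the column to use as row names;
# that column is left out of the matrix body.
m <- as.matrix(dt_num, rownames = "sample")

rownames(m)
colnames(m)
```

If this behaves as assumed, the proposed to.matrix helper largely overlaps with what the existing method already offers.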

sessionInfo()

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: aarch64-apple-darwin21.3.0 (64-bit)
Running under: macOS Monterey 12.2.1

Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.1.3/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] datk_0.0.1

loaded via a namespace (and not attached):
 [1] ComplexHeatmap_2.10.0 compiler_4.1.3        pillar_1.7.0          RColorBrewer_1.1-3    iterators_1.0.14     
 [6] tools_4.1.3           digest_0.6.29         lifecycle_1.0.1       tibble_3.1.7          gtable_0.3.0         
[11] clue_0.3-60           pkgconfig_2.0.3       png_0.1-7             rlang_1.0.2           foreach_1.5.2        
[16] DBI_1.1.2             cli_3.3.0             microbenchmark_1.4.9  parallel_4.1.3        stringr_1.4.0        
[21] dplyr_1.0.9           cluster_2.1.3         generics_0.1.2        vctrs_0.4.1           GlobalOptions_0.1.2  
[26] S4Vectors_0.32.4      IRanges_2.28.0        tidyselect_1.1.2      stats4_4.1.3          grid_4.1.3           
[31] glue_1.6.2            data.table_1.14.2     R6_2.5.1              GetoptLong_1.0.5      fansi_1.0.3          
[36] purrr_0.3.4           ggplot2_3.3.6         magrittr_2.0.3        scales_1.2.0          codetools_0.2-18     
[41] matrixStats_0.62.0    ellipsis_0.3.2        BiocGenerics_0.40.0   assertthat_0.2.1      shape_1.4.6          
[46] circlize_0.4.15       colorspace_2.0-3      utf8_1.2.2            stringi_1.7.6         doParallel_1.0.17    
[51] munsell_0.5.0         crayon_1.5.1          rjson_0.2.21         

jangorecki commented 2 years ago

Hi, thank you for the code and the proposal. Converting to or from a matrix by reference is not possible, so the set* naming should be avoided. A data.frame is a collection of C arrays, where each column is a separate array. A matrix is a single C array whose attributes define its shape. I think it is better to improve existing methods rather than add new functions.
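
The layout difference can be checked from base R. This is a minimal sketch with made-up variable names (df, m, v), not code from data.table itself: a data.frame is a list with one vector per column, while a matrix is one atomic vector carrying a dim attribute, so converting between the two has to allocate new storage.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
m <- as.matrix(df)

# A data.frame is a list: each column is its own vector.
typeof(df)   # "list"
length(df)   # 2, one element per column

# A matrix is a single atomic vector plus a dim attribute.
typeof(m)    # "double"
dim(m)       # 3 2
length(m)    # 6, all cells in one vector

# Dropping the attributes leaves the cells in column-major order.
v <- as.vector(m)
v            # 1 2 3 4 5 6
```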