Bioconductor / SummarizedExperiment

SummarizedExperiment container
https://bioconductor.org/packages/SummarizedExperiment
29 stars 9 forks source link

Constructor could check that assay is in matric format #63

Closed TuomasBorman closed 2 years ago

TuomasBorman commented 2 years ago

Hi,

assay in assays needs to be in matrix format. However, constructor does not check that the assay is in right format

--> you can store DataFrame object as an assay --> because many functions expects assay in matrix format this leads to an error

We have noticed that many beginners make this same mistake. This is quite hard to solve especially when some function uses internal function that expects the data in matrix format, e.g. MatrixGenerics::rowSums2.

I was wondering if constructor could check that assay is in right format. If not, the constructor could throw an error, convert DataFrame into matrix (maybe not), or print a message (maybe the most preferable).

Below is an example that illustrates this issue

library(SummarizedExperiment)
library(MatrixGenerics)

nrows <- 20; ncols <- 6
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)

# Correctly constructed object
rse <- SummarizedExperiment(assays=SimpleList(counts=counts))
# No error
rowSums2(assay(rse))

# Convert into df
counts_df <- DataFrame(counts)
colnames(counts_df) <- rownames(colData)

# Constructor works with df as an assay
rse_df <- SummarizedExperiment(assays=SimpleList(counts=counts_df))

# Fails because requires matrix
rowSums2(assay(rse_df))

Error in MatrixGenerics:::.load_next_suggested_package_to_search(x) : 
    Failed to find a rowSums2() method for DFrame objects.
Session info R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=fi_FI.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 grid stats graphics grDevices utils datasets methods base other attached packages: [1] viridis_0.6.2 viridisLite_0.4.0 vegan_2.6-2 permute_0.9-7 [5] stringr_1.4.0 scater_1.22.0 scuttle_1.4.0 scales_1.1.1 [9] plotROC_2.2.1 patchwork_1.1.1 mia_1.3.22 MultiAssayExperiment_1.20.0 [13] TreeSummarizedExperiment_2.2.0 Biostrings_2.62.0 XVector_0.34.0 SingleCellExperiment_1.16.0 [17] SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 [21] IRanges_2.28.0 S4Vectors_0.32.4 BiocGenerics_0.40.0 MatrixGenerics_1.6.0 [25] matrixStats_0.61.0 knitr_1.38 ggsignif_0.6.3 ggrepel_0.9.1 [29] ggplotify_0.1.0 ggord_1.1.6 ggdendro_0.1.23 dplyr_1.0.8 [33] dendextend_1.15.2 cutpointr_1.1.1 ComplexHeatmap_2.11.1 circlize_0.4.14 [37] caret_6.0-90 lattice_0.20-45 ggplot2_3.3.5 BiocManager_1.30.16 [41] rmarkdown_2.13 phyloseq_1.38.0 loaded via a namespace (and not attached): [1] utf8_1.2.2 tidyselect_1.1.2 RSQLite_2.2.11 BiocParallel_1.28.3 [5] Rtsne_0.15 pROC_1.18.0 munsell_0.5.0 ScaledMatrix_1.2.0 [9] codetools_0.2-18 xgboost_1.5.2.1 future_1.24.0 withr_2.5.0 [13] colorspace_2.0-3 highr_0.9 rstudioapi_0.13 listenv_0.8.0 [17] Rdpack_2.3 labeling_0.4.2 GenomeInfoDbData_1.2.7 bit64_4.0.5 [21] farver_2.1.0 rhdf5_2.38.1 parallelly_1.30.0 vctrs_0.3.8 [25] treeio_1.18.1 generics_0.1.2 ipred_0.9-12 xfun_0.30 [29] R6_2.5.1 doParallel_1.0.17 ggbeeswarm_0.6.0 clue_0.3-60 [33] rsvd_1.0.5 bitops_1.0-7 rhdf5filters_1.6.0 microbiome_1.16.0 [37] cachem_1.0.6 gridGraphics_0.5-1 DelayedArray_0.20.0 assertthat_0.2.1 [41] nnet_7.3-17 beeswarm_0.4.0 gtable_0.3.0 beachmat_2.10.0 [45] Cairo_1.5-15 globals_0.14.0 timeDate_3043.102 rlang_1.0.2 [49] GlobalOptions_0.1.2 splines_4.1.2 lazyeval_0.2.2 ModelMetrics_1.2.2.2 [53] yaml_2.3.5 reshape2_1.4.4 tools_4.1.2 lava_1.6.10 [57] ellipsis_0.3.2 decontam_1.14.0 jquerylib_0.1.4 biomformat_1.22.0 [61] RColorBrewer_1.1-2 proxy_0.4-26 Rcpp_1.0.8.3 plyr_1.8.7 [65] sparseMatrixStats_1.6.0 zlibbioc_1.40.0 purrr_0.3.4 RCurl_1.98-1.6 [69] rpart_4.1-15 GetoptLong_1.0.5 cowplot_1.1.1 cluster_2.1.2 [73] DECIPHER_2.22.0 magrittr_2.0.2 data.table_1.14.2 evaluate_0.15 [77] gridExtra_2.3 shape_1.4.6 compiler_4.1.2 tibble_3.1.6 [81] crayon_1.5.1 htmltools_0.5.2 mgcv_1.8-38 tidyr_1.2.0 [85] ANCOMBC_1.4.0 lubridate_1.8.0 DBI_1.1.2 MASS_7.3-56 [89] Matrix_1.4-0 ade4_1.7-18 cli_3.2.0 rbibutils_2.2.7 [93] parallel_4.1.2 gower_1.0.0 igraph_1.2.11 pkgconfig_2.0.3 [97] recipes_0.1.17 foreach_1.5.2 vipor_0.4.5 bslib_0.3.1 [101] DirichletMultinomial_1.36.0 multtest_2.50.0 prodlim_2019.11.13 yulab.utils_0.0.4 [105] digest_0.6.29 tidytree_0.3.7 DelayedMatrixStats_1.16.0 rjson_0.2.21 [109] nloptr_1.2.2.3 lifecycle_1.0.1 nlme_3.1-157 jsonlite_1.8.0 [113] Rhdf5lib_1.16.0 BiocNeighbors_1.12.0 fansi_1.0.3 pillar_1.7.0 [117] fastmap_1.1.0 survival_3.2-13 glue_1.6.2 png_0.1-7 [121] iterators_1.0.14 bit_4.0.4 class_7.3-20 stringi_1.7.6 [125] sass_0.4.1 blob_1.2.2 BiocSingular_1.10.0 memoise_2.0.1 [129] e1071_1.7-9 irlba_2.3.5 future.apply_1.8.1 ape_5.6-2

-Tuomas

hpages commented 2 years ago

Hi Tuomas,

You're raising an interesting question.

The SummarizedExperiment() constructor function has intentionally loose checking. We actually want to allow an assay to be many different things, including a data-frame-like object (i.e. data.frame, DataFrame, data.table, tibble, etc...), and not just an ordinary matrix. For example, the assay can be an array with 3 dimensions or more, a dgCMatrix object from the Matrix package, an HDF5Array object from the HDF5Array package, a TileDBArray object from the TileDBArray package, a DelayedArray object from the DelayedArray package, etc...

The SummarizedExperiment() constructor doesn't even look at the class of the supplied assay(s). It only checks that they have dimensions that are compatible with the nrow() and ncol() of the SummarizedExperiment object to construct. This "loose contract" approach makes the SummarizedExperiment container very versatile and has made it possible to use out-of-memory representations for very big assays.

But yes, one downside of this approach is that it's hard to predict what operations the assays in a given SummarizedExperiment object are going to support, as you've noticed. However, so far the advantages of the "loose contract" approach have by far outweighed its disadvantages.

Here is what ?SummarizedExperiment says about the assays argument of the SummarizedExperiment() constructor function:

  assays: A ‘list’ or ‘SimpleList’ of matrix-like elements, or a
          matrix-like object (e.g. an ordinary matrix, a data frame, a
          DataFrame object from the ‘S4Vectors’ package, a sparseMatrix
          derivative from the ‘Matrix’ package, a DelayedMatrix object
          from the ‘DelayedArray’ package, etc...).  All elements of
          the list must have the same dimensions, and dimension names
          (if present) must be consistent across elements and with the
          row names of ‘rowRanges’ and ‘colData’.

As you can see, it doesn't say that the assays need to be "in matrix format", only that they must be matrix-like objects, and data frame and DataFrame objects are explicitly listed as valid assay types.

I would say that if more checks are needed on the assays of a SummarizedExperiment object, they should be performed by the workflow or pipeline that cares about the exact nature of those assays. For example if a workflow needs to call MatrixGenerics::rowSums2() at some point on the assay of a SummarizedExperiment object, it should check as early as possible that this operation will actually work.

Hope this makes sense.

Best, H.

TuomasBorman commented 2 years ago

Thanks for your reply and detailed explanation

Hmmm, yes, I see these advantages by not restrict data into "traditional" matrix format, and it is better this way. We have to figure out how to deal with this issue in our tools

-Tuomas