Closed TuomasBorman closed 2 years ago
Hi Tuomas,
You're raising an interesting question.
The SummarizedExperiment()
constructor function has intentionally loose checking. We actually want to allow an assay to be many different things, including a data-frame-like object (i.e. data.frame, DataFrame, data.table, tibble, etc...), and not just an ordinary matrix. For example, the assay can be an array with 3 dimensions or more, a dgCMatrix object from the Matrix package, an HDF5Array object from the HDF5Array package, a TileDBArray object from the TileDBArray package, a DelayedArray object from the DelayedArray package, etc...
The SummarizedExperiment()
constructor doesn't even look at the class of the supplied assay(s). It only checks that they have dimensions that are compatible with the nrow()
and ncol()
of the SummarizedExperiment object to construct. This "loose contract" approach makes the SummarizedExperiment container very versatile and has made it possible to use out-of-memory representations for very big assays.
But yes, one downside of this approach is that it's hard to predict what operations the assays in a given SummarizedExperiment object are going to support, as you've noticed. However, so far the advantages of the "loose contract" approach have by far outweighed its disadvantages.
Here is what ?SummarizedExperiment
says about the assays
argument of the SummarizedExperiment()
constructor function:
assays: A ‘list’ or ‘SimpleList’ of matrix-like elements, or a
matrix-like object (e.g. an ordinary matrix, a data frame, a
DataFrame object from the ‘S4Vectors’ package, a sparseMatrix
derivative from the ‘Matrix’ package, a DelayedMatrix object
from the ‘DelayedArray’ package, etc...). All elements of
the list must have the same dimensions, and dimension names
(if present) must be consistent across elements and with the
row names of ‘rowRanges’ and ‘colData’.
As you can see, it doesn't say that the assays need to be "in matrix format", only that they must be matrix-like objects, and data frame and DataFrame objects are explicitly listed as valid assay types.
I would say that if more checks are needed on the assays of a SummarizedExperiment object, they should be performed by the workflow or pipeline that cares about the exact nature of those assays. For example if a workflow needs to call MatrixGenerics::rowSums2()
at some point on the assay of a SummarizedExperiment object, it should check as early as possible that this operation will actually work.
Hope this makes sense.
Best, H.
Thanks for your reply and detailed explanation
Hmmm, yes, I see these advantages by not restrict data into "traditional" matrix format, and it is better this way. We have to figure out how to deal with this issue in our tools
-Tuomas
Hi,
assay
inassays
needs to be inmatrix
format. However, constructor does not check that theassay
is in right format--> you can store
DataFrame
object as anassay
--> because many functions expectsassay
inmatrix
format this leads to an errorWe have noticed that many beginners make this same mistake. This is quite hard to solve especially when some function uses internal function that expects the data in
matrix
format, e.g.MatrixGenerics::rowSums2
.I was wondering if constructor could check that assay is in right format. If not, the constructor could throw an error, convert
DataFrame
intomatrix
(maybe not), or print a message (maybe the most preferable).Below is an example that illustrates this issue
Session info
R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=fi_FI.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 grid stats graphics grDevices utils datasets methods base other attached packages: [1] viridis_0.6.2 viridisLite_0.4.0 vegan_2.6-2 permute_0.9-7 [5] stringr_1.4.0 scater_1.22.0 scuttle_1.4.0 scales_1.1.1 [9] plotROC_2.2.1 patchwork_1.1.1 mia_1.3.22 MultiAssayExperiment_1.20.0 [13] TreeSummarizedExperiment_2.2.0 Biostrings_2.62.0 XVector_0.34.0 SingleCellExperiment_1.16.0 [17] SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 [21] IRanges_2.28.0 S4Vectors_0.32.4 BiocGenerics_0.40.0 MatrixGenerics_1.6.0 [25] matrixStats_0.61.0 knitr_1.38 ggsignif_0.6.3 ggrepel_0.9.1 [29] ggplotify_0.1.0 ggord_1.1.6 ggdendro_0.1.23 dplyr_1.0.8 [33] dendextend_1.15.2 cutpointr_1.1.1 ComplexHeatmap_2.11.1 circlize_0.4.14 [37] caret_6.0-90 lattice_0.20-45 ggplot2_3.3.5 BiocManager_1.30.16 [41] rmarkdown_2.13 phyloseq_1.38.0 loaded via a namespace (and not attached): [1] utf8_1.2.2 tidyselect_1.1.2 RSQLite_2.2.11 BiocParallel_1.28.3 [5] Rtsne_0.15 pROC_1.18.0 munsell_0.5.0 ScaledMatrix_1.2.0 [9] codetools_0.2-18 xgboost_1.5.2.1 future_1.24.0 withr_2.5.0 [13] colorspace_2.0-3 highr_0.9 rstudioapi_0.13 listenv_0.8.0 [17] Rdpack_2.3 labeling_0.4.2 GenomeInfoDbData_1.2.7 bit64_4.0.5 [21] farver_2.1.0 rhdf5_2.38.1 parallelly_1.30.0 vctrs_0.3.8 [25] treeio_1.18.1 generics_0.1.2 ipred_0.9-12 xfun_0.30 [29] R6_2.5.1 doParallel_1.0.17 ggbeeswarm_0.6.0 clue_0.3-60 [33] rsvd_1.0.5 bitops_1.0-7 rhdf5filters_1.6.0 microbiome_1.16.0 [37] cachem_1.0.6 gridGraphics_0.5-1 DelayedArray_0.20.0 assertthat_0.2.1 [41] nnet_7.3-17 beeswarm_0.4.0 gtable_0.3.0 beachmat_2.10.0 [45] Cairo_1.5-15 globals_0.14.0 timeDate_3043.102 rlang_1.0.2 [49] GlobalOptions_0.1.2 splines_4.1.2 lazyeval_0.2.2 ModelMetrics_1.2.2.2 [53] yaml_2.3.5 reshape2_1.4.4 tools_4.1.2 lava_1.6.10 [57] ellipsis_0.3.2 decontam_1.14.0 jquerylib_0.1.4 biomformat_1.22.0 [61] RColorBrewer_1.1-2 proxy_0.4-26 Rcpp_1.0.8.3 plyr_1.8.7 [65] sparseMatrixStats_1.6.0 zlibbioc_1.40.0 purrr_0.3.4 RCurl_1.98-1.6 [69] rpart_4.1-15 GetoptLong_1.0.5 cowplot_1.1.1 cluster_2.1.2 [73] DECIPHER_2.22.0 magrittr_2.0.2 data.table_1.14.2 evaluate_0.15 [77] gridExtra_2.3 shape_1.4.6 compiler_4.1.2 tibble_3.1.6 [81] crayon_1.5.1 htmltools_0.5.2 mgcv_1.8-38 tidyr_1.2.0 [85] ANCOMBC_1.4.0 lubridate_1.8.0 DBI_1.1.2 MASS_7.3-56 [89] Matrix_1.4-0 ade4_1.7-18 cli_3.2.0 rbibutils_2.2.7 [93] parallel_4.1.2 gower_1.0.0 igraph_1.2.11 pkgconfig_2.0.3 [97] recipes_0.1.17 foreach_1.5.2 vipor_0.4.5 bslib_0.3.1 [101] DirichletMultinomial_1.36.0 multtest_2.50.0 prodlim_2019.11.13 yulab.utils_0.0.4 [105] digest_0.6.29 tidytree_0.3.7 DelayedMatrixStats_1.16.0 rjson_0.2.21 [109] nloptr_1.2.2.3 lifecycle_1.0.1 nlme_3.1-157 jsonlite_1.8.0 [113] Rhdf5lib_1.16.0 BiocNeighbors_1.12.0 fansi_1.0.3 pillar_1.7.0 [117] fastmap_1.1.0 survival_3.2-13 glue_1.6.2 png_0.1-7 [121] iterators_1.0.14 bit_4.0.4 class_7.3-20 stringi_1.7.6 [125] sass_0.4.1 blob_1.2.2 BiocSingular_1.10.0 memoise_2.0.1 [129] e1071_1.7-9 irlba_2.3.5 future.apply_1.8.1 ape_5.6-2-Tuomas