MarioniLab / DropletUtils

Clone of the Bioconductor repository for the DropletUtils package.
https://bioconductor.org/packages/devel/bioc/html/DropletUtils.html
54 stars 27 forks source link

read10xCounts(..., type = "HDF5") doesn't produce TENxMatrix? #15

Open PeteHaitch opened 5 years ago

PeteHaitch commented 5 years ago

Hi Aaron, I used read10xCounts(..., type = "HDF5") to load a HDF5 file from CellRanger v3. I was surprised to get an error when I then ran scater::calculateQCMetrics() on the result. Ultimately, I think the issue is that read10xCounts(..., type = "HDF5") doesn't return the counts matrix as a 'pristine' TENxMatrix. I encountered this using v1.4.1 of DropletUtils and can't test on devel right now, but I didn't see any recent changes around this behaviour. This example hopefully illustrates the issue. What do you think? Thanks!

suppressPackageStartupMessages(library(DropletUtils))
example("write10xCounts", echo = FALSE)
#> 
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:S4Vectors':
#> 
#>     expand
tmpfile <- tempfile(fileext = ".h5")
write10xCounts(tmpfile, my.counts, gene.id=gene.ids, 
               gene.symbol=gene.symb, barcodes=cell.ids,
               type="HDF5", version="3")
sce <- read10xCounts(tmpfile)
# This error was the first indication something wasn't as I expected.
scater::calculateQCMetrics(sce)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
#> Error in (function (exprs_mat, start, end, all_feature_sets, all_cell_sets, : Evaluation error: netSubsetAndAperm() is not supported on a DelayedArray object with
#>   multiple seeds at the moment. Note that you can check the number
#>   of seeds with nseed()..
# I expected this to be a TENxMatrix.
class(counts(sce, withDimnames = FALSE))
#> [1] "DelayedMatrix"
#> attr(,"package")
#> [1] "DelayedArray"
# It almost is, but what's with the delayed ops in here?
str(counts(sce, withDimnames = FALSE))
#> Formal class 'DelayedMatrix' [package "DelayedArray"] with 1 slot
#>   ..@ seed:Formal class 'DelayedAbind' [package "DelayedArray"] with 2 slots
#>   .. .. ..@ along: int 2
#>   .. .. ..@ seeds:List of 1
#>   .. .. .. ..$ :Formal class 'DelayedDimnames' [package "DelayedArray"] with 2 slots
#>   .. .. .. .. .. ..@ dimnames:List of 2
#>   .. .. .. .. .. .. ..$ : int -1
#>   .. .. .. .. .. .. ..$ : NULL
#>   .. .. .. .. .. ..@ seed    :Formal class 'TENxMatrixSeed' [package "HDF5Array"] with 5 slots
#>   .. .. .. .. .. .. .. ..@ filepath  : chr "/tmp/RtmpDw8AyK/file28625fac52e3.h5"
#>   .. .. .. .. .. .. .. ..@ group     : chr "matrix"
#>   .. .. .. .. .. .. .. ..@ dim       : int [1:2] 100 10
#>   .. .. .. .. .. .. .. ..@ dimnames  :List of 2
#>   .. .. .. .. .. .. .. .. ..$ : NULL
#>   .. .. .. .. .. .. .. .. ..$ : chr [1:10] "BARCODE-1" "BARCODE-2" "BARCODE-3" "BARCODE-4" ...
#>   .. .. .. .. .. .. .. ..@ col_ranges:'data.frame':  10 obs. of  2 variables:
#>   .. .. .. .. .. .. .. .. ..$ start: num [1:10] 1 101 200 298 397 496 595 693 791 891
#>   .. .. .. .. .. .. .. .. ..$ width: int [1:10] 100 99 98 99 99 99 98 98 100 100

# Forcing the assay to be a TENxMatrix means scater::calculateQCMetrics() works.
counts(sce, withDimnames = FALSE) <- as(my.counts, "TENxMatrix")
scater::calculateQCMetrics(sce)
#> class: SingleCellExperiment 
#> dim: 100 10 
#> metadata(0):
#> assays(1): counts
#> rownames(100): ENSG00001 ENSG00002 ... ENSG000099 ENSG0000100
#> rowData names(10): ID Symbol ... total_counts log10_total_counts
#> colnames: NULL
#> colData names(11): Sample Barcode ...
#>   pct_counts_in_top_200_features pct_counts_in_top_500_features
#> reducedDimNames(0):
#> spikeNames(0):

Created on 2019-06-15 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.0 (2019-04-26) #> os Ubuntu 18.04.2 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_AU:en #> collate en_AU.UTF-8 #> ctype en_AU.UTF-8 #> tz Australia/Melbourne #> date 2019-06-15 #> #> ─ Packages ────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [3] CRAN (R 3.5.3) #> backports 1.1.4 2019-04-10 [3] CRAN (R 3.5.3) #> beachmat 2.0.0 2019-05-02 [1] Bioconductor #> beeswarm 0.2.3 2016-04-25 [1] CRAN (R 3.6.0) #> Biobase * 2.44.0 2019-05-02 [1] Bioconductor #> BiocGenerics * 0.30.0 2019-05-02 [1] Bioconductor #> BiocNeighbors 1.2.0 2019-05-02 [1] Bioconductor #> BiocParallel * 1.18.0 2019-05-03 [1] Bioconductor #> BiocSingular 1.0.0 2019-05-02 [1] Bioconductor #> bitops 1.0-6 2013-08-17 [3] CRAN (R 3.5.0) #> callr 3.2.0 2019-03-15 [3] CRAN (R 3.5.3) #> cli 1.1.0 2019-03-19 [3] CRAN (R 3.5.3) #> colorspace 1.4-1 2019-03-18 [3] CRAN (R 3.5.3) #> crayon 1.3.4 2017-09-16 [3] CRAN (R 3.5.0) #> DelayedArray * 0.10.0 2019-05-02 [1] Bioconductor #> DelayedMatrixStats 1.6.0 2019-05-02 [1] Bioconductor #> desc 1.2.0 2018-05-01 [3] CRAN (R 3.5.0) #> devtools 2.0.2 2019-04-08 [1] CRAN (R 3.6.0) #> digest 0.6.19 2019-05-20 [3] CRAN (R 3.6.0) #> dplyr 0.8.1 2019-05-14 [3] CRAN (R 3.6.0) #> dqrng 0.2.1 2019-05-17 [1] CRAN (R 3.6.0) #> DropletUtils * 1.4.1 2019-05-31 [1] Bioconductor #> edgeR 3.26.4 2019-05-27 [1] Bioconductor #> evaluate 0.14 2019-05-28 [3] CRAN (R 3.6.0) #> fs 1.3.1 2019-05-06 [3] CRAN (R 3.6.0) #> GenomeInfoDb * 1.20.0 2019-05-02 [1] Bioconductor #> GenomeInfoDbData 1.2.1 2019-05-07 [1] Bioconductor #> GenomicRanges * 1.36.0 2019-05-02 [1] Bioconductor #> ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 3.6.0) #> ggplot2 3.1.1 2019-04-07 [3] CRAN (R 3.5.3) #> glue 1.3.1 2019-03-12 [3] CRAN (R 3.5.3) #> gridExtra 2.3 2017-09-09 [1] CRAN (R 3.6.0) #> gtable 0.3.0 2019-03-25 [3] CRAN (R 3.5.3) #> HDF5Array 1.12.1 2019-05-07 [1] Bioconductor #> highr 0.8 2019-03-20 [3] CRAN (R 3.5.3) #> htmltools 0.3.6 2017-04-28 [3] CRAN (R 3.5.0) #> IRanges * 2.18.1 2019-05-31 [1] Bioconductor #> irlba 2.3.3 2019-02-05 [1] CRAN (R 3.6.0) #> knitr 1.23 2019-05-18 [3] CRAN (R 3.6.0) #> lattice 0.20-38 2018-11-04 [4] CRAN (R 3.5.1) #> lazyeval 0.2.2 2019-03-15 [3] CRAN (R 3.5.3) #> limma 3.40.2 2019-05-17 [1] Bioconductor #> locfit 1.5-9.1 2013-04-20 [1] CRAN (R 3.6.0) #> magrittr 1.5 2014-11-22 [3] CRAN (R 3.5.0) #> Matrix * 1.2-17 2019-03-22 [4] CRAN (R 3.5.3) #> matrixStats * 0.54.0 2018-07-23 [1] CRAN (R 3.6.0) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0) #> munsell 0.5.0 2018-06-12 [3] CRAN (R 3.5.0) #> pillar 1.4.1 2019-05-28 [3] CRAN (R 3.6.0) #> pkgbuild 1.0.3 2019-03-20 [3] CRAN (R 3.5.3) #> pkgconfig 2.0.2 2018-08-16 [3] CRAN (R 3.5.1) #> pkgload 1.0.2 2018-10-29 [3] CRAN (R 3.5.1) #> plyr 1.8.4 2016-06-08 [3] CRAN (R 3.5.0) #> prettyunits 1.0.2 2015-07-13 [3] CRAN (R 3.5.0) #> processx 3.3.1 2019-05-08 [3] CRAN (R 3.6.0) #> ps 1.3.0 2018-12-21 [3] CRAN (R 3.5.2) #> purrr 0.3.2 2019-03-15 [3] CRAN (R 3.5.3) #> R.methodsS3 1.7.1 2016-02-16 [1] CRAN (R 3.6.0) #> R.oo 1.22.0 2018-04-22 [1] CRAN (R 3.6.0) #> R.utils 2.9.0 2019-06-13 [1] CRAN (R 3.6.0) #> R6 2.4.0 2019-02-14 [3] CRAN (R 3.5.2) #> Rcpp 1.0.1 2019-03-17 [3] CRAN (R 3.5.3) #> RCurl 1.95-4.12 2019-03-04 [3] CRAN (R 3.5.2) #> remotes 2.0.4 2019-04-10 [1] CRAN (R 3.6.0) #> rhdf5 2.28.0 2019-05-02 [1] Bioconductor #> Rhdf5lib 1.6.0 2019-05-02 [1] Bioconductor #> rlang 0.3.4 2019-04-07 [3] CRAN (R 3.5.3) #> rmarkdown 1.13 2019-05-22 [3] CRAN (R 3.6.0) #> rprojroot 1.3-2 2018-01-03 [3] CRAN (R 3.5.3) #> rsvd 1.0.1 2019-06-02 [1] CRAN (R 3.6.0) #> S4Vectors * 0.22.0 2019-05-02 [1] Bioconductor #> scales 1.0.0 2018-08-09 [3] CRAN (R 3.5.1) #> scater 1.12.2 2019-05-24 [1] Bioconductor #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) #> SingleCellExperiment * 1.6.0 2019-05-02 [1] Bioconductor #> stringi 1.4.3 2019-03-12 [3] CRAN (R 3.5.3) #> stringr 1.4.0 2019-02-10 [3] CRAN (R 3.5.2) #> SummarizedExperiment * 1.14.0 2019-05-02 [1] Bioconductor #> testthat 2.1.1 2019-04-23 [1] CRAN (R 3.6.0) #> tibble 2.1.2 2019-05-29 [3] CRAN (R 3.6.0) #> tidyselect 0.2.5 2018-10-11 [3] CRAN (R 3.5.1) #> usethis 1.5.0 2019-04-07 [1] CRAN (R 3.6.0) #> vipor 0.4.5 2017-03-22 [1] CRAN (R 3.6.0) #> viridis 0.5.1 2018-03-29 [1] CRAN (R 3.6.0) #> viridisLite 0.3.0 2018-02-01 [3] CRAN (R 3.5.0) #> withr 2.1.2 2018-03-15 [3] CRAN (R 3.5.0) #> xfun 0.7 2019-05-14 [3] CRAN (R 3.6.0) #> XVector 0.24.0 2019-05-02 [1] Bioconductor #> yaml 2.2.0 2018-07-25 [3] CRAN (R 3.5.1) #> zlibbioc 1.30.0 2019-05-02 [1] Bioconductor #> #> [1] /home/peter/R/x86_64-pc-linux-gnu-library/3.6 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```
LTLA commented 5 years ago

The two ops are a cbind and rownames<-. The former is done even when there is only one matrix, to avoid having multiple code paths. The latter is done for obvious reasons.

Both could be skipped, I guess.

PeteHaitch commented 5 years ago

Do you see this as an issue for DropletUtils, scater, or DelayedArray?

LTLA commented 5 years ago

It is a DelayedArray and TENxMatrix issue for me. There are multiple levels, though. Starting at the base; it seems like it would be sensible for cbind on DelayedMatrix instances to be smart enough to avoid wrapping things in an Abind if there's only one member. A no-op seems fairly reasonable here.

A <- DelayedArray(matrix(1,10,10))
showtree(cbind(A))
## 10x10 double: DelayedMatrix object
## └─ 10x10 double: Abind (along=2)  <- don't want this guy.
##    └─ 10x10 double: [seed] matrix object

The error observed in scater arises from beachmat, which attempts to call netSubsetAndPerm on the DA'd TENxMatrix. This fails for the stated reason - fair enough - but it only got that far because:

contentIsPristine(cbind(A)) # TRUE. Well, fair enough.
contentIsPristine(cbind(A, A)) # Still TRUE, but it's got a content-modifying delayed op!

The first TRUE would make sense if cbind(A) was truly a no-op. The second TRUE doesn't make sense, and should be FALSE, which would have allowed beachmat (and thus scater) to proceed.

The second orthogonal issue is that dimnames<- on a TENxMatrix should be able to directly modify the dimension names in the seed, rather than wrapping things in a DelayedDimnames. Yes, I know this means that the dimension names in the seed are now different from the dimension names in the file... but if I wanted the names in the file, I would just read them in again, so it's not a big deal to me.

The fact that dimnames<- wraps things in a delayed op is pretty annoying in many contexts. I've got a lot of code that fiddles with dimnames routinely, which means that things collapse into DAs at the slightest provocation; this could be problematic in the future if you have classes with specialized algorithms that avoid block-processing (e.g., BiocSingular's ResidualMatrix for %*%).

LTLA commented 5 years ago

Better rope in @hpages.

hpages commented 5 years ago

Hi guys,

I agree that unary cbind() and rbind() should be simplified. Just did that: https://github.com/Bioconductor/DelayedArray/commit/9198e3159e005d155ad176f0606ce051780bb7f4

contentIsPristine() means that the content of the seeds (i.e. their array values) has not been touched by the delayed operations. So the following behaves as expected because the values you see when displaying the object are the unmodified values from the seed(s):

> library(DelayedArray)
> m1 <- matrix(runif(150), nrow=15, ncol=10)
> M1 <- DelayedArray(m1)
> contentIsPristine(M1[-1,])
[1] TRUE
> contentIsPristine(t(M1[-1,]))
[1] TRUE
> contentIsPristine(cbind(M1, M1))
[1] TRUE
> contentIsPristine(t(cbind(M1, M1)))
[1] TRUE

These are not pristine DelayedArray objects:

> contentIsPristine(M1 + 1)
[1] FALSE
> contentIsPristine(t(M1) + 1)
[1] FALSE

Note that contentIsPristine() sometimes produces false negatives:

> contentIsPristine(M1 + 0)
[1] FALSE

If you want a more stringent test that also checks that the array values in DelayedArray object x are aligned with the values in its seed (i.e. the object has 1 seed and carries no subsetting or permutation of the dimensions), you could use contentIsPristine(x) && is_aligned_with_seed(x) where is_aligned_with_seed() is the helper function defined in the examples section of ?netSubsetAndAperm. (I believe that this test is actually equivalent to checking whether the simplified tree of delayed ops reduces to at most a single "set the dimnames" op i.e. to a single DelayedDimnames node. Would need to double-check that though.)

Anyway, @LTLA I have a feeling that you have a use case for this test. I could rename the current DelayedArray:::is_pristine() -> isPristine() and export it, and add the ignore.dimnames arg to it. ignore.dimnames would be false by default. When set to TRUE, it would check pristineness modulo the dimnames. Would that work?

hpages commented 5 years ago

Also I'm open to suggestions for a better name for contentIsPristine(). untouchedValues()? untouchedArrayValues()? valuesAreUntouched()? arrayValuesAreUntouched()? delayedOpsPreserveTheOriginalArrayValues()? something else?

Or should we have a function that does the opposite e.g. valuesAreTouched()? arrayValuesAreTouched()? delayedOpsTouchTheArrayValues()? etc...

Thanks!

LTLA commented 5 years ago

Having thought about it, I'm happy for contentIsPristine() to return TRUE in the case of combined arrays. This does, however, require beachmat to use is_aligned_with_seed() to avoid allowing bind'ed arrays to sneak through to the netPermAndSubset() call in setupDelayedMatrix() (upon which it fails). I can do this.

On the dimnames setting; this is not a problem for beachmat, which is smart enough to see through the delayed op and avoid block processing. I'm more worried about R-level code - imagine a developer is careful enough to write a DA backend that specializes operations like matrix multiplication for efficiency, but fails to write a specialized dimnames<- method. This means that their backend collapses to a DA when anyone sets dimnames on it (which would be fairly often), causing the matrix multiplication to switch back to using block processing.

It is for this reason that all BiocSingular matrix classes overwrite dimnames<-. Perhaps this should be a recommendation for all DA backends.

LTLA commented 5 years ago

Turns out I was already doing the check with nseed(). Anyway, I think the beachmat code should now work, as the cbind is now a no-op. The names still introduce a DelayedDimnames wrapper but beachmat should be able to see past that.

hpages commented 5 years ago

If we do something like this:

setMethod("%*%", c("DelayedMatrix", "DelayedMatrix"),
    function(x, y)
    {
        if (!(hasDelayedDimnames(x) || hasDelayedDimnames(y)))
            return(.BLOCK_matrix_mult(x, y))
        x0 <- dropDelayedDimnames(x)
        y0 <- dropDelayedDimnames(y)
        ans <- x0 %*% y0  # Will dispatch on specialized method if any, or call this
                          # method again (infinite recursion is avoided because
                          # hasDelayedDimnames() will now be FALSE on both x0 and y0).
        dimnames(ans) <- list(rownames(x), colnames(y))
        ans
    }
)

then there is no need for DeferredMatrix or any DA backend to overwrite dimnames<-.

The seed represents a read-only dataset (everything in the dataset is read-only, including the dimnames) so modifying the dimnames at the seed level is a violation of that. The consequence typically being that the seed is no longer in sync with the dataset that it represents so things like DelayedArray:::is_pristine(x) and as(x, "Some_DA_Backend") no longer do the right thing (the former returns FALSE positives and the latter is a no-op when it should trigger realization).

I'll modify %*% to do something like the above. All block-processed methods (max(), colSums(), etc...) could actually be modified to call the backend-specific method if the only delayed op carried by the supplied DelayedArray object is a DelayedDimnames. DA backend authors should definitely be aware of that so 02-Implementing_a_backend.Rmd will need to be updated.

LTLA commented 5 years ago

The seed represents a read-only dataset (everything in the dataset is read-only, including the dimnames) so modifying the dimnames at the seed level is a violation of that.

Is that a policy for all DA backends, or just for file-backed backends? In some of the BiocSingular backends, I have specialized some operations on the *Matrix class to modify their *MatrixSeed directly. This is all in memory so it's cheap for me, and it's convenient as it means that the *Matrix does not collapse to a DA (and thus use block processing) during routine manipulations inside the package.

To give a concrete example, I do a lot of transpositions in BiocSingular. To avoid everything getting wrapped in Aperms when the same code is used on one of my *Matrix instances, I have a specialized t,*Matrix-method that simply flips a transposed flag inside the seed. From what you're describing, this sounds like a breach of the DA backend contract, though I'm not sure that it has any important practical consequences because there's no file to make the lack of delayed operations seem contradictory.

hpages commented 5 years ago

Yeah the "never touch the seed" principle in DelayedArray is a central one. No matter what operations are performed on M, the seeds of the result should always be identical to the original seeds e.g. seed(t(M)) should always be identical to seed(M). This is regardless of the kind of seed (DelayedArray makes no distinction between in-memory and on-disk seeds). DelayedArray was designed on top of that principle and some of its code actually assumes that the seed is never touched.

You might be fine for now with BiocSingular, hard to know, but it's a risky game.

I will emphasize the "never touch the seed" principle in 02-Implementing_a_backend.Rmd.

LTLA commented 5 years ago

Hmmmm. Is there any way to divert %*% from block-processing if the only delayed operations are subsetting or transposition? If so, I could make the seed immutable without loss of efficiency.

For example, transposition could just dispatch to crossprod(), tcrossprod() etc. on the untransposed matrices. And we could just have some other generic like subsetSeed() that operates strictly internal to the scope of this function and creates a "natively subsetted" *Matrix instance if specialized to do so.

hpages commented 5 years ago

Is there any way to divert %*% from block-processing if the only delayed operations are subsetting or transposition?

I hope so.

In the meantime, I've added the "Implementing optimized backend-specific methods" subsection to 02-Implementing_a_backend.Rmd (with a note about the "never touch the seed" principle): https://github.com/Bioconductor/DelayedArray/blob/eb478be25494b1e64232ede392bc3776415bd4da/vignettes/02-Implementing_a_backend.Rmd#L305-L341