Extending TENxMatrix to work on variants of HDF5-backed storage

LTLA commented 3 years ago

From theislab/zellkonverter#37:

It seems that the TENxMatrix almost works with AnnData's sparse HDF5 format. Only a small modification is required, namely:

.get_tenx_shape <- function(filepath, group) {
    as.integer(h5readAttributes(filepath, group)$shape)
 #   .read_tenx_component(filepath, group, "shape")
}

And then it all works, with an extra transposition (because they store it row-based). Would it be possible to convert some of these internals into generics so that I can create a TENxMatrix subclass to handle these sparse AnnData datasets?

Might even make sense to have a more generally named virtual base class, e.g., CompressedSparseHDF5Matrix. And then subclasses could define generics to provide the paths to the elements, re-using the rest of the existing TENxMatrix machinery to do all the data extraction.

hpages commented 3 years ago

Hmm... seems like we are lacking support for the lzf filter at the moment. Using cheng18.processed.h5ad downloaded from here:

> h5read("cheng18.processed.h5ad", "/X/data", index=list(1:10))
Error: Unable to read dataset.
Not all required filters available.
Missing filters: lzf
Error: Error in h5checktype(). H5Identifier not valid.

@grimbough What are the plans for h5py LZF? Is it going to make it to rhdf5filters?

LTLA commented 3 years ago

In the meantime, I've just been using files created using whatever h5py's default filters are:

library(scRNAseq)
sce <- ZeiselBrainData()
counts(sce) <- as(counts(sce), "dgCMatrix")

library(zellkonverter)
writeH5AD(sce, file="blah.h5ad")

LTLA commented 3 years ago

Just to be clear: most H5AD files don't seem to use these external filters, they're just using the standard ones (probably DEFLATE). So we should still be able to proceed with these classes and get something useable for most people.

hpages commented 3 years ago

It's funny but the first H5AD file I tried had the LZF filter on it. Must be my usual good luck ;-)

Anyways I just added support for LZF compression to HDF5Array 1.19.2 (still need to test this on Windows). Once LZF makes its way to rhdf5filters, it'll be easy for me to pick it up from there.

I'll work on the H5ADMatrixSeed/H5ADMatrix classes after lunch.

LTLA commented 3 years ago

Okay, thanks. Do you want H5ADMatrix to live in HDF5Array? These classes are pretty application-specific; it may make more sense for, e.g., some virtual base class containing 99% of the code to live in HDF5Array while application-specific packages contain the concrete subclasses.

This strategy would preserve the generality of the HDF5Array package, while downstream applications would be in charge of making sure everything works with specific formats. For example, DropletUtils already has a significant testing set-up to make sure that TENxMatrix works with all of the old and new Cellranger output formats.

hpages commented 3 years ago

It makes sense to have H5ADMatrix next to TENxMatrix.

LTLA commented 3 years ago

Also, the H5ADMatrix is stored as a CSR matrix, as the AnnData infrastructure operates with a cells-as-rows paradigm. It would have to be transposed to truly be comparable to a TENxMatrix; we were planning to handle this at the zellkonverter level.

hpages commented 3 years ago

R already has the notions of rows and cols flipped so its natural to read these h5ad files as already transposed.

grimbough commented 3 years ago

@grimbough What are the plans for h5py LZF? Is it going to make it to rhdf5filters?

LZF support is now in rhdf5filters, should propagate in a few days.

> rhdf5::h5version()
This is Bioconductor rhdf5 2.35.2 linking to C-library HDF5 1.10.7 and rhdf5filters 1.3.4
> h5read("~/Downloads/cheng18.processed.h5ad", "/X/data", index=list(1:10))
 [1] 0.6057951 0.6057951 0.6057951 0.7852854 0.7852854 0.9670818 0.6607942 0.6607942
 [9] 0.6607942 0.8280389

That should add both read and write support if it's ever needed.

hpages commented 3 years ago

Thx @grimbough ! I'll drop my copy of the LZF code from HDF5Array.

@LTLA : Basic support for H5ADMatrix objects is now in HDF5Array 1.19.4. Still needs a few more examples and some unit tests. Let me know how it works for you.

LTLA commented 3 years ago

Thanks @hpages. To set the context, I'm testing it on the mock dataset described in theislab/zellkonverter#37, i.e.:

library(scRNAseq)
out <- ZeiselBrainData()
counts(out) <- as(counts(out), "dgCMatrix")

library(zellkonverter)
writeH5AD(out, file="stuff.h5")

The first point is that, while X is the default name, it's not the only possible group path. The H5AD format also have layers, akin to SE's assays, where each layer is a separate assay. This would be easily supported by allowing the H5ADMatrix constructor to accept a path in the same manner that HDF5Array() and TENxMatrix() do.

Secondly, for the file generated above, the shape attributes are not named h5sparse_shape, but instead shape. zellkonverter generates these files via a reasonably up-to-date Python interface, so is as close to the truth as can be; I assume you also got your test H5AD files from some trusted source, which suggests that there was some kind of change to the format at some point. Guess you could just check for one or the other and just use whatever's there.

hpages commented 3 years ago

Thx, I'll implement these improvements (should be easy). Since I didn't manage to put my hands on many .h5ad files yet, and since the .h5ad specs are kind of minimalist (to say the least), that kind of feedback is useful. Reverse engineering can be fun but is only what it is: a hack.

Still need to test the H5ADMatrix stuff on all kinds of .h5ad files to polish all these rough edges.

hpages commented 3 years ago

Done in HDF5Array 1.19.5. Shouldn't writeH5AD() write the dgCMatrix assay in csr format instead of csc? The matrix gets transposed when written to the .h5ad file so csr in R should become csc in the file.

LTLA commented 3 years ago

Just tested it, looks good to me.

Shouldn't writeH5AD() write the dgCMatrix assay in csr format instead of csc?

¯\_(ツ)_/¯. This is handled by the Python interface. All I can say is that we manually transpose the matrix when we drag it from R into Python, otherwise the Python-side object (their equivalent to our SE) cannot be properly constructed. This manual transposition is still a CSC matrix; it's just that the Python-side columns are equivalent to the R-side rows.

hpages commented 3 years ago

I see. Looks like transposing a dgCMatrix could have returned a dgRMatrix instead of a dgCMatrix. This would avoid the cost of re-computing the row indices:

library(Matrix)

transpose_dgCMatrix_to_dgRMatrix <- function(x)
{
    stopifnot(is(x, "dgCMatrix"))
    ## Nothing to compute, just moving the slot values around!
    new("dgRMatrix", p=x@p, j=x@i, Dim=rev(x@Dim), Dimnames=rev(x@Dimnames), x=x@x)
}

m0 <- rsparsematrix(1e6, 1e3, 0.1)

system.time(m1 <- t(m0))
#    user  system elapsed 
#   3.647   0.704   4.351 

class(m1)
# [1] "dgCMatrix"
# attr(,"package")
[1] "Matrix"

dim(m1)
# [1]    1000 1000000

system.time(m2 <- transpose_dgCMatrix_to_dgRMatrix(m0))
#    user  system elapsed 
#   0.708   0.000   0.709 

class(m2)
# [1] "dgRMatrix"
# attr(,"package")
# [1] "Matrix"

dim(m2)
# [1]    1000 1000000

hpages commented 3 years ago

So I'm going to close this. If any problem with the H5ADMatrix/H5ADMatrixSeed stuff, please open a new issue.

Bioconductor / HDF5Array

Extending TENxMatrix to work on variants of HDF5-backed storage #34