Closed LTLA closed 3 years ago
Hmm... seems like we are lacking support for the lzf filter at the moment. Using cheng18.processed.h5ad
downloaded from here:
> h5read("cheng18.processed.h5ad", "/X/data", index=list(1:10))
Error: Unable to read dataset.
Not all required filters available.
Missing filters: lzf
Error: Error in h5checktype(). H5Identifier not valid.
@grimbough What are the plans for h5py LZF? Is it going to make it to rhdf5filters?
In the meantime, I've just been using files created using whatever h5py's default filters are:
library(scRNAseq)
sce <- ZeiselBrainData()
counts(sce) <- as(counts(sce), "dgCMatrix")
library(zellkonverter)
writeH5AD(sce, file="blah.h5ad")
Just to be clear: most H5AD files don't seem to use these external filters, they're just using the standard ones (probably DEFLATE). So we should still be able to proceed with these classes and get something useable for most people.
It's funny but the first H5AD file I tried had the LZF filter on it. Must be my usual good luck ;-)
Anyways I just added support for LZF compression to HDF5Array 1.19.2 (still need to test this on Windows). Once LZF makes its way to rhdf5filters, it'll be easy for me to pick it up from there.
I'll work on the H5ADMatrixSeed/H5ADMatrix classes after lunch.
Okay, thanks. Do you want H5ADMatrix
to live in HDF5Array? These classes are pretty application-specific; it may make more sense for, e.g., some virtual base class containing 99% of the code to live in HDF5Array while application-specific packages contain the concrete subclasses.
This strategy would preserve the generality of the HDF5Array package, while downstream applications would be in charge of making sure everything works with specific formats. For example, DropletUtils already has a significant testing set-up to make sure that TENxMatrix
works with all of the old and new Cellranger output formats.
It makes sense to have H5ADMatrix next to TENxMatrix.
Also, the H5ADMatrix
is stored as a CSR matrix, as the AnnData infrastructure operates with a cells-as-rows paradigm. It would have to be transposed to truly be comparable to a TENxMatrix
; we were planning to handle this at the zellkonverter level.
R already has the notions of rows and cols flipped so its natural to read these h5ad files as already transposed.
@grimbough What are the plans for h5py LZF? Is it going to make it to rhdf5filters?
LZF support is now in rhdf5filters, should propagate in a few days.
> rhdf5::h5version()
This is Bioconductor rhdf5 2.35.2 linking to C-library HDF5 1.10.7 and rhdf5filters 1.3.4
> h5read("~/Downloads/cheng18.processed.h5ad", "/X/data", index=list(1:10))
[1] 0.6057951 0.6057951 0.6057951 0.7852854 0.7852854 0.9670818 0.6607942 0.6607942
[9] 0.6607942 0.8280389
That should add both read and write support if it's ever needed.
Thx @grimbough ! I'll drop my copy of the LZF code from HDF5Array.
@LTLA : Basic support for H5ADMatrix objects is now in HDF5Array 1.19.4. Still needs a few more examples and some unit tests. Let me know how it works for you.
Thanks @hpages. To set the context, I'm testing it on the mock dataset described in theislab/zellkonverter#37, i.e.:
library(scRNAseq)
out <- ZeiselBrainData()
counts(out) <- as(counts(out), "dgCMatrix")
library(zellkonverter)
writeH5AD(out, file="stuff.h5")
The first point is that, while X
is the default name, it's not the only possible group path. The H5AD format also have layers
, akin to SE's assays
, where each layer is a separate assay. This would be easily supported by allowing the H5ADMatrix
constructor to accept a path in the same manner that HDF5Array()
and TENxMatrix()
do.
Secondly, for the file generated above, the shape attributes are not named h5sparse_shape
, but instead shape
. zellkonverter generates these files via a reasonably up-to-date Python interface, so is as close to the truth as can be; I assume you also got your test H5AD files from some trusted source, which suggests that there was some kind of change to the format at some point. Guess you could just check for one or the other and just use whatever's there.
Thx, I'll implement these improvements (should be easy). Since I didn't manage to put my hands on many .h5ad
files yet, and since the .h5ad
specs are kind of minimalist (to say the least), that kind of feedback is useful. Reverse engineering can be fun but is only what it is: a hack.
Still need to test the H5ADMatrix stuff on all kinds of .h5ad
files to polish all these rough edges.
Done in HDF5Array 1.19.5. Shouldn't writeH5AD()
write the dgCMatrix assay in csr format instead of csc? The matrix gets transposed when written to the .h5ad
file so csr in R should become csc in the file.
Just tested it, looks good to me.
Shouldn't
writeH5AD()
write the dgCMatrix assay in csr format instead of csc?
¯\_(ツ)_/¯. This is handled by the Python interface. All I can say is that we manually transpose the matrix when we drag it from R into Python, otherwise the Python-side object (their equivalent to our SE) cannot be properly constructed. This manual transposition is still a CSC matrix; it's just that the Python-side columns are equivalent to the R-side rows.
I see. Looks like transposing a dgCMatrix could have returned a dgRMatrix instead of a dgCMatrix. This would avoid the cost of re-computing the row indices:
library(Matrix)
transpose_dgCMatrix_to_dgRMatrix <- function(x)
{
stopifnot(is(x, "dgCMatrix"))
## Nothing to compute, just moving the slot values around!
new("dgRMatrix", p=x@p, j=x@i, Dim=rev(x@Dim), Dimnames=rev(x@Dimnames), x=x@x)
}
m0 <- rsparsematrix(1e6, 1e3, 0.1)
system.time(m1 <- t(m0))
# user system elapsed
# 3.647 0.704 4.351
class(m1)
# [1] "dgCMatrix"
# attr(,"package")
[1] "Matrix"
dim(m1)
# [1] 1000 1000000
system.time(m2 <- transpose_dgCMatrix_to_dgRMatrix(m0))
# user system elapsed
# 0.708 0.000 0.709
class(m2)
# [1] "dgRMatrix"
# attr(,"package")
# [1] "Matrix"
dim(m2)
# [1] 1000 1000000
So I'm going to close this. If any problem with the H5ADMatrix
/H5ADMatrixSeed stuff
, please open a new issue.
From theislab/zellkonverter#37:
It seems that the
TENxMatrix
almost works with AnnData's sparse HDF5 format. Only a small modification is required, namely:And then it all works, with an extra transposition (because they store it row-based). Would it be possible to convert some of these internals into generics so that I can create a
TENxMatrix
subclass to handle these sparse AnnData datasets?Might even make sense to have a more generally named virtual base class, e.g.,
CompressedSparseHDF5Matrix
. And then subclasses could define generics to provide the paths to the elements, re-using the rest of the existingTENxMatrix
machinery to do all the data extraction.