mojaveazure opened this issue 1 year ago
zellkonverter has some code for this; I use a variant of it for some in-house applications.
Hi @mojaveazure,
I suppose HDF5Array could provide an `H5ADRealizationSink` class similar to the existing `TENxRealizationSink` class. The latter can be used to write blocks of a sparse matrix to the 10x Genomics sparse format, and is used internally by `writeTENxMatrix()`.
I just want to emphasize that, like the other data-writing capabilities in HDF5Array, this new functionality would focus on writing the count data to disk, plus possibly the rownames (`/var/_index` in the h5ad file) and colnames (`/obs/_index` in the h5ad file) if present. This is because HDF5Array is meant to be a low-level package, so all the other things typically found in an h5ad file would need to be taken care of by higher-level functionality implemented somewhere else.
I'm not too familiar with the h5ad format, but is an h5ad file with only the `/X` group plus the `/obs/_index` and `/var/_index` datasets considered valid? This is what the user would end up with when writing to a new file.
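For reference, such a minimal file would look roughly like this (a sketch based on the AnnData on-disk format for sparse `X`; exact attribute names can differ between anndata versions):

```
/X              # group; attrs: encoding-type ("csr_matrix" or "csc_matrix"), shape
/X/data         # nonzero values
/X/indices      # row (CSC) or column (CSR) index of each nonzero
/X/indptr       # offsets delimiting each column (CSC) or row (CSR)
/obs/_index     # cell names (the colnames, in Bioconductor's orientation)
/var/_index     # feature names (the rownames)
```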
I was also going to mention zellkonverter, but I see that @LTLA just did this while I was typing my answer. I only knew about `zellkonverter::readH5AD()` and `zellkonverter::writeH5AD()` though, to read/write a SingleCellExperiment object from/to an h5ad file. But it seems that they also have low-level utilities for writing data by block, so maybe that's it :smile:
H.
IMO a more worthwhile functionality would be an `H5SparseRealizationSink` that could be re-used in higher-level packages like zellkonverter. Then HDF5Array just needs to concern itself with the block-wise deposition of CSC/CSR content into an HDF5 file, while higher-level packages like zellkonverter can worry about the formatting miscellany.
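To make "block-wise deposition of CSC content" concrete, here is a small illustrative sketch in plain Python (not HDF5Array code; `CSCAppendSink` and its methods are hypothetical names) of the bookkeeping such a sink would do as it receives one column block at a time:

```python
class CSCAppendSink:
    """Accumulate column blocks of a sparse matrix in CSC layout.

    A real sink would append to HDF5 datasets (data/indices/indptr)
    instead of in-memory lists, but the bookkeeping is the same.
    """

    def __init__(self, nrow):
        self.nrow = nrow
        self.data = []       # nonzero values, in column-major order
        self.indices = []    # row index of each nonzero
        self.indptr = [0]    # indptr[j+1] - indptr[j] == nnz in column j

    def write_block(self, block):
        # 'block' is a list of columns, each a list of length nrow;
        # blocks must arrive in left-to-right column order.
        for col in block:
            assert len(col) == self.nrow
            for i, v in enumerate(col):
                if v != 0:
                    self.data.append(v)
                    self.indices.append(i)
            self.indptr.append(len(self.data))


# Depositing a 3x3 matrix as two column blocks:
sink = CSCAppendSink(nrow=3)
sink.write_block([[1, 0, 2]])             # column 0
sink.write_block([[0, 0, 0], [3, 4, 0]])  # columns 1-2
```

Because each block only ever appends to `data`/`indices` and extends `indptr`, the sink never needs to revisit earlier columns, which is what makes block-wise writing to an on-disk format practical.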
Zellkonverter looks promising, but I don't think I can use it. Ideally, I want to write a function like this:
```r
function(mat, sink) {
    grid <- colAutoGrid(mat)  # split 'mat' into blocks of full columns
    for (i in seq_along(grid)) {
        vp <- grid[[i]]       # viewport of the i-th block
        x <- read_block(mat, vp, as.sparse=is_sparse(mat))
        # do some processing
        x <- f(x)
        write_block(sink, vp, x)
    }
    return(as(sink, "DelayedArray"))
}
```
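The same loop pattern, sketched in plain Python for illustration (the helpers here are stand-ins, not DelayedArray APIs):

```python
def col_auto_grid(ncol, block_ncol):
    """Cover columns 0..ncol with (start, end) ranges of at most block_ncol."""
    return [(j, min(j + block_ncol, ncol)) for j in range(0, ncol, block_ncol)]

def process_by_block(mat, out, f, block_ncol):
    # 'mat' and 'out' are lists of columns; each (start, end) range plays
    # the role of a viewport: read a block, transform it, write it back
    # to the sink at the same position.
    for start, end in col_auto_grid(len(mat), block_ncol):
        block = mat[start:end]                            # read_block()
        block = [[f(v) for v in col] for col in block]    # processing step
        out[start:end] = block                            # write_block()

# e.g. doubling every entry of a 2x3 matrix, two columns per block:
mat = [[1, 2], [3, 4], [5, 6]]   # three columns of length 2
out = [None] * len(mat)
process_by_block(mat, out, lambda v: 2 * v, block_ncol=2)
# out == [[2, 4], [6, 8], [10, 12]]
```

The point of the pattern is that the processing function only ever sees one block in memory, so the same driver works regardless of which sink (dense, sparse, HDF5, TileDB) sits behind the write step.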
And have it work with both H5AD and TileDB files, along with their respective sinks. I can use `HDF5RealizationSink` for H5AD files, but that writes a dense matrix on disk, which ends up being cumbersome to use in downstream steps. I have no need for DelayedArray/HDF5Array to handle anything other than `/X` or `/layers` from an H5AD file, and I agree that extra information should be handled by a higher-level package like zellkonverter.
As for whether an H5AD file with only `/X`, `/obs/_index`, and `/var/_index` is valid, it would appear so. In my experience, AnnData is pretty robust to incomplete H5AD files, and is capable of filling in the gaps when needed.
@LTLA Yeah, that's what I'm offering: to focus on block-wise deposition of the count data into the file. Whether that will be done through something called `H5SparseRealizationSink` or `H5ADRealizationSink` is an implementation detail at this point. It's actually very possible that I will introduce more than one `RealizationSink` subclass for this. Haven't really thought about the details yet.
@mojaveazure Good to know about a minimalist H5AD file with only `/X`, `/obs/_index`, and `/var/_index`. Thanks for providing the Python code. So I think you have a valid feature request. Now I'll need to find some time to work on this. Can't promise anything at this point.
Awesome, thank you! I'd also be happy to help in any way I can
Hello,
I would like to be able to write a sparse matrix to an H5AD file in a blocked manner using `write_block()`; I know we can do this with `HDF5RealizationSink`, but that results in a dense matrix, which limits downstream functionality. As H5AD files have a specific format for sparse matrices on disk, a new H5AD-specific sink should be able to handle this, unless I'm missing something? I'd be happy to help build this out, but I'm not familiar with the internals of `RealizationSink`, so I'm not sure how much I can meaningfully contribute.