Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

Feature Request: Support for writing sparse matrices to H5AD files #50

Open mojaveazure opened 1 year ago

mojaveazure commented 1 year ago

Hello,

I would like to be able to write a sparse matrix to an H5AD file in a blocked manner using write_block(). I know I can do this with HDF5RealizationSink, but that results in a dense matrix on disk, which limits downstream functionality. Since H5AD files have a specific on-disk format for sparse matrices, a new H5AD-specific sink should be able to handle this, unless I'm missing something?
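For context, the on-disk sparse layout that such a sink would target stores a CSR/CSC matrix as three datasets (data, indices, indptr) under a group such as /X, with the shape kept as a group attribute. A minimal sketch of that mapping using scipy (dataset names follow the anndata on-disk convention; the actual HDF5 writing is omitted here):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny CSR matrix; in an h5ad file these three arrays would become
# the /X/data, /X/indices and /X/indptr datasets, with the shape (and,
# per the anndata spec, an encoding-type such as "csr_matrix") stored
# as attributes on the /X group.
mat = csr_matrix(np.array([[0, 2, 0],
                           [1, 0, 3]]))

print(mat.data)     # non-zero values, row-major: [2 1 3]
print(mat.indices)  # column index of each non-zero: [1 0 2]
print(mat.indptr)   # start offset of each row: [0 1 3]
```

A sink only has to keep these three arrays growing consistently as blocks arrive.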

I'd be happy to help build this out, but I'm not familiar with the internals of RealizationSink, so I'm not sure how much I can meaningfully contribute.

LTLA commented 1 year ago

zellkonverter has some code for this:

https://github.com/theislab/zellkonverter/blob/f9303c4695221060e551ddffe2e40e58aeb66a0e/R/write.R#L158-L242

I use a variant of this for some in-house applications.

hpages commented 1 year ago

Hi @mojaveazure,

I suppose HDF5Array could provide an H5ADRealizationSink class similar to the existing TENxRealizationSink class. The latter can be used to write blocks of a sparse matrix to the 10x Genomics sparse format. It is used internally by writeTENxMatrix().

I just want to emphasize that, as with the other data-writing capabilities in HDF5Array, this new functionality would focus on writing the count data to disk, plus possibly the rownames (/var/_index in the h5ad file) and colnames (/obs/_index in the h5ad file) if present. This is because HDF5Array is meant to be a low-level package, so everything else typically found in an h5ad file would need to be taken care of by higher-level functionality implemented elsewhere.

I'm not too familiar with the h5ad format but is an h5ad file with only the /X group plus /obs/_index and /var/_index datasets considered valid? This is what the user would end up with when writing to a new file.

I was also going to mention zellkonverter, but I see that @LTLA just did while I was typing my answer. I only knew about zellkonverter::readH5AD() and zellkonverter::writeH5AD() though, for reading/writing a SingleCellExperiment object from/to an h5ad file. But it seems they also have low-level utilities for writing data by block, so maybe that's it :smile:

H.

LTLA commented 1 year ago

IMO a worthwhile functionality would be more like an H5SparseRealizationSink that could be re-used by higher-level packages like zellkonverter. Then HDF5Array just needs to concern itself with the block-wise deposition of CSC/CSR content into an HDF5 file, while higher-level packages like zellkonverter worry about the formatting miscellany.
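The block-wise deposition described here boils down to appending each column block's data and indices and extending indptr with a running offset. A minimal in-memory sketch of that bookkeeping (plain Python lists stand in for appendable HDF5 datasets; a real sink would grow rhdf5/h5py datasets instead):

```python
import numpy as np
from scipy.sparse import csc_matrix, random as sparse_random

mat = sparse_random(10, 8, density=0.3, format="csc", random_state=0)

# Accumulators standing in for extendable HDF5 datasets.
data, indices, indptr = [], [], [0]

# Process the matrix in column blocks, as write_block() would.
block_size = 3
for start in range(0, mat.shape[1], block_size):
    block = mat[:, start:start + block_size].tocsc()
    offset = indptr[-1]                       # non-zeros written so far
    data.extend(block.data)
    indices.extend(block.indices)
    indptr.extend(block.indptr[1:] + offset)  # shift block offsets

# The accumulated arrays reconstruct the original matrix exactly.
rebuilt = csc_matrix((data, indices, indptr), shape=mat.shape)
print((rebuilt != mat).nnz)  # 0 differences
```

Because each CSC block's indptr is relative to the block, adding the running non-zero count is the only state the sink needs to carry between blocks.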

mojaveazure commented 1 year ago

zellkonverter looks promising, but I don't think I can use it. Ideally, I want to write a function like this:

```r
function(mat, sink) {
  grid <- colAutoGrid(mat)           # iterate over column blocks
  for (i in seq_along(grid)) {
    vp <- grid[[i]]                  # viewport for this block
    x <- read_block(mat, vp, is_sparse(mat))
    # do some processing
    x <- f(x)
    write_block(sink, vp, x)
  }
  return(as(sink, "DelayedArray"))
}
```

And have it work with both H5AD and TileDB files, along with their respective sinks. I can use HDF5RealizationSink for H5AD files, but that writes a dense matrix on disk, which ends up being cumbersome to use in downstream steps. I have no need for DelayedArray/HDF5Array to handle anything other than /X or /layers from an H5AD file, and I agree that the extra information should be handled by a higher-level package like zellkonverter.

mojaveazure commented 1 year ago

As for whether an H5AD file with only /X, /obs/_index, and /var/_index is valid: it would appear so.

script hidden for brevity:

```python
#!/usr/bin/env python3
import h5py
import random
import anndata
from string import ascii_lowercase
from scipy.sparse import csr_matrix
from numpy.random import random_integers

print(anndata.__version__)  # 0.8.0

# Create some random data
x = random_integers(low=0, high=7, size=30)
x.shape = (5, 6)
mat = csr_matrix(x)
cells = [''.join(random.sample(ascii_lowercase, 5)) for _ in range(mat.shape[0])]
genes = [''.join(random.sample(ascii_lowercase, 7)) for _ in range(mat.shape[1])]

# Create an H5AD file
adata = anndata.AnnData(mat)
adata.obs_names = cells
adata.var_names = genes
adata.write_h5ad("adata_test.h5ad")

# Remove all groups except X, obs, and var
hfile = h5py.File("adata_test.h5ad", mode="r+")
items = {x[0] for x in hfile.items()}
items.difference_update({'X', 'obs', 'var'})
for it in items:
    hfile.pop(it)
hfile.close()

# Check the file
adata2 = anndata.read_h5ad("adata_test.h5ad")
print(adata2.X != adata.X)
print(adata2.obs_names != adata.obs_names)
print(adata2.var_names != adata.var_names)

# Check backed matrix
adata3 = anndata.read_h5ad("adata_test.h5ad", backed=True)
print(adata3.X.to_memory() != adata.X)
print(adata3.obs_names != adata.obs_names)
print(adata3.var_names != adata.var_names)
```

In my experience, AnnData is pretty robust to incomplete H5AD files, and is capable of filling in the gaps when needed

hpages commented 1 year ago

@LTLA Yeah, that's what I'm offering: to focus on block-wise deposition of the count data into the file. Whether that will be done through something called H5SparseRealizationSink or H5ADRealizationSink is an implementation detail at this point. It's actually very possible that I will introduce more than one RealizationSink subclass for this. I haven't really thought about the details yet.

@mojaveazure Good to know about a minimalist H5AD file with only /X, /obs/_index, and /var/_index. Thanks for providing the Python code. So I think you have a valid feature request. Now I'll need to find some time to work on this; can't promise anything at this point.

mojaveazure commented 1 year ago

Awesome, thank you! I'd also be happy to help in any way I can