Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

Expose dimension and format options for the H5SparseMatrixSeed constructor #54

Closed: LTLA closed this issue 1 year ago

LTLA commented 1 year ago

Looking at the H5SparseMatrixSeed source code, it seems like it would be straightforward to let users specify dim and ans_class instead of extracting them from the file. The H5SparseMatrixSeed constructor would then work with any compressed sparse matrix stored in an HDF5 group that contains data, indices and indptr, provided that the user can specify the dimensions and the row/column layout. (Of course, if these are not supplied, they can still be inferred automatically from the file.)
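To illustrate why the dimensions need to come from somewhere other than the three arrays, here is a minimal pure-Python sketch (not HDF5Array code; the helper name `infer_compressed_shape` is hypothetical). Given only indices and indptr, the length of the compressed (major) axis is recoverable, but the other (minor) axis only gets a lower bound, and nothing in the arrays says whether the major axis is rows or columns.

```python
def infer_compressed_shape(indices, indptr):
    """Return (n_major, min_minor) for a compressed sparse triplet.

    n_major is exact: indptr has one entry per major-axis slice, plus one.
    min_minor is only a lower bound: trailing all-zero slices along the
    minor axis leave no trace in `indices`.
    """
    n_major = len(indptr) - 1
    min_minor = max(indices) + 1 if len(indices) else 0
    return n_major, min_minor


# A 3 x 4 matrix whose last column is all zeros, compressed by row:
# [[1, 0, 2, 0],
#  [0, 0, 3, 0],
#  [4, 5, 0, 0]]
data = [1, 2, 3, 4, 5]
indices = [0, 2, 2, 0, 1]
indptr = [0, 2, 3, 5]

n_major, min_minor = infer_compressed_shape(indices, indptr)
# The true minor dimension is 4, but only 3 can be recovered.
print(n_major, min_minor)
```

In this example the arrays alone suggest a minor dimension of 3 rather than the true 4, which is the kind of gap that an explicit dim argument (or shape metadata in the file) closes.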

This request is motivated by the desire to avoid the H5AD formats, which are very confusing to explain in an R context, e.g., a matrix labelled as a csr_matrix inside the file is instead a CSC matrix in R. If my application already knows that a matrix is CSC/R, then I can just pass that knowledge directly to the constructor, rather than doing a mental double transposition to trick H5SparseMatrixSeed into doing the right thing. (One transposition for data writers to label the CSC matrix as csr, and then another transposition for data readers - not necessarily using HDF5Array - to undo the transposition to load csr as CSC.)
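The double transposition can be made concrete with a small pure-Python sketch (again not HDF5Array or anndata code; `expand` and `transpose` are hypothetical helpers). The same data, indices and indptr arrays, read as CSR with shape (3, 4), give exactly the transpose of what they give when read as CSC with shape (4, 3). The layout label and the dimensions together decide which matrix you get, which is why passing them directly is simpler than relabelling.

```python
def expand(data, indices, indptr, shape, layout):
    """Expand a compressed sparse triplet into a dense list of rows.

    layout is "CSR" (indptr walks over rows) or "CSC" (indptr walks over
    columns); shape is (nrow, ncol) of the resulting matrix.
    """
    if layout not in ("CSR", "CSC"):
        raise ValueError("layout must be 'CSR' or 'CSC'")
    nrow, ncol = shape
    n_major = nrow if layout == "CSR" else ncol
    if len(indptr) != n_major + 1:
        raise ValueError("indptr length does not match shape and layout")
    dense = [[0] * ncol for _ in range(nrow)]
    for major in range(n_major):
        for k in range(indptr[major], indptr[major + 1]):
            minor = indices[k]
            if layout == "CSR":
                dense[major][minor] = data[k]
            else:
                dense[minor][major] = data[k]
    return dense


def transpose(matrix):
    """Transpose a dense list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


data = [1, 2, 3, 4, 5]
indices = [0, 2, 2, 0, 1]
indptr = [0, 2, 3, 5]

as_csr = expand(data, indices, indptr, (3, 4), "CSR")
as_csc = expand(data, indices, indptr, (4, 3), "CSC")

# Same bytes on disk, two readings, related by one transposition.
assert as_csc == transpose(as_csr)
```

A writer that labels a CSC matrix as csr and a reader that undoes the label are each applying one such transposition; supplying the layout and dimensions up front skips both.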

hpages commented 1 year ago

Sounds good. Would it be ok if the user specified the layout ("CSC" or "CSR") instead of the class to return ("CSC_H5SparseMatrixSeed" or "CSR_H5SparseMatrixSeed")?

hpages commented 1 year ago

Done in HDF5Array 1.29.3 (see 41fe4b17c7822a1d29f0bf03d89c79aabce94bcc).

Does that work?

LTLA commented 1 year ago

Thanks, that looks great. Need to update my local R libs to check it out, but it's pretty much what I was thinking anyway.