BPCells to H5AD #76

Closed: ycli1995 closed this issue 3 months ago

ycli1995 commented 4 months ago

Hi, @bnprks. Regarding https://github.com/bnprks/BPCells/issues/49#issuecomment-1932869295: when I force `H5NumWriter<uint64_t>` to write `int64_t` into the HDF5 dataset, the resulting .h5ad file works normally with Python's `sc.read_h5ad`. See the branch https://github.com/ycli1995/BPCells/tree/anndata.

Below is my example:

```r
devtools::with_debug(devtools::load_all("/develop/ycli1995/BPCells"))

# Open an old 10x HDF5 file with the specified 'group'
mat1 <- open_matrix_10x_hdf5("/develop/1M_neurons_filtered_gene_bc_matrices_h5.h5", group = "mm10")

matrix_type(mat1)  ## uint32_t

mat1 <- mat1[1:200, 1:100]
mat1 <- convert_matrix_type(mat1, "double")
mat1 <- write_matrix_anndata_hdf5(mat1, "sct_counts.h5")
```

```python
>>> import anndata as ad
>>> adata = ad.read_h5ad("sct_counts.h5")
>>> adata[0:9, 0:9]
View of AnnData object with n_obs × n_vars = 9 × 9
    obs: 'bpcells_name'
    var: 'bpcells_name'

>>> adata[0:9, 0:9].X
<9x9 sparse matrix of type '<class 'numpy.float64'>'
        with 5 stored elements in Compressed Sparse Row format>

>>> adata.X.indptr
array([   0,   10,   17,   30,   42,   63,   91,  101,  117,  129,  147,
        169,  187,  195,  214,  238,  259,  279,  320,  341,  346,  366,
        389,  403,  413,  423,  432,  445,  458,  481,  509,  521,  529,
        558,  573,  587,  603,  623,  631,  641,  648,  655,  659,  679,
        693,  708,  714,  725,  736,  749,  762,  772,  779,  801,  814,
        826,  845,  852,  864,  880,  899,  908,  931,  944,  951,  963,
        973,  991, 1003, 1025, 1031, 1045, 1062, 1071, 1085, 1099, 1127,
       1139, 1153, 1172, 1193, 1227, 1257, 1273, 1281, 1295, 1311, 1320,
       1332, 1348, 1356, 1371, 1381, 1392, 1410, 1422, 1432, 1438, 1450,
       1461, 1468])

>>> adata.X.indices
array([ 13,  29,  37, ...,  68, 131, 188], dtype=uint32)

>>> adata.X.indptr.dtype
dtype('int64')
```

As you can see, once `indptr` is guaranteed to be int64, subsetting `adata` works.
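
As an aside, the on-disk dtype can also be confirmed without going through anndata. Below is a minimal sketch using the plain HDF5 C++ API (not BPCells code), assuming the standard AnnData layout where `X/indptr` is a dataset in the file written above:

```cpp
// Standalone sanity check with the HDF5 C++ API: confirm that X/indptr
// in the file written above is stored as a signed 8-byte integer.
#include <H5Cpp.h>
#include <iostream>

int main() {
    H5::H5File file("sct_counts.h5", H5F_ACC_RDONLY);
    H5::DataSet indptr = file.openDataSet("X/indptr");
    H5::IntType dtype = indptr.getIntType();
    std::cout << "indptr: " << dtype.getSize() << " bytes, "
              << (dtype.getSign() == H5T_SGN_2 ? "signed" : "unsigned")
              << "\n";  // expect "8 bytes, signed" after the patch
    return 0;
}
```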

Of course, we shouldn't just change the behavior of `H5NumWriter<uint64_t>` in such an arbitrary manner. Therefore, I'm thinking about the following options:

I'm also wondering whether we actually need uint64 for `indptr` in a CSC matrix. In other words, do we really expect the total number of non-zero values to exceed the upper bound of int64?
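
For a sense of scale (my own back-of-the-envelope numbers, not from the thread): even a fully dense atlas-scale matrix falls short of the int64 limit by many orders of magnitude.

```cpp
// Back-of-the-envelope: a fully dense 10M-cell x 100k-feature matrix
// has 1e12 non-zeros, roughly seven orders of magnitude below what a
// signed 64-bit indptr can address.
#include <cstdint>
#include <iostream>

int main() {
    const int64_t nnz_dense = 10'000'000LL * 100'000LL;  // 1e12
    std::cout << "nnz (dense worst case): " << nnz_dense << "\n";
    std::cout << "int64 max:              " << INT64_MAX << "\n";  // ~9.2e18
    return 0;
}
```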

bnprks commented 4 months ago

Thanks for running this experiment; that's very handy to know. You're right that there's no real reason to use uint64 over int64, especially on disk. I think the easiest way to address this would be to add `NumReader` and `NumWriter` support for `int32_t` and `int64_t`, then do something like `wb.createLongWriter("indptr").convert<uint64_t>()` just within the AnnData (and probably 10x) readers/writers. That way we can be more compatible with AnnData and 10x without altering any of the internals of how `StoredMatrixWriter` works.
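
To make that concrete, here is a rough sketch of what such a `convert` adapter could look like. This is illustrative only; the stand-in `NumWriter` interface and the adapter class name below are assumptions, not BPCells' actual classes:

```cpp
// Sketch of the conversion idea: write int64 on disk, but keep accepting
// uint64_t from callers so internals don't have to change.
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for BPCells' NumWriter interface (assumed, simplified)
template <typename T>
struct NumWriter {
    virtual void write(const T *data, uint64_t count) = 0;
    virtual ~NumWriter() = default;
};

// Adapter: accepts uint64_t values, range-checks, delegates to an
// int64_t writer that owns the actual on-disk dataset
struct Int64AsUint64Writer : public NumWriter<uint64_t> {
    NumWriter<int64_t> &inner;
    explicit Int64AsUint64Writer(NumWriter<int64_t> &w) : inner(w) {}

    void write(const uint64_t *data, uint64_t count) override {
        std::vector<int64_t> buf(count);
        for (uint64_t i = 0; i < count; i++) {
            if (data[i] > static_cast<uint64_t>(INT64_MAX))
                throw std::runtime_error("value exceeds int64 range");
            buf[i] = static_cast<int64_t>(data[i]);
        }
        inner.write(buf.data(), count);
    }
};
```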

If this is something you'd like to take up, you'll need to add two functions to the `ReaderBuilder` and `WriterBuilder` interfaces, along with implementations for HDF5 files. (For the other reader/writer storage types, I'd be fine just putting in stubs that throw an exception saying it's not implemented yet.) While we're at it, it might make sense to rename Int -> Int32, Long -> Int64, etc.
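
A sketch of how the builder side might look, with throwing stubs for the non-HDF5 backends; again, all type and method names here are stand-ins rather than the real BPCells signatures:

```cpp
// Builder additions: declare the new int32/int64 writer factories on the
// interface, implement them for HDF5, and stub them elsewhere.
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in for BPCells' NumWriter interface (assumed, simplified)
template <typename T>
struct NumWriter {
    virtual void write(const T *data, uint64_t count) = 0;
    virtual ~NumWriter() = default;
};

// Stand-in for the WriterBuilder interface with the two new methods
struct WriterBuilder {
    virtual std::unique_ptr<NumWriter<int32_t>> createInt32Writer(std::string name) = 0;
    virtual std::unique_ptr<NumWriter<int64_t>> createInt64Writer(std::string name) = 0;
    virtual ~WriterBuilder() = default;
};

// Non-HDF5 storage types can throw until someone actually needs them
struct FileWriterBuilderStub : public WriterBuilder {
    std::unique_ptr<NumWriter<int32_t>> createInt32Writer(std::string) override {
        throw std::logic_error("createInt32Writer: not implemented for this storage type");
    }
    std::unique_ptr<NumWriter<int64_t>> createInt64Writer(std::string) override {
        throw std::logic_error("createInt64Writer: not implemented for this storage type");
    }
};
```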