BPCells to H5AD #76

Closed: ycli1995 closed this issue 3 months ago

ycli1995 commented 4 months ago

Hi, @bnprks. Regarding https://github.com/bnprks/BPCells/issues/49#issuecomment-1932869295: when I force `H5NumWriter<uint64_t>` to write `int64_t` into the HDF5 dataset, the resulting .h5ad file works normally with Python's `sc.read_h5ad`. See the branch https://github.com/ycli1995/BPCells/tree/anndata.

Below is my example:

```r
devtools::with_debug(devtools::load_all("/develop/ycli1995/BPCells"))

# Open an old 10x HDF5 file with the specified 'group'
mat1 <- open_matrix_10x_hdf5("/develop/1M_neurons_filtered_gene_bc_matrices_h5.h5", group = "mm10")

matrix_type(mat1)  ## uint32_t

mat1 <- mat1[1:200, 1:100]
mat1 <- convert_matrix_type(mat1, "double")
mat1 <- write_matrix_anndata_hdf5(mat1, "sct_counts.h5")
```

```python
>>> import anndata as ad
>>> adata = ad.read_h5ad("sct_counts.h5")
>>> adata[0:9, 0:9]
View of AnnData object with n_obs × n_vars = 9 × 9
    obs: 'bpcells_name'
    var: 'bpcells_name'

>>> adata[0:9, 0:9].X
<9x9 sparse matrix of type '<class 'numpy.float64'>'
        with 5 stored elements in Compressed Sparse Row format>

>>> adata.X.indptr
array([   0,   10,   17,   30,   42,   63,   91,  101,  117,  129,  147,
        169,  187,  195,  214,  238,  259,  279,  320,  341,  346,  366,
        389,  403,  413,  423,  432,  445,  458,  481,  509,  521,  529,
        558,  573,  587,  603,  623,  631,  641,  648,  655,  659,  679,
        693,  708,  714,  725,  736,  749,  762,  772,  779,  801,  814,
        826,  845,  852,  864,  880,  899,  908,  931,  944,  951,  963,
        973,  991, 1003, 1025, 1031, 1045, 1062, 1071, 1085, 1099, 1127,
       1139, 1153, 1172, 1193, 1227, 1257, 1273, 1281, 1295, 1311, 1320,
       1332, 1348, 1356, 1371, 1381, 1392, 1410, 1422, 1432, 1438, 1450,
       1461, 1468])

>>> adata.X.indices
array([ 13,  29,  37, ...,  68, 131, 188], dtype=uint32)

>>> adata.X.indptr.dtype
dtype('int64')
```

As you can see, once `indptr` is guaranteed to be int64, subsetting `adata` works.
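
As an aside, the on-disk dtype can also be confirmed without going through anndata. Below is a minimal sketch using the plain HDF5 C++ API (not BPCells code), assuming the standard AnnData layout where `X/indptr` is a dataset in the file written above:

```cpp
// Standalone sanity check with the HDF5 C++ API: confirm that X/indptr
// in the file written above is stored as a signed 8-byte integer.
#include <H5Cpp.h>
#include <iostream>

int main() {
    H5::H5File file("sct_counts.h5", H5F_ACC_RDONLY);
    H5::DataSet indptr = file.openDataSet("X/indptr");
    H5::IntType dtype = indptr.getIntType();
    std::cout << "indptr: " << dtype.getSize() << " bytes, "
              << (dtype.getSign() == H5T_SGN_2 ? "signed" : "unsigned")
              << "\n";  // expect "8 bytes, signed" after the patch
    return 0;
}
```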

Of course, we shouldn't just change the behavior of `H5NumWriter<uint64_t>` in such an arbitrary manner. Therefore, I'm thinking about the following options:

I'm also wondering whether we actually need uint64 for `indptr` in a CSC matrix. In other words, do we really expect the total number of non-zero values to exceed the upper bound of int64?
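
For a sense of scale (my own back-of-the-envelope numbers, not from the thread): even a fully dense atlas-scale matrix falls short of the int64 limit by many orders of magnitude.

```cpp
// Back-of-the-envelope: a fully dense 10M-cell x 100k-feature matrix
// has 1e12 non-zeros, roughly seven orders of magnitude below what a
// signed 64-bit indptr can address.
#include <cstdint>
#include <iostream>

int main() {
    const int64_t nnz_dense = 10'000'000LL * 100'000LL;  // 1e12
    std::cout << "nnz (dense worst case): " << nnz_dense << "\n";
    std::cout << "int64 max:              " << INT64_MAX << "\n";  // ~9.2e18
    return 0;
}
```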

bnprks commented 4 months ago

Thanks for running this experiment; that's very handy to know. You're right that there's no real reason to use uint64 over int64, especially on disk. I think the easiest way to address this would be to add `NumReader` and `NumWriter` support for `int32_t` and `int64_t`, then do something like `wb.createLongWriter("indptr").convert<uint64_t>()` just within the AnnData (and probably 10x) readers/writers. That way we can be more compatible with AnnData and 10x without altering any of the internals of how `StoredMatrixWriter` works.
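
To make that concrete, here is a rough sketch of what such a `convert` adapter could look like. This is illustrative only; the stand-in `NumWriter` interface and the adapter class name below are assumptions, not BPCells' actual classes:

```cpp
// Sketch of the conversion idea: write int64 on disk, but keep accepting
// uint64_t from callers so internals don't have to change.
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for BPCells' NumWriter interface (assumed, simplified)
template <typename T>
struct NumWriter {
    virtual void write(const T *data, uint64_t count) = 0;
    virtual ~NumWriter() = default;
};

// Adapter: accepts uint64_t values, range-checks, delegates to an
// int64_t writer that owns the actual on-disk dataset
struct Int64AsUint64Writer : public NumWriter<uint64_t> {
    NumWriter<int64_t> &inner;
    explicit Int64AsUint64Writer(NumWriter<int64_t> &w) : inner(w) {}

    void write(const uint64_t *data, uint64_t count) override {
        std::vector<int64_t> buf(count);
        for (uint64_t i = 0; i < count; i++) {
            if (data[i] > static_cast<uint64_t>(INT64_MAX))
                throw std::runtime_error("value exceeds int64 range");
            buf[i] = static_cast<int64_t>(data[i]);
        }
        inner.write(buf.data(), count);
    }
};
```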

If this is something you'd like to take up, you'll need to add two functions to the `ReaderBuilder` and `WriterBuilder` interfaces, along with implementations for HDF5 files. (For the other reader/writer storage types, I'd be fine just putting in stubs that throw an exception saying it's not implemented yet.) While we're at it, it might make sense to rename Int -> Int32, Long -> Int64, etc.
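
A sketch of how the builder side might look, with throwing stubs for the non-HDF5 backends; again, all type and method names here are stand-ins rather than the real BPCells signatures:

```cpp
// Builder additions: declare the new int32/int64 writer factories on the
// interface, implement them for HDF5, and stub them elsewhere.
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in for BPCells' NumWriter interface (assumed, simplified)
template <typename T>
struct NumWriter {
    virtual void write(const T *data, uint64_t count) = 0;
    virtual ~NumWriter() = default;
};

// Stand-in for the WriterBuilder interface with the two new methods
struct WriterBuilder {
    virtual std::unique_ptr<NumWriter<int32_t>> createInt32Writer(std::string name) = 0;
    virtual std::unique_ptr<NumWriter<int64_t>> createInt64Writer(std::string name) = 0;
    virtual ~WriterBuilder() = default;
};

// Non-HDF5 storage types can throw until someone actually needs them
struct FileWriterBuilderStub : public WriterBuilder {
    std::unique_ptr<NumWriter<int32_t>> createInt32Writer(std::string) override {
        throw std::logic_error("createInt32Writer: not implemented for this storage type");
    }
    std::unique_ptr<NumWriter<int64_t>> createInt64Writer(std::string) override {
        throw std::logic_error("createInt64Writer: not implemented for this storage type");
    }
};
```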