Closed ycli1995 closed 8 months ago
Thanks for running this experiment, very handy to know. You're right that there's no real reason to use uint64 over int64, especially on disk. I think the easiest way to address this would be to add `NumReader` and `NumWriter` support for `int32_t` and `int64_t`, then do something like `wb.createLongWriter("indptr").convert<uint64_t>()` just within the AnnData (and probably 10x) readers/writers. That way we can be more compatible with AnnData and 10x without altering any of the internals of how `StoredMatrixWriter` works.
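To illustrate the adapter idea, here is a minimal Python sketch (not BPCells code) of a writer that accepts unsigned 64-bit values from the caller while storing signed 64-bit values. `Int64Sink` and `ConvertingWriter` are hypothetical names, and the `array` module stands in for an on-disk int64 dataset.

```python
from array import array

INT64_MAX = 2**63 - 1


class Int64Sink:
    """Hypothetical stand-in for an on-disk int64 dataset."""

    def __init__(self):
        self.data = array("q")  # 'q' = signed 64-bit integers

    def write(self, values):
        self.data.extend(values)


class ConvertingWriter:
    """Accepts unsigned 64-bit values and forwards them to an int64 sink."""

    def __init__(self, sink):
        self.sink = sink

    def write(self, values):
        # Validate the whole batch first so a bad value leaves the sink untouched.
        for v in values:
            if not 0 <= v <= INT64_MAX:
                raise OverflowError(f"{v} does not fit in int64")
        self.sink.write(values)


sink = Int64Sink()
writer = ConvertingWriter(sink)
writer.write([0, 3, 7, 12])  # a small CSC-style indptr
```

The calling code keeps working with unsigned values, and only the adapter knows that the stored type is signed, which is the separation the comment above is aiming for.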
If this is something you'd like to take up, you'll need to add two functions to the `ReaderBuilder` and `WriterBuilder` interfaces, along with implementations for HDF5 files. (For the other reader/writer storage types, I'd be fine just putting in stubs that throw an exception saying it's not implemented yet.) While we're at it, it might make sense to rename Int -> Int32 and Long -> Int64, etc.
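The shape of that change can be sketched as follows. This is a language-neutral illustration in Python, not the actual BPCells C++ interface: the method names `create_int32_writer` and `create_int64_writer`, and the two backend classes, are assumptions made up for the example.

```python
from abc import ABC, abstractmethod


class ListWriter:
    """Hypothetical writer that just collects values in memory."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype
        self.values = []

    def write(self, values):
        self.values.extend(values)


class WriterBuilder(ABC):
    """Hypothetical builder interface with the two new functions."""

    @abstractmethod
    def create_int32_writer(self, name): ...

    @abstractmethod
    def create_int64_writer(self, name): ...


class FullBackendBuilder(WriterBuilder):
    """Plays the role of the HDF5 backend: both functions are implemented."""

    def create_int32_writer(self, name):
        return ListWriter(name, "int32")

    def create_int64_writer(self, name):
        return ListWriter(name, "int64")


class StubBackendBuilder(WriterBuilder):
    """Plays the role of another storage type: stubs only for now."""

    def create_int32_writer(self, name):
        raise NotImplementedError("int32 writers are not implemented yet")

    def create_int64_writer(self, name):
        raise NotImplementedError("int64 writers are not implemented yet")


indptr_writer = FullBackendBuilder().create_int64_writer("indptr")
indptr_writer.write([0, 2, 5])
```

Stubs that raise keep every backend conforming to the interface, so the missing implementations can be filled in later without touching callers.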
Hi, @bnprks. For issue https://github.com/bnprks/BPCells/issues/49#issuecomment-1932869295, when I force `H5NumWriter<uint64_t>` to write `int64_t` into the HDF5 dataset, the .h5ad file works normally with Python's `sc.read_h5ad`. See the branch https://github.com/ycli1995/BPCells/tree/anndata. Below is my example:
As you can see, when `indptr` is ensured to be `int64`, subsetting `adata` works.

Of course, we shouldn't just change the behavior of `H5NumWriter<uint64_t>` in such an arbitrary manner. Therefore, I'm thinking of the following options:

1. The `ValueError` in https://github.com/bnprks/BPCells/issues/49#issuecomment-1932869295 arises because `scipy.sparse.csr_matrix` does not support indexing when `indptr` is `uint64`, so we could change nothing, raise an issue in `scipy`'s repository, and wait for a release that fixes it.
2. Change `StoredMatrixWriter` to handle how `indptr` is written when the target is H5AD.

I'm also wondering whether we actually need `uint64` for `indptr` in a CSC matrix. In other words, do we really expect the total number of non-zero values to exceed the upper bound of `int64`?
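To put that question in numbers: the last entry of a CSC `indptr` equals the number of stored non-zeros, so `int64` only falls short once that count passes 2^63 - 1. The rough sketch below assumes, very conservatively, a single byte of storage per non-zero value; `min_storage_eib` is an illustrative helper, not part of any library.

```python
INT64_MAX = 2**63 - 1   # largest value an int64 indptr entry can hold
UINT64_MAX = 2**64 - 1  # largest value a uint64 indptr entry can hold
BYTES_PER_EIB = 2**60   # one exbibyte


def min_storage_eib(nnz, bytes_per_value=1):
    """Lower bound on storage (in EiB) for nnz stored values."""
    return nnz * bytes_per_value / BYTES_PER_EIB


print(INT64_MAX)                       # 9223372036854775807
print(min_storage_eib(INT64_MAX + 1))  # 8.0
```

Under that assumption, a matrix would need at least 8 EiB of values before its `indptr` overflowed `int64`, so the extra bit of range from `uint64` is unlikely to matter in practice.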