Closed ycli1995 closed 8 months ago
Hi Yuchen, this sounds good overall, though one area I'm unsure about is how best to handle defaults. My guiding philosophy here is the robustness principle of "be conservative in what you send, be liberal in what you accept".
So I think when calling open_matrix_10x_hdf5
, it would be a good improvement auto-detect the data type in the file if possible so we will work smoothly with files output from other tools (even if the data types mismatch what the 10x cellranger pipeline outputs). If auto-detection is not possible, having a type argument might work.
However, when calling write_matrix_10x_hdf5
, I think we need to be careful to avoid defaults that could result in accidental non-integer outputs that might be incompatible with other tools designed to accept 10x hdf5 inputs. This issue comment is an example of where my choice to flexibly write h5ad matrix types has caused downstream issues since scanpy can only read float-type anndata matrices, so I'd like to avoid a repeat of that if possible.
My suggested interface changes for write_matrix_10x_hdf
might be:
type = c("uint32_t", "double", "float", "auto")
, specifying the type of the output with auto
just outputting the existing matrix type.I hope this suggestion won't be too clunky an interface from your perspective. Let me know if you have other proposals for how the arguments to write_matrix_10x_hdf
or open_matrix_10x_hdf5
should change. I agree it will be nice to enable non-integer 10x matrices for the cases when users want them so long as it won't trip up users relying on the defaults.
Thanks for your interest in this and your efforts to improve the code for everyone! As always I'd be happy to answer code questions or provide pointers if you're unsure of anything on the C++ side -- github or email is fine
However, when calling
write_matrix_10x_hdf5
, I think we need to be careful to avoid defaults that could result in accidental non-integer outputs that might be incompatible with other tools designed to accept 10x hdf5 inputs. This issue comment is an example of where my choice to flexibly write h5ad matrix types has caused downstream issues since scanpy can only read float-type anndata matrices, so I'd like to avoid a repeat of that if possible.Hi, @bnprks. This issue could be due to anndata version. As far as I know,
anndata
has changed the default data type fromfloat32
tofloat64
in recent release. I've tried anndata 0.10.3 to read the matrix output from my example above, and it worked without exception.{r} mat1 <- Seurat::Read10X_h5("/develop/ycli1995/BPCells.testwrapper/mat1.h5") mat1 <- write_matrix_10x_hdf5( as(mat1, "IterableMatrix"), "/develop/ycli1995/BPCells.testwrapper/mat.h5", group = "matrix", feature_metadata = list(genome = rep("mm10", nrow(mat1))) ## need genome )
{python}
adata = sc.read_10x_h5("mat.h5")
adata.X[0:9, 0:9].toarray()
array([[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 6.03468367],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 5.95484046, 5.26428318],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 7.01401539, 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 6.16791573, 0. ]])
Still, your advices are right. I think allowing users to choose data type may be a better idea, and BPCells should just provide all kinds of operations that users might need. Anyway, I'll do more exploration on the interface design.
Hi, @bnprks. I finally added a type
argument to control the output data type for write_matrix_10x_hdf5
. A pull request will be raised soon so I'll close this issue currently. Thanks for your advices above.
BTW, for https://github.com/bnprks/BPCells/issues/49#issuecomment-1932869295, I think the cause of ValueError
could be the adata.X.indices
or adata.X.indptr
:
>>> adata.X.indices
array([ 13, 29, 37, ..., 68, 131, 188], dtype=uint32)
>>> adata.X.indptr
array([ 0, 10, 17, 30, 42, 63, 91, 101, 117, 129, 147,
169, 187, 195, 214, 238, 259, 279, 320, 341, 346, 366,
389, 403, 413, 423, 432, 445, 458, 481, 509, 521, 529,
558, 573, 587, 603, 623, 631, 641, 648, 655, 659, 679,
693, 708, 714, 725, 736, 749, 762, 772, 779, 801, 814,
826, 845, 852, 864, 880, 899, 908, 931, 944, 951, 963,
973, 991, 1003, 1025, 1031, 1045, 1062, 1071, 1085, 1099, 1127,
1139, 1153, 1172, 1193, 1227, 1257, 1273, 1281, 1295, 1311, 1320,
1332, 1348, 1356, 1371, 1381, 1392, 1410, 1422, 1432, 1438, 1450,
1461, 1468], dtype=uint64)
>>> adata.X[0:9, 0:9]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/scipy/sparse/_index.py", line 66, in __getitem__
return self._get_sliceXslice(row, col)
File "/usr/local/lib/python3.10/dist-packages/scipy/sparse/_compressed.py", line 664, in _get_sliceXslice
return self._get_submatrix(major, minor, copy=True)
File "/usr/local/lib/python3.10/dist-packages/scipy/sparse/_compressed.py", line 809, in _get_submatrix
indptr, indices, data = get_csr_submatrix(
ValueError: unsupported data types in input
The uint32
and uint64
for indices may be unsupported. I'll try modifying write_matrix_anndata_hdf5
to make sure the two index types be int32
and int64
, and see if the issue persists.
Hi, @bnprks. I'm thinking that the data type from 10x HDF5 file should not be assumed as only
uint32_t
, since that the users may get matrices created byHDF5Array
orDropletUtils::write10xCounts
, which could contain normalized expression data. Therefore, I add supports for non-integer data for10xMatrixH5
, with the manners similar to IO ofanndata
.Sorry for not raising a pull request, since further tests may be still required. I'll let you know when everything's done. https://github.com/ycli1995/BPCells/tree/084fec406ae8f1d80b0c2ffb5d7a535577834eb8
Here's some examples that already work: