bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

Try adding supports for non-integer data for 10x HDF5 #74

Closed ycli1995 closed 8 months ago

ycli1995 commented 8 months ago

Hi, @bnprks. I'm thinking that the data type from 10x HDF5 file should not be assumed as only uint32_t, since that the users may get matrices created by HDF5Array or DropletUtils::write10xCounts, which could contain normalized expression data. Therefore, I add supports for non-integer data for 10xMatrixH5, with the manners similar to IO of anndata.

Sorry for not raising a pull request, since further tests may be still required. I'll let you know when everything's done. https://github.com/ycli1995/BPCells/tree/084fec406ae8f1d80b0c2ffb5d7a535577834eb8

Here's some examples that already work:

devtools::with_debug(devtools::load_all("/develop/ycli1995/BPCells"))

# open an old 10x HDF5 with specified 'group'
mat1 <- open_matrix_10x_hdf5("/develop/1M_neurons_filtered_gene_bc_matrices_h5.h5", group = "mm10")

matrix_type(mat1)  ## uint32_t

mat1 <- mat1[1:200, 1:100]
mat1 <- Seurat::NormalizeData(mat1)
mat1 <- write_matrix_10x_hdf5(mat1, "/develop/ycli1995/BPCells.testwrapper/mat1.h5", group = "matrix")
mat1
matrix_type(mat1) ## double

mat2 <- Seurat::Read10X_h5("/develop/ycli1995/BPCells.testwrapper/mat1.h5")
mat1 <- as(mat1, "dgCMatrix")
mat2[1:10, 1:10]  ## Keep double-precision floating-point format
identical(mat1, mat2)

# Write two matrix into one HDF5 file in 10x formats
mat1 <- write_matrix_10x_hdf5(
  as(mat1, "IterableMatrix"),
  "/develop/ycli1995/BPCells.testwrapper/mat.h5", group = "mat1"
)
mat2 <- write_matrix_10x_hdf5(
  as(mat2, "IterableMatrix"),
  "/develop/ycli1995/BPCells.testwrapper/mat.h5", group = "mat2"
)
bnprks commented 8 months ago

Hi Yuchen, this sounds good overall, though one area I'm unsure about is how best to handle defaults. My guiding philosophy here is the robustness principle of "be conservative in what you send, be liberal in what you accept".

So I think when calling open_matrix_10x_hdf5, it would be a good improvement auto-detect the data type in the file if possible so we will work smoothly with files output from other tools (even if the data types mismatch what the 10x cellranger pipeline outputs). If auto-detection is not possible, having a type argument might work.

However, when calling write_matrix_10x_hdf5, I think we need to be careful to avoid defaults that could result in accidental non-integer outputs that might be incompatible with other tools designed to accept 10x hdf5 inputs. This issue comment is an example of where my choice to flexibly write h5ad matrix types has caused downstream issues since scanpy can only read float-type anndata matrices, so I'd like to avoid a repeat of that if possible.

My suggested interface changes for write_matrix_10x_hdf might be:

  1. If a user's input matrix is not integer type, default to converting to integer as now but also output a message the first time this happens to alert them to the conversion and how to avoid the conversion. This code shows a useful function to do a warning message that will only appear once every 6 hours.
  2. To allow opting-in to non-integer outputs, I'd suggest adding an argument type = c("uint32_t", "double", "float", "auto"), specifying the type of the output with auto just outputting the existing matrix type.

I hope this suggestion won't be too clunky an interface from your perspective. Let me know if you have other proposals for how the arguments to write_matrix_10x_hdf or open_matrix_10x_hdf5 should change. I agree it will be nice to enable non-integer 10x matrices for the cases when users want them so long as it won't trip up users relying on the defaults.

Thanks for your interest in this and your efforts to improve the code for everyone! As always I'd be happy to answer code questions or provide pointers if you're unsure of anything on the C++ side -- github or email is fine

ycli1995 commented 8 months ago

However, when calling write_matrix_10x_hdf5, I think we need to be careful to avoid defaults that could result in accidental non-integer outputs that might be incompatible with other tools designed to accept 10x hdf5 inputs. This issue comment is an example of where my choice to flexibly write h5ad matrix types has caused downstream issues since scanpy can only read float-type anndata matrices, so I'd like to avoid a repeat of that if possible.

Hi, @bnprks. This issue could be due to anndata version. As far as I know, anndata has changed the default data type from float32 to float64 in recent release. I've tried anndata 0.10.3 to read the matrix output from my example above, and it worked without exception.

{r}
mat1 <- Seurat::Read10X_h5("/develop/ycli1995/BPCells.testwrapper/mat1.h5")
mat1 <- write_matrix_10x_hdf5(
as(mat1, "IterableMatrix"),
"/develop/ycli1995/BPCells.testwrapper/mat.h5", group = "matrix",
feature_metadata = list(genome = rep("mm10", nrow(mat1)))  ## need genome
)
{python}
adata = sc.read_10x_h5("mat.h5")
adata.X[0:9, 0:9].toarray()
array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 6.03468367],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 5.95484046, 5.26428318],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 7.01401539, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 6.16791573, 0.        ]])

Still, your advices are right. I think allowing users to choose data type may be a better idea, and BPCells should just provide all kinds of operations that users might need. Anyway, I'll do more exploration on the interface design.

ycli1995 commented 8 months ago

Hi, @bnprks. I finally added a type argument to control the output data type for write_matrix_10x_hdf5. A pull request will be raised soon so I'll close this issue currently. Thanks for your advices above.

BTW, for https://github.com/bnprks/BPCells/issues/49#issuecomment-1932869295, I think the cause of ValueError could be the adata.X.indices or adata.X.indptr:

>>> adata.X.indices
array([ 13,  29,  37, ...,  68, 131, 188], dtype=uint32)
>>> adata.X.indptr
array([   0,   10,   17,   30,   42,   63,   91,  101,  117,  129,  147,
        169,  187,  195,  214,  238,  259,  279,  320,  341,  346,  366,
        389,  403,  413,  423,  432,  445,  458,  481,  509,  521,  529,
        558,  573,  587,  603,  623,  631,  641,  648,  655,  659,  679,
        693,  708,  714,  725,  736,  749,  762,  772,  779,  801,  814,
        826,  845,  852,  864,  880,  899,  908,  931,  944,  951,  963,
        973,  991, 1003, 1025, 1031, 1045, 1062, 1071, 1085, 1099, 1127,
       1139, 1153, 1172, 1193, 1227, 1257, 1273, 1281, 1295, 1311, 1320,
       1332, 1348, 1356, 1371, 1381, 1392, 1410, 1422, 1432, 1438, 1450,
       1461, 1468], dtype=uint64)

>>> adata.X[0:9, 0:9]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/scipy/sparse/_index.py", line 66, in __getitem__
    return self._get_sliceXslice(row, col)
  File "/usr/local/lib/python3.10/dist-packages/scipy/sparse/_compressed.py", line 664, in _get_sliceXslice
    return self._get_submatrix(major, minor, copy=True)
  File "/usr/local/lib/python3.10/dist-packages/scipy/sparse/_compressed.py", line 809, in _get_submatrix
    indptr, indices, data = get_csr_submatrix(
ValueError: unsupported data types in input

The uint32 and uint64 for indices may be unsupported. I'll try modifying write_matrix_anndata_hdf5 to make sure the two index types be int32 and int64, and see if the issue persists.