bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

BPCells count format #82

Open sbt2024 opened 3 months ago

sbt2024 commented 3 months ago

I have recently noticed that the counts generated by BPCells now have the "RenameDims" class (BPCells::RenameDims) rather than the "MatrixSubset" class (BPCells::MatrixSubset) I saw previously. I believe this is because I recently updated BPCells to the most current version. This new class, however, has caused some issues that I never faced before when using BPCells with Seurat. I have just reported this to the Seurat team as well (https://github.com/satijalab/seurat/issues/8723), but I am not sure whether the main cause of the issue is BPCells or Seurat. I would appreciate any insight you might have. Thank you.

bnprks commented 3 months ago

Thanks for the report -- I don't think this issue is related to the "RenameDims" change, but it may be related to some recent changes in the reading of 10x hdf5 files.

Quoting the error from your linked issue for reference:

> sc_v5_test2 <- JoinLayers(sc_v5_test2)
Error in rbind2(argl[[i]], r) :
  Cannot merge matrices with different data type. Please use convert_matrix_type().

This error is not referring to the R class (i.e. MatrixSubset or RenameDims), but is referring to the type of data in the matrix (typically "uint32_t" for counts data, and "double" for normalized values). The type of a matrix is visible in the printout summary BPCells provides as a line reading e.g. Data type: double. There's also an unexported function to get the matrix type which you can access with the triple colon syntax: BPCells:::matrix_type(m).

To solve this issue, you would need to set all the matrices to be the same type, using the convert_matrix_type() function prior to merging. If you're unsure about whether your inputs are all counts data, I'd do m <- convert_matrix_type(m, "double") for all your matrices, as it will not lose information and will be a no-op for matrices that don't need any type conversion applied. This won't modify anything on-disk but will let BPCells know to do the conversion on-the-fly when data is read.
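A minimal sketch of that conversion on toy data, assuming the Matrix and BPCells packages are installed. The matrix names and values are hypothetical and only illustrate the pattern: check each matrix's type, convert everything to "double", then merge.

```r
library(Matrix)
library(BPCells)

# Two toy gene-by-cell count matrices (hypothetical data, 3 genes x 2 cells each).
genes <- c("g1", "g2", "g3")
raw_a <- Matrix(matrix(c(1, 0, 2, 0, 3, 0), nrow = 3,
                       dimnames = list(genes, c("a1", "a2"))), sparse = TRUE)
raw_b <- Matrix(matrix(c(0, 4, 0, 5, 0, 6), nrow = 3,
                       dimnames = list(genes, c("b1", "b2"))), sparse = TRUE)

# Wrap both as BPCells matrices; force one to uint32_t to mimic mixed inputs.
m_a <- as(raw_a, "IterableMatrix")
m_b <- convert_matrix_type(as(raw_b, "IterableMatrix"), "uint32_t")

# Inspect the data types (unexported helper mentioned above).
print(BPCells:::matrix_type(m_a))
print(BPCells:::matrix_type(m_b))

# Convert every matrix to "double"; this is a no-op for matrices already double.
mats <- lapply(list(m_a, m_b), convert_matrix_type, type = "double")

# With matching data types, the matrices can be merged.
merged <- do.call(cbind, mats)
print(merged)
```

For a Seurat object, the same conversion would be applied to each BPCells-backed layer before calling JoinLayers(); how the layers are read and written back depends on the Seurat version, so that part is left out here.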

As for the cause of this change, the only thing I can think of is a recent change in how BPCells handles 10x HDF5 matrices written by non-cellranger tools: it now auto-detects the input data type rather than assuming the uint32_t type that cellranger outputs. It's possible the defaults have also shifted for AnnData inputs, though that would have been longer ago. Does that by any chance describe your situation?

sbt2024 commented 3 months ago

Thanks a lot - that seems to have solved the issue. Another issue popped up while splitting the layers, but I think it is most likely related to Seurat, so I will try to post it there. Just FYI here:

> sc_v5_test2[["RNA"]] <- split(sc_v5_test2[["RNA"]], f = sc_v5_test2$batch)
Splitting 'counts', 'data' layers. Not splitting . If you would like to split other layers, set in layers argument.

I checked the sc_v5_test2 object after running that command and the object is split just fine, so I am not sure why that message is being generated. Thanks again for the prompt response!