bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

Different gene expression count values when converting BPCells matrix #160

Open sbt2024 opened 1 week ago

sbt2024 commented 1 week ago

I save count values from a Seurat object as:

sc_counts <- GetAssayData(object = sc, assay = "RNA", layer = "data")

The sc_counts object is a BPCells object of class RenameDims. It has 4 nested matrices that can be accessed with the "@" operator, as shown in my code below.

I tried extracting the count matrix with the two approaches below, but I get slightly different count values for some genes. What is the difference between these two approaches, and which one is the correct way to extract counts?

Approach 1: sc_counts_dgCMatrix <- as(sc_counts, "dgCMatrix")

Approach 2: sc_counts_dgCMatrix <- as(sc_counts@matrix@matrix@matrix@matrix, "dgCMatrix")

bnprks commented 1 week ago

Hi @sbt2024, thanks for your question.

With normal Seurat conventions, accessing the data layer as you do would return log-normalized data, not raw counts. If you want raw counts, the best way to do that would probably be sc_counts <- GetAssayData(object = sc, assay = "RNA", layer = "counts"). Then the direct conversion of as(sc_counts, "dgCMatrix") would be the recommended way to convert to an in-memory matrix if needed.

As BPCells can queue up several operations to be performed on-the-fly when loading data from disk, the repeated @matrix accesses you have tried out will strip away later staged operations (which might include, e.g., log normalization steps). It's mostly fine to access inner (less processed) data that way, though it's easier to run into problems than just loading the counts layer directly out of Seurat.

When you print out a BPCells matrix in an interactive session it will show what the "staged operations" are, which can make it easier to tell what's going on.

My guess is that your Approach 2 is stripping away some math operations that are being applied in Approach 1. If this is not the case, just let me know what staged operations are printed out for each of these matrix versions and I can help assess why the values are different.
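For readers following along, here is a minimal sketch of the effect described in this comment. It is not taken from the thread: the toy matrix, the variable names, and the particular normalization are assumptions made for illustration, and the @matrix slot layout is internal to BPCells and may differ between versions. It assumes the BPCells and Matrix packages are installed.

```r
library(BPCells)
library(Matrix)

# Toy genes x cells count matrix; the values are made up for illustration.
set.seed(1)
counts <- rsparsematrix(50, 10, density = 0.3,
                        rand.x = function(n) rpois(n, 3) + 1)
dimnames(counts) <- list(paste0("gene", 1:50), paste0("cell", 1:10))

# Wrap the in-memory matrix as a BPCells matrix and queue a log-normalization.
# The operations are staged, not computed immediately.
mat <- as(counts, "IterableMatrix")
norm <- log1p(multiply_cols(mat, 1e4 / colSums(mat)))

# Printing the object lists the queued (staged) operations.
print(norm)

# Approach 1: convert the whole object, so every staged operation is applied.
full <- as(norm, "dgCMatrix")

# Approach 2 (one level only): drop the outermost staged operation,
# assumed here to be the log1p step, and then convert.
stripped <- as(norm@matrix, "dgCMatrix")

# The stored non-zero values differ because log1p was skipped in `stripped`.
range(full@x)
range(stripped@x)
```

Each additional @matrix would peel off one more layer, which is why the nested access in Approach 2 can return less processed values than the direct conversion.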

sbt2024 commented 1 week ago

Hi @bnprks, thanks for your response. I was actually intentionally accessing the data layer to obtain the log-normalized data. Thanks for confirming that Approach 1 is the right way to go. You raised a good point, though (in your second paragraph), regarding the BPCells operations. Even though I am extracting the normalized counts from the Seurat object, I assume the direct conversion of the log-normalized data is handled appropriately by as(sc_log_data, "dgCMatrix"), without any re-normalization concern? In other words, does the conversion function work with the normalized data as I am calling it, or does it expect raw counts?

bnprks commented 1 week ago

Yes, as(bpcells_mat, "dgCMatrix") should work as expected whether you have counts, log-normalized data, or any other set of transformations on the matrix.

You can mostly think of BPCells matrix objects as equivalent to in-memory matrices -- although BPCells doesn't support every matrix operation, if the operation doesn't immediately print an error/warning about incompatibility then it should work equivalently to an in-memory matrix.
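One way to check this equivalence on a small example is to apply the same transformation to a BPCells matrix and to an ordinary in-memory matrix, then compare the results after conversion. This is only a sketch under assumptions: the toy data, the variable names, and the choice of normalization are illustrative and do not come from the thread.

```r
library(BPCells)
library(Matrix)

# Toy genes x cells count matrix; the values are made up for illustration.
set.seed(1)
counts <- rsparsematrix(50, 10, density = 0.3,
                        rand.x = function(n) rpois(n, 3) + 1)

# Per-cell scale factors (pmax guards against an empty column).
scale_factors <- 1e4 / pmax(Matrix::colSums(counts), 1)

# The transformation staged lazily on a BPCells matrix ...
bp_norm <- log1p(multiply_cols(as(counts, "IterableMatrix"), scale_factors))

# ... and the same transformation computed eagerly on a dense matrix.
mem_norm <- log1p(sweep(as.matrix(counts), 2, scale_factors, `*`))

# Convert the BPCells matrix and measure the largest absolute difference.
converted <- as.matrix(as(bp_norm, "dgCMatrix"))
max_diff <- max(abs(converted - mem_norm))
max_diff
```

The difference is expected to be close to zero. Small floating-point discrepancies may remain, depending on the numeric precision BPCells uses for the staged operations.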