bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

problem with BPCells and PCA/Dim Reductions #130

Closed fingeram closed 1 month ago

fingeram commented 2 months ago

Hi,

I am having an issue with RunPCA and running dimensionality reduction from a seurat object generated with BP Cells. I follow the general protocol for writing the matrix directory and creating the seurat object. As far as I can tell the seurat object and count slot looks as expected. However, when I reach the step of RunPCA and generate an Elbow plot, it looks wrong and I do not understand why. I am already working with the development version of seurat. The dataset I am working with has about 1.3M cells that's why I am using the sketch assay. However, I also applied the same steps with a subsampled seurat object without sketch assay and the issue is the same. Hence I believe it must be some sort of incompatibility between BPCells and Seurat v5.

import raw data from h5ad file

raw <- open_matrix_anndata_hdf5(path="/novo/projects/departments/compbio/sysbio/Projects/mouse_liver_models/single_cell_and_nuclei/concatenated.dir/concatenated.h5ad") raw <- convert_matrix_type(raw, type = "uint32_t") #must convert count matrix from type float (non-integer) to integer values

write matrix directory for on-disk storage

write_matrix_dir(mat = raw, dir = "/novo/projects/shared_projects/liver_biology_colab/people/aqnf/mouse_sc_sn_AQNF_June24/BPcells/mouse_counts")

load data from matrix directory to generate seurat object

raw.mat <- open_matrix_dir(dir = "/novo/projects/shared_projects/liver_biology_colab/people/aqnf/mouse_sc_sn_AQNF_June24/BPcells/mouse_counts")

raw.mat 33696 x 1328118 IterableMatrix object with class MatrixDir

Row names: Xkr4, Gm1992 ... ENSMUSG00000095041 Col names: AAACCCAAGCCTGAGA-97, AAACCCAGTCGTACAT-97 ... TTTGTTGTCTGCATGA-96

Data type: uint32_t Storage order: column major

Queued Operations:

  1. Load compressed matrix from directory /novo/projects/shared_projects/liver_biology_colab/people/aqnf/mouse_sc_sn_AQNF_June24/BPcells/mouse_counts

generate Seurat object from on-disk matrix

sobj <- CreateSeuratObject(counts = raw.mat)

sobj@assays$RNA$counts 33696 x 1328118 IterableMatrix object with class RenameDims

Row names: Xkr4, Gm1992 ... ENSMUSG00000095041 Col names: AAACCCAAGCCTGAGA-97, AAACCCAGTCGTACAT-97 ... TTTGTTGTCTGCATGA-96

Data type: uint32_t Storage order: column major

Queued Operations:

  1. Load compressed matrix from directory /novo/projects/shared_projects/liver_biology_colab/people/aqnf/mouse_sc_sn_AQNF_June24/BPcells/mouse_counts
  2. Reset dimnames

sobj <- NormalizeData(sobj) sobj <- FindVariableFeatures(sobj)

sobj.sketch <- SketchData( object = sobj, ncells = 50000, method = "LeverageScore", sketched.assay = "sketch")

DefaultAssay(sobj.sketch) <- "sketch"

sobj.sketch <- FindVariableFeatures(sobj.sketch) sobj.sketch <- ScaleData(sobj.sketch) sobj.sketch <- RunPCA(sobj.sketch)

ElbowPlot(sobj.sketch, ndims = 50, reduction = 'pca') Capture

bnprks commented 2 months ago

Hi @fingeram, thanks for your question. My best guess at what is going wrong is that maybe BPCells has mis-interpreted some floats as ints or vice-versa. This can sometimes result in numbers that are absurdly large, as you're seeing with your PCA standard deviation here.

I've tested your code on the SeuratData pbmc3k demo dataset and it works fine, leading me to think the problem is likely prior to CreateSeuratObject.

To try diagnosing this, I'd double-check the values you're getting in raw.mat to make sure they look reasonable. You can do this, as follows:

vals <- as(raw.mat[1:20,1:20], "dgCMatrix")

Basically you just want to double-check that you should see mostly small integers (probably majority of non-zeros will be 1), rather than some very large numbers.

If you get weird numbers at that stage, could you test around to see at what point in the prep of raw.mat you start getting weird values? If it starts from the point you run open_matrix_anndata_hdf5, maybe you could also share the output of running:

h5ls -v /novo/projects/departments/compbio/sysbio/Projects/mouse_liver_models/single_cell_and_nuclei/concatenated.dir/concatenated.h5ad/X

As a side-note, since BPCells is able to handle very large datasets you don't necessarily need to do sketching just out of worry of having too many cells to run efficiently. But it's fair if you have some other reason for wanting to do sketching that's reasonable.

Let me know what you find with those checks, and you might double-check that the issue doesn't go away if you install a recent (i.e. within the last two months) version of BPCells

-Ben

fingeram commented 2 months ago

Hi @bnprks,

thank you for the quick response! I checked the counts as recommended and to me they look normal (see image). Also when I subsample the dataset and convert the whole count matrix to dgCMatrix class they look normal. Many zeros of course but the non-zero values look fine. That's why it is so hard for me to figure out. Also if I convert the matrix class and then run RunPCA, the Elbowplot looks normal again but my cell embeddings are all NaN.

And the sketching is mostly to save time when computing the dim reductions or running other downstream analysis of the dataset. But you think with BPCells when the count data is on-disk that is not an issue?

Capture

fingeram commented 2 months ago

Hi @bnprks,

to follow up on the example with reducing the size of the dataset by subsampling for a specific cell population. I subsample the full seurat object with the BPCells matrix, then I convert the full count matrix to class dgCMatrix (counts still look fine, see below), I run the downstream processing steps but still the RunPCA generates non-sense output:

#read in the full dataset sobj <- qs::qread("/novo/projects/shared_projects/liver_biology_colab/people/aqnf/mouse_sc_sn_AQNF_June24/data_100mice_bsck/scsn_100mice_reseq_240917_AQNF.qs")

#subset based on experimental condition and cell population sobj.gan <- subset(sobj, Chow == 'GAN')

stromal <- sobj.gan %>% subset(subset = seurat_cluster_harmony %in% c("23", "18", "6", "5", "47")) %>% DietSeurat(assays = c('RNA'))

vals <- as(stromal@assays$RNA$counts[1:20,1:20], "dgCMatrix")

stromal@assays$RNA$counts 33696 x 48010 IterableMatrix object with class RenameDims

Row names: Xkr4, Gm1992 ... ENSMUSG00000095041 Col names: AGCATCAGTCTAACGT-97, AGCGATTAGGCGAAGG-97 ... TGCTTCGAGGCCGCTT-96

Data type: float Storage order: column major

Queued Operations:

  1. Load compressed matrix from directory /nfs_home/projects/shared_projects/liver_biology_colab/data/sc_snRNAseq/bpcells/sc_sn_raw_reseq
  2. Select rows: 1, 2 ... 33696 and cols: 874, 903 ... 1327511
  3. Reset dimnames

vals 20 x 20 sparse Matrix of class "dgCMatrix" [[ suppressing 20 column names ‘AGCATCAGTCTAACGT-97’, ‘AGCGATTAGGCGAAGG-97’, ‘ATCAGGTTCAAGTGGG-97’ ... ]]

Xkr4 . 1 . 1 3 . . 6 . . . . . . . . . . . . Gm1992 . . . . . . . . . . . . . . . . . . . . Gm19938 . . . . . . . . . . . . . . . . . . . . Gm37381 . . . . . . . . . . . . . . . . . . . . Rp1 . . . . . . . . . . . . . . . . . . . . Sox17 . . . . . . . . . . . . . . . . . . . . Gm37587 . . . . . . . . . . . . . . . . . . . . Gm37323 . . . . . . . . . . . . . . . . . . . . Mrpl15 . . 1 1 1 1 . . . . . . . . . . . . . . A930006A01Rik . . . . . . . . . . . . . . . . . . . . Lypla1 1 . 1 . 3 . . . . . . . . . . . . 1 . . Tcea1 . 1 . 1 3 . 1 3 1 . . . . . . . . 1 . . Rgs20 . . . . . . . . . . . . . . . . . . . . Gm16041 . . . . . . . . . . . . . . . . . . . . Atp6v1h . . . . . . . . . . 1 . . . . . . . 2 . Oprk1 . . . . . . . . . . . . . . . . . . . . Npbwr1 . . . . . . . . . . . . . . . . . . . . 4732440D04Rik . . . . . . . . . . . . . . . 1 . . . . Rb1cc1 . 1 1 3 5 . 1 3 . 1 . 1 3 1 . . . . . . Alkal1 . . . . . . . . . . . . . . . . . . . .

#transform dense to sparse matrix stromal[["RNA"]]$counts <- as(stromal[["RNA"]]$counts, "dgCMatrix")

I am readding the dimnames because the slot is empty (but not sure if that is necessary)

stromal@assays[["RNA"]]@layers[["counts"]]@Dimnames[[1]] <- rownames(stromal@assays[["RNA"]]@features) stromal@assays[["RNA"]]@layers[["counts"]]@Dimnames[[2]] <- rownames(stromal@assays[["RNA"]]@cells)

stromal[["RNA"]]$counts 33696 x 48010 sparse Matrix of class "dgCMatrix" [[ suppressing 134 column names ‘AGCATCAGTCTAACGT-97’, ‘AGCGATTAGGCGAAGG-97’, ‘ATCAGGTTCAAGTGGG-97’ ... ]] [[ suppressing 134 column names ‘AGCATCAGTCTAACGT-97’, ‘AGCGATTAGGCGAAGG-97’, ‘ATCAGGTTCAAGTGGG-97’ ... ]]

Xkr4 . 1 . 1 3 . . 6 . . . . . . . . . . . . . . . . . . . . . 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . ...... Gm1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...... Gm19938 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...... Gm37381 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

.............................. ........suppressing 47876 columns and 33689 rows in show(); maybe adjust options(max.print=, width=) .............................. [[ suppressing 134 column names ‘AGCATCAGTCTAACGT-97’, ‘AGCGATTAGGCGAAGG-97’, ‘ATCAGGTTCAAGTGGG-97’ ... ]]

ENSMUSG00000094855 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ENSMUSG00000095019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ENSMUSG00000095041 . 3 2 1 5 2 . 1 . . . 1 . . . 1 . . . . . . . 1 . 1 . . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . 1 1 . 1 . . . 1 . . 1 1 . . . . 1 . 1 . . . . . . . 3 . . 2 . . 1 . 1 . . . . 1 . . . . 1 . . . . . . 1 . . 1 . . 1 . . . .

ENSMUSG00000094855 ...... ENSMUSG00000095019 ...... ENSMUSG00000095041 ......

#processing DefaultAssay(stromal) <- "RNA"
stromal <- NormalizeData(stromal) stromal <- FindVariableFeatures(stromal) stromal <- ScaleData(stromal) stromal <- RunPCA(stromal, npcs = 30, verbose = TRUE) ElbowPlot(stromal, ndims= 30)

2147bae0-d086-445e-9ed7-cb2ac14be76c

fingeram commented 2 months ago

image

bnprks commented 2 months ago

Thanks for the checks and the extra information. To summarize, it sounds like the current status is:

  1. The data in raw.mat looks reasonable at first glance
  2. If you convert a subset counts matrix to dgCMatrix then run Normalization + PCA you still get weird PCs even without involving the sketching or BPCells

If I'm understanding both those points correctly, then it would seem to indicate the issue is more likely with something in the data matrix since it causes errors with both BPCells and dgCMatrix. Though if there is an issue it must be in only a small number of entries since your spot-checks looked fine.

I'd encourage you to do some checks on your data matrix to check that there aren't an NA values or negative values in the counts:

# the @x slot is the values in a dgCMatrix
anyNA(stromal[["RNA"]]$counts@x) # should be FALSE
all(stromal[["RNA"]]$counts@x >= 0) # should be TRUE
any(stromal[["RNA"]]$counts@x == 0) # May or may not be true depending on whether any explicit zeros are stored in the matrix

If none of that finds anything suspicious, I'd try re-doing your test of point (2) I mentioned above very carefully to isolate if the problem is BPCells-dependent or not. To do that, I'd make a subset of the counts matrix, convert it to dgCMatrix, then make a fresh seurat object via CreateSeuratObject and do the analysis up to an Elbow plot. If you still get the same issues then we can at least eliminate a few sources of suspicion (i.e. may not be related to BPCells or sketching)

fingeram commented 1 month ago

Hi @bnprks,

apologies for my delayed reply, the weekend came in between. :)

I performed the tests that you suggested and for me the count matrix seems normal after transforming into dgCMatrix. However, same problem with RunPCA and the ElbowPlot. Very confusing. But I agree that probably there is nothing wrong with BPCells or the count matrix per se that is causing the issues.

image

I was not 100% sure I understood you other suggestions. But I followed the steps below and the problem persist. Do you think RunPCA could be the issue?

counts <- GetAssayData(object = stromal, assay = "RNA", slot = "counts") stromal.new <- CreateSeuratObject(counts = counts) stromal.new <- NormalizeData(stromal.new) stromal.new <- FindVariableFeatures(stromal.new) stromal.new <- ScaleData(stromal.new) stromal.new <- RunPCA(stromal.new, npcs = 30, verbose = TRUE, assay = "RNA", slot="counts") ElbowPlot(stromal.new, ndims= 30)

bnprks commented 1 month ago

Hi @fingeram, I'm not really sure what's going on here. Your second test matched what I had in mind -- basically I wanted to make sure there wasn't something unexpected slipping through in the data or scale.data slots.

I agree the problem seems to not be BPCells-related at this point. The only extra thing to check about the data I can think of would be if the max value is crazy high, e.g. max(stromal[["RNA"]]$counts@x) being above 100k.

Because the in-memory RunPCA workflow you're showing has been stable for so long, I'd be surprised if it was an issue in Seurat, though that's not completely out of the question. (I might check if a fresh install fixes anything).

My last debugging suggestion would be to work off the stromal.new workflow you showed and check between every step if any of the the counts, data or scale.data slots end up with very high or low values. If counts looks reasonable, then logically the issue must happen between CreateSeuratObject and ElbowPlot.

Unfortunately I don't think I'll be able to help much more beyond that, as I'm really just able to help with BPCells-related issues here.

fingeram commented 1 month ago

Hi @bnprks,

thank you so much for your help with this!! I looked for the largest value in my count matrix as you recommended but it also looked normal:

max(stromal[["RNA"]]$counts@x) [1] 1616

After long long digging, I found these two posts on the Seurat Discussions/Github page where people describe they were having issues with RunPCA when the number of npcs was set too high: #7451 and #7457.

So I tried RunPCA again while reducing npcs = 10 (I tried 30 and 50 before and both resulted in non-sense PCAs) and that worked fine now, i.e. Elbow Plot and Embeddings look fine. So I guess it is a bug related to Seurat's RunPCA.

Hope this helps if someone has a similar issue in the future! Again, thanks a lot for you help :) Best, Anna

bnprks commented 1 month ago

Thanks for posting back with your solution, glad you've at least been able to find a workaround to the problem