Sandman-1 opened 2 months ago

Hello! Thank you so much for this package. I would like to run svds using the BPCells wrapper function on a matrix containing 9,096 genes and 4,188,533 cells. I followed the workflow outlined in the vignette, including writing a temporary matrix file using `write_matrix_dir(scaled, tempfile("mat"))` before running `BPCells::svds` on this temporary matrix. Unfortunately, it returned the following error after ~40 hours of runtime: `Error: TridiagEigen: eigen decomposition failed`. Could you please help me resolve this issue?
Hi @Sandman-1, thanks for your question! First off, 40 hours is much longer than this operation should take -- I think 4 million cells with 2k variable features and k=50 should take under 30 minutes single-threaded, and probably under 10 minutes with 6 cores (though this can vary a bit depending on sequencing depth and choice of variable gene count).
One issue I see is that you should call `write_matrix_dir` before you call `add_cols`. The reason is that `add_cols` will cause your matrix to no longer be sparse (almost no entries will be equal to exactly 0). Although BPCells can handle `add_cols` efficiently during `svds`, the `write_matrix_dir` call loses that information.
Another slightly strange thing in your code above is that normally people scale by the gene mean and variance, not the cell stats. This may be changing your matrix in a way that causes the decomposition to fail. (From searching the TridiagEigen error, it comes from RSpectra / Spectra, and seems to be related to the contents and eigenvectors of specific matrices.)
I'd suggest something like:

```r
srt <- FindVariableFeatures(srt, nfeatures = round(0.25 * nrow(srt)))

# Requesting "variance" row stats also returns the per-gene means
gene_stats <- matrix_stats(srt@assays$RNA$data, row_stats = "variance")$row_stats[, VariableFeatures(srt)]

# Scale first, save the still-sparse matrix, then apply the dense shift
scaled <- srt@assays$RNA$data[VariableFeatures(srt), ] %>%
  multiply_rows(1 / gene_stats["variance", ]) %>%
  write_matrix_dir(tempfile("mat")) %>%
  add_rows(-gene_stats["mean", ] / gene_stats["variance", ])

svd <- BPCells::svds(scaled, k = 100, threads = 6)
pca <- multiply_cols(svd$v, svd$d)
```
That should at the very least run much faster, and hopefully it also avoids the Eigen error. Let me know how it goes.
-Ben
Thank you so much! The matrix stats function is still running from yesterday morning but thankfully hasn't crashed yet. It is definitely not as fast as finding cell stats instead of gene stats. I will say that the matrix I am using as input was normalized by inverse document frequency and L2 norm after running ALRA on the library-normalized counts, so I thought it was more similar to the normalization used in ATAC-seq analyses. But I am cool with using the gene-wise stats too! I did run FindVariableFeatures on my matrix using Seurat, which finished in under 30 minutes. Is there any way I can use the stats collected from this function?
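(A minimal sketch of one way those Seurat stats can be reused, assuming FindVariableFeatures ran with the default method = "vst"; HVFInfo's column names vary by method and Seurat version:)

```r
# Sketch only: reuse the per-gene stats Seurat already computed instead of
# re-running matrix_stats(). Assumes "vst" columns named mean and variance.
library(Seurat)
library(BPCells)
library(magrittr)

hvf <- HVFInfo(srt)[VariableFeatures(srt), ]  # data.frame of per-gene stats

scaled <- srt@assays$RNA$data[VariableFeatures(srt), ] %>%
  multiply_rows(1 / hvf$variance) %>%
  write_matrix_dir(tempfile("mat")) %>%
  add_rows(-hvf$mean / hvf$variance)
```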
Hello! I have an update on this issue. I did in fact use the stats calculated by the FindVariableFeatures command in Seurat to get gene-wise variance and mean expression before performing z-score normalization (scaling) on my matrix. However, I still got the same TridiagEigen error from `svds` that I listed in the first comment. My updated code is as follows. Please let me know how I can resolve this.
Hmm, I strongly suspect that the TridiagEigen error has to do with the contents of your matrix. Have you checked whether it is possible to run svd on the matrix outside of BPCells? (Or possibly a large subset of the dataset if it crashes other tools?) That would be my suggested way for you to better diagnose this issue.
If you are able to find a matrix that can decompose with e.g. the R `svd` function but can't with the `BPCells::svds` function, then that would be a useful starting point for me to do some debugging.
Otherwise, you might just double-check that you don't have any rows or columns that are all zero or all the same value (see the sketch below).
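(A minimal sketch of that check, reusing `scaled` and `matrix_stats` from the earlier snippet:)

```r
# Sketch: look for constant rows (genes) or columns (cells), which can
# trip up an eigensolver. "variance" stats also include the means.
stats <- matrix_stats(scaled, row_stats = "variance", col_stats = "variance")

sum(stats$row_stats["variance", ] == 0)  # count of zero-variance genes
sum(stats$col_stats["variance", ] == 0)  # count of zero-variance cells
```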
I won't be able to take much more of a look at this in the near term, as I am on vacation for the next 10 days, but I will be happy to check back in on this afterwards.
Hi @Sandman-1, I'm back from vacation now. I did think of another troubleshooting thing you could test, which would be calling `svd <- irlba::irlba(scaled, nv=100)`. Since the error you're getting is coming from the Spectra solver BPCells uses, this would test whether your error is solver-specific. (That won't use multiple threads, but it should keep memory usage low.)
Other than that, I'd probably need access to a data matrix that causes this error to investigate further effectively. I'm happy to arrange that over email if you like (bparks [at] stanford.edu), and you might try checking whether data subsets or randomized/anonymized versions still cause your error, to make sharing more feasible.
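(For example, a sketch of testing a random cell subset; the subset size here is illustrative:)

```r
# Sketch: check whether a random 100k-cell subset still reproduces the
# error; a small reproducer is also much easier to share.
set.seed(1234)
keep <- sample(ncol(scaled), 100000)
svd_sub <- BPCells::svds(scaled[, keep], k = 100, threads = 6)
```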
Hello, Ben! Hope your vacation went well. Thank you so much for all the suggestions you gave me. I managed to get this step sorted out. I just had to remove 10 cells that had NA values for some of their counts. Wonder how those got there, but anyways...
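(For anyone hitting the same error, a sketch of one way such cells can be located, assuming `colSums()` is supported for this matrix type so that NA entries propagate into the column sums:)

```r
# Sketch: columns (cells) containing any NA will have an NA column sum
na_cells <- which(is.na(colSums(scaled)))
length(na_cells)  # 10 in this case

clean <- scaled[, setdiff(seq_len(ncol(scaled)), na_cells)]
```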
I am now in the process of clustering and annotating my dataset, and I have a few questions regarding this. I can create a separate thread if you would like. Should I expect over 177 clusters from the cluster_graph_louvain command and over 12,000 from cluster_graph_leiden with default values? I understand I have a lot of cells (approximately 4.2 million), but this does seem like an excessive number of clusters. I especially don't understand why the Leiden algorithm is giving me clusters in the thousands.
For more information, I am using the integrated harmony reduction I created from principal components 2-100. I threw out PC 1 because it explains less variance than PC 2. My code to do all of this is as follows:
Hi @Sandman-1, glad that you were able to get the SVDs sorted out. You can lower the number of clusters you get by setting the `resolution` parameter smaller; I'd suggest starting by making it 10x smaller on leiden, then dialing it back if needed (see the sketch below). I have noticed that leiden has a tendency to give way too many clusters (hence why I've decreased the default `resolution` parameter to 0.001), but this depends on the data, so you may need something even smaller.
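(A sketch of that, assuming `snn` is the shared-nearest-neighbor graph you already built before clustering:)

```r
# Sketch: start 10x below the 0.001 default and adjust from there
clusts <- cluster_graph_leiden(snn, resolution = 0.0001)
length(unique(clusts))  # should now be far fewer clusters
```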
Ultimately there's not a "correct" number of clusters, it just depends what granularity you want for your analysis.
The BPCells clustering functions are just thin wrappers around igraph or Seurat functions, so probably the only change we'd reasonably be able to make here is adjusting the default `resolution` parameter. If you experiment and find some values that get you closer to your desired cluster count, feel free to let me know so I can think about adjusting the default for the future.