GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
388 stars 141 forks source link

Long Computation Time for addClusters with Multiple Samples #2167

Open limiao20 opened 6 months ago

limiao20 commented 6 months ago

Long Computation Time for addClusters with Multiple Samples Dear ArchR Development Team, I am currently using the ArchR package for analyzing my single-cell ATAC-seq data. I have encountered an issue where the addClusters step is taking an unexpectedly long time to complete when analyzing multiple samples. Detailed information is as follows: System Information:R Version:4.2.3,ArchR Version:1.0.2 Dataset Information:Number of Samples: 35,File Format:atac_fragments.tsv.gz.https://www.kpmp.org/available-data(Data sources)

Issue Description: When using addClusters with method = "Seurat" and resolution = 0.8, the computation time is significantly longer than expected. For instance, processing 9 samples takes around 9 hours, while processing 35 samples has taken approximately 180 hours and is still running. I have attached the log file for addcluster for your reference. ArchR-addClusters-256707ea3f45e-Date-2024-05-19_Time-18-07-53.242222.log

Code Snippet Here is the code I am using for the clustering analysis:

library(ArchR) addArchRGenome("hg38") addArchRThreads(threads = 10)

inputFiles <- c(#Input 35 atac_fragments.tsv.gz documents) sampleNames <- c(#35 sample names)

ArrowFiles <- createArrowFiles( inputFiles = inputFiles, sampleNames = sampleNames, filterTSS = 4, filterFrags = 1000, addTileMat = TRUE, addGeneScoreMat = TRUE )

doubScores <- addDoubletScores( input = ArrowFiles, k = 10, #Refers to how many cells near a "pseudo-doublet" to count. knnMethod = "UMAP", LSIMethod = 1 )

projHeme1 <- ArchRProject( ArrowFiles = ArrowFiles, outputDirectory = "HemeTutorial", copyArrows = TRUE #This is recommened so that if you modify the Arrow files you have an original copy for later usage. )

df <- getCellColData(projHeme1, select = c("log10(nFrags)", "TSSEnrichment")) df p <- ggPoint( x = df[,1], y = df[,2], colorDensity = TRUE, continuousSet = "sambaNight", xlabel = "Log10 Unique Fragments", ylabel = "TSS Enrichment", xlim = c(log10(500), quantile(df[,1], probs = 0.99)), ylim = c(0, quantile(df[,2], probs = 0.99)) ) + geom_hline(yintercept = 4, lty = "dashed") + geom_vline(xintercept = 3, lty = "dashed")

p ggsave("TSS-vs-Frags.pdf", plot = p)

p1 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "TSSEnrichment", plotAs = "ridges" ) p1 p2 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "TSSEnrichment", plotAs = "violin", alpha = 0.4, addBoxPlot = TRUE ) p2 p3 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "log10(nFrags)", plotAs = "ridges" ) p3 p4 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "log10(nFrags)", plotAs = "violin", alpha = 0.4, addBoxPlot = TRUE ) p4 library(ggplot2) library(gridExtra)

combined_plot <- grid.arrange(p1, p2, p3, p4, ncol = 2)

ggsave("QC-Sample-Statistics.pdf", plot = combined_plot, width = 8, height = 8)

p5<- plotFragmentSizes(ArchRProj = projHeme1) p5 p6 <- plotTSSEnrichment(ArchRProj = projHeme1) p6 combined_plot2 <- grid.arrange(p5, p6, ncol = 1)

ggsave("QC-Sample-FragSizes-TSSProfile.pdf", plot = combined_plot2, width = 8, height = 8) saveArchRProject(ArchRProj = projHeme1, outputDirectory = "Save-ProjHeme1", load = FALSE)

library(ArchR) projHeme1 <- loadArchRProject(path = "Save-ProjHeme1") print(projHeme1)

projHeme2 <- filterDoublets(projHeme1) projHeme2

2.1 Dimensionality Reduction with ArchR

projHeme2 <- addIterativeLSI( ArchRProj = projHeme2, useMatrix = "TileMatrix", name = "IterativeLSI", iterations = 2, clusterParams = list( #See Seurat::FindClusters resolution = c(0.2), sampleCells = 10000, n.start = 10 ), varFeatures = 25000, dimsToUse = 1:30 )

saveRDS(projHeme2, file = "projHeme2") projHeme2 <- readRDS("projHeme2")

2.2harmony

projHeme2 <- addHarmony( ArchRProj = projHeme2, reducedDims = "IterativeLSI", name = "Harmony", groupBy = "Sample" )

saveRDS(projHeme2, file = "projHeme3")

projHeme3 <- readRDS(file = "projHeme3")

3cluster

projHeme3 <- addClusters( input = projHeme3, reducedDims = "IterativeLSI", method = "Seurat", name = "Clusters", resolution = 0.8 )

saveArchRProject(ArchRProj = projHeme3, outputDirectory = "Save-ProjHeme3", load = FALSE)

head(projHeme3$Clusters) table(projHeme3$Clusters) cM <- confusionMatrix(paste0(projHeme3$Clusters), paste0(projHeme3$Sample)) cM library(pheatmap) cM <- cM / Matrix::rowSums(cM) p7 <- pheatmap::pheatmap( mat = as.matrix(cM), color = paletteContinuous("whiteBlue"), border_color = "black" ) p7 ggsave("cluster-heatmap.pdf", plot = p7)

4Single-cell Embeddings

projHeme3 <- addUMAP( ArchRProj = projHeme3, reducedDims = "IterativeLSI", name = "UMAP", nNeighbors = 30, minDist = 0.5, metric = "cosine" )

p8 <- plotEmbedding(ArchRProj = projHeme3, colorBy = "cellColData", name = "Sample", embedding = "UMAP") p9 <- plotEmbedding(ArchRProj = projHeme3, colorBy = "cellColData", name = "Clusters", embedding = "UMAP") combined_plot3 <- grid.arrange(p8, p9, ncol = 1)

ggsave("Plot-UMAP-Sample-Clusters.pdf", plot = combined_plot3, width = 8, height = 12)

saveArchRProject(ArchRProj = projHeme3, outputDirectory = "Save-ProjHeme4", load = FALSE)

Troubleshooting Steps Taken Verified the data format and content. Ensured sufficient computational resources (CPU and memory). Enabled parallel processing. Could you please help me understand why the analysis is taking so long and suggest any potential optimizations or solutions? Any guidance or recommendations would be greatly appreciated.

rcorces commented 6 months ago

Hi @limiao20! Thanks for using ArchR! Lately, it has been very challenging for me to keep up with maintenance of this package and all of my other responsibilities as a PI. I have not been responding to issue posts and I have not been pushing updates to the software. We are actively searching to hire a computational biologist to continue to develop and maintain ArchR and related tools. If you know someone who might be a good fit, please let us know! In the meantime, your issue will likely go without a reply. Most issues with ArchR right not relate to compatibility. Try reverting to R 4.1 and Bioconductor 3.15. Newer versions of Seurat and Matrix also are causing issues. Sorry for not being able to provide active support for this package at this time.

ananyakohli17 commented 3 months ago

Hi! Were you able to get past the computational time issue? I am facing a similar problem.

ananyakohli17 commented 3 months ago

Hi! Were you able to get past the computational time issue? I am facing a similar problem.

To be more specific, I have 540178 cells across 18 samples. My clustering after running harmony and using the default parameters has been running for the last 4 days. I wanted to ask if there is a certain threshold to the number of cells for which ArchR works efficiently? Thank you!