Open limiao20 opened 6 months ago
Hi @limiao20! Thanks for using ArchR! Lately, it has been very challenging for me to keep up with maintenance of this package and all of my other responsibilities as a PI. I have not been responding to issue posts and I have not been pushing updates to the software. We are actively searching to hire a computational biologist to continue to develop and maintain ArchR and related tools. If you know someone who might be a good fit, please let us know! In the meantime, your issue will likely go without a reply. Most issues with ArchR right not relate to compatibility. Try reverting to R 4.1 and Bioconductor 3.15. Newer versions of Seurat and Matrix also are causing issues. Sorry for not being able to provide active support for this package at this time.
Hi! Were you able to get past the computational time issue? I am facing a similar problem.
Hi! Were you able to get past the computational time issue? I am facing a similar problem.
To be more specific, I have 540178 cells across 18 samples. My clustering after running harmony and using the default parameters has been running for the last 4 days. I wanted to ask if there is a certain threshold to the number of cells for which ArchR works efficiently? Thank you!
Long Computation Time for addClusters with Multiple Samples Dear ArchR Development Team, I am currently using the ArchR package for analyzing my single-cell ATAC-seq data. I have encountered an issue where the
addClusters
step is taking an unexpectedly long time to complete when analyzing multiple samples. Detailed information is as follows: System Information:R Version:4.2.3,ArchR Version:1.0.2 Dataset Information:Number of Samples: 35,File Format:atac_fragments.tsv.gz.https://www.kpmp.org/available-data(Data sources)Issue Description: When using
addClusters
withmethod = "Seurat"
andresolution = 0.8
, the computation time is significantly longer than expected. For instance, processing 9 samples takes around 9 hours, while processing 35 samples has taken approximately 180 hours and is still running. I have attached the log file for addcluster for your reference. ArchR-addClusters-256707ea3f45e-Date-2024-05-19_Time-18-07-53.242222.logCode Snippet Here is the code I am using for the clustering analysis:
library(ArchR) addArchRGenome("hg38") addArchRThreads(threads = 10)
inputFiles <- c(#Input 35 atac_fragments.tsv.gz documents) sampleNames <- c(#35 sample names)
ArrowFiles <- createArrowFiles( inputFiles = inputFiles, sampleNames = sampleNames, filterTSS = 4, filterFrags = 1000, addTileMat = TRUE, addGeneScoreMat = TRUE )
doubScores <- addDoubletScores( input = ArrowFiles, k = 10, #Refers to how many cells near a "pseudo-doublet" to count. knnMethod = "UMAP", LSIMethod = 1 )
projHeme1 <- ArchRProject( ArrowFiles = ArrowFiles, outputDirectory = "HemeTutorial", copyArrows = TRUE #This is recommened so that if you modify the Arrow files you have an original copy for later usage. )
df <- getCellColData(projHeme1, select = c("log10(nFrags)", "TSSEnrichment")) df p <- ggPoint( x = df[,1], y = df[,2], colorDensity = TRUE, continuousSet = "sambaNight", xlabel = "Log10 Unique Fragments", ylabel = "TSS Enrichment", xlim = c(log10(500), quantile(df[,1], probs = 0.99)), ylim = c(0, quantile(df[,2], probs = 0.99)) ) + geom_hline(yintercept = 4, lty = "dashed") + geom_vline(xintercept = 3, lty = "dashed")
p ggsave("TSS-vs-Frags.pdf", plot = p)
p1 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "TSSEnrichment", plotAs = "ridges" ) p1 p2 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "TSSEnrichment", plotAs = "violin", alpha = 0.4, addBoxPlot = TRUE ) p2 p3 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "log10(nFrags)", plotAs = "ridges" ) p3 p4 <- plotGroups( ArchRProj = projHeme1, groupBy = "Sample", colorBy = "cellColData", name = "log10(nFrags)", plotAs = "violin", alpha = 0.4, addBoxPlot = TRUE ) p4 library(ggplot2) library(gridExtra)
combined_plot <- grid.arrange(p1, p2, p3, p4, ncol = 2)
ggsave("QC-Sample-Statistics.pdf", plot = combined_plot, width = 8, height = 8)
p5<- plotFragmentSizes(ArchRProj = projHeme1) p5 p6 <- plotTSSEnrichment(ArchRProj = projHeme1) p6 combined_plot2 <- grid.arrange(p5, p6, ncol = 1)
ggsave("QC-Sample-FragSizes-TSSProfile.pdf", plot = combined_plot2, width = 8, height = 8) saveArchRProject(ArchRProj = projHeme1, outputDirectory = "Save-ProjHeme1", load = FALSE)
library(ArchR) projHeme1 <- loadArchRProject(path = "Save-ProjHeme1") print(projHeme1)
projHeme2 <- filterDoublets(projHeme1) projHeme2
2.1 Dimensionality Reduction with ArchR
projHeme2 <- addIterativeLSI( ArchRProj = projHeme2, useMatrix = "TileMatrix", name = "IterativeLSI", iterations = 2, clusterParams = list( #See Seurat::FindClusters resolution = c(0.2), sampleCells = 10000, n.start = 10 ), varFeatures = 25000, dimsToUse = 1:30 )
saveRDS(projHeme2, file = "projHeme2") projHeme2 <- readRDS("projHeme2")
2.2harmony
projHeme2 <- addHarmony( ArchRProj = projHeme2, reducedDims = "IterativeLSI", name = "Harmony", groupBy = "Sample" )
saveRDS(projHeme2, file = "projHeme3")
projHeme3 <- readRDS(file = "projHeme3")
3cluster
projHeme3 <- addClusters( input = projHeme3, reducedDims = "IterativeLSI", method = "Seurat", name = "Clusters", resolution = 0.8 )
saveArchRProject(ArchRProj = projHeme3, outputDirectory = "Save-ProjHeme3", load = FALSE)
head(projHeme3$Clusters) table(projHeme3$Clusters) cM <- confusionMatrix(paste0(projHeme3$Clusters), paste0(projHeme3$Sample)) cM library(pheatmap) cM <- cM / Matrix::rowSums(cM) p7 <- pheatmap::pheatmap( mat = as.matrix(cM), color = paletteContinuous("whiteBlue"), border_color = "black" ) p7 ggsave("cluster-heatmap.pdf", plot = p7)
4Single-cell Embeddings
projHeme3 <- addUMAP( ArchRProj = projHeme3, reducedDims = "IterativeLSI", name = "UMAP", nNeighbors = 30, minDist = 0.5, metric = "cosine" )
p8 <- plotEmbedding(ArchRProj = projHeme3, colorBy = "cellColData", name = "Sample", embedding = "UMAP") p9 <- plotEmbedding(ArchRProj = projHeme3, colorBy = "cellColData", name = "Clusters", embedding = "UMAP") combined_plot3 <- grid.arrange(p8, p9, ncol = 1)
ggsave("Plot-UMAP-Sample-Clusters.pdf", plot = combined_plot3, width = 8, height = 12)
saveArchRProject(ArchRProj = projHeme3, outputDirectory = "Save-ProjHeme4", load = FALSE)
Troubleshooting Steps Taken Verified the data format and content. Ensured sufficient computational resources (CPU and memory). Enabled parallel processing. Could you please help me understand why the analysis is taking so long and suggest any potential optimizations or solutions? Any guidance or recommendations would be greatly appreciated.