Closed asmagen closed 3 years ago
The simplest explanation is that there is not biology to drive subclustering and thus cells are essentially randomly assigned to non-meaningful clusters. Perhaps you can post UMAPs of your first dim. red. and your subclustering dim. red.
Also to add to @rcorces point, please supply code on how you subsetted and got to the plot you are showing.
Thanks for following up @rcorces @jgranja24 @archrdevs -- I'll try to provide more info as much as I can, starting from a specific example of the failure. I now think it's just a normalization issue since I do know there is biological variability and I do get differential gene accessibility patterns. For example, I know that the gene C5AR2 is differentially accessible and is most highest in cluster 4. I can see that in the average gene accessibility assay (from differential accessibility analysis) and heatmaps and can confirm it with:
> mat <- as.matrix(SummarizedExperiment::assays(seMarker)[["Mean"]])
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
C13 C9 C8 C12 C7 C2 C11
0.09734134 0.12249700 0.12725155 0.13213156 0.13382265 0.14794171 0.18293850
C1 C10 C6 C5 C3 C4
0.19114822 0.20206666 0.24080766 0.24368088 0.25087416 0.37334085
> mat <- as.matrix(SummarizedExperiment::assays(seMarker)[["Mean"]])
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
C13 C9 C8 C12 C7 C2 C11
0.09734134 0.12249700 0.12725155 0.13213156 0.13382265 0.14794171 0.18293850
C1 C10 C6 C5 C3 C4
0.19114822 0.20206666 0.24080766 0.24368088 0.25087416 0.37334085
> mat <- log2(t(t(mat)/colSums(mat)) * scaleTo + 1)
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
C13 C9 C8 C12 C7 C2 C11 C1
0.1340124 0.1667116 0.1728095 0.1790416 0.1811950 0.1990494 0.2423751 0.2523530
C10 C6 C5 C3 C4
0.2655169 0.3112795 0.3146163 0.3229367 0.4576897
> mat <- sweep(mat - rowMeans(mat), 1, matrixStats::rowSds(mat), `/`)
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
C13 C9 C8 C12 C7 C2
-1.25862805 -0.89152180 -0.82306221 -0.75309565 -0.72892019 -0.52847326
C11 C1 C10 C6 C5 C3
-0.04206635 0.06995285 0.21774130 0.73150682 0.76896876 0.86237898
C4
2.37521879
> mat[mat > max(limits)] <- max(limits)
> mat[mat < min(limits)] <- min(limits)
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
C13 C9 C8 C12 C7 C2
-1.25862805 -0.89152180 -0.82306221 -0.75309565 -0.72892019 -0.52847326
C11 C1 C10 C6 C5 C3
-0.04206635 0.06995285 0.21774130 0.73150682 0.76896876 0.86237898
C4
2.37521879
This is basically showing that C4 is consistently the highest for this gene regardless of the various normalization steps when the average values are used.
On the other hand, when I try to look at the individual cell level, as in the plot I sent in the previous message, with a similar normalization steps, I don't see that gene higher in C4:
GeneScoreMatrix = getMatrixFromProject(human_TN, "GeneScoreMatrix")
genes = rowData(GeneScoreMatrix)$name
GeneScoreMatrix_assay = as.matrix(SummarizedExperiment::assays(GeneScoreMatrix)[[1]])
cell_to_cluster_map = human_TN$Clusters
> mat <- GeneScoreMatrix_assay
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
C2 C9 C7 C13 C6 C3 C12 C8
0.1497407 0.1500423 0.1736846 0.1756204 0.1811533 0.1831115 0.1834043 0.1859645
C5 C11 C4 C1 C10
0.1861856 0.1923411 0.1994413 0.2091466 0.2174349
> mat <- log2(t(t(mat)/colSums(mat)) * scaleTo + 1)
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
C2 C9 C7 C13 C1 C5 C6 C10
0.1208632 0.1304796 0.1486245 0.1500083 0.1501611 0.1545504 0.1545591 0.1556349
C12 C4 C8 C11 C3
0.1571651 0.1625396 0.1628328 0.1645879 0.1766497
> mat <- sweep(mat - rowMeans(mat), 1, matrixStats::rowSds(mat), `/`)
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
C2 C9 C7 C13 C1
-0.0812505589 -0.0588219396 -0.0165020741 -0.0132747078 -0.0129182301
C5 C6 C10 C12 C4
-0.0026809522 -0.0026606681 -0.0001515669 0.0034173110 0.0159523990
C8 C11 C3
0.0166362248 0.0207298574 0.0488618044
> mat[mat > max(limits)] <- max(limits)
> mat[mat < min(limits)] <- min(limits)
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
C2 C9 C1 C7 C13
-0.0852700637 -0.0599201168 -0.0265761822 -0.0165020741 -0.0150040755
C10 C5 C6 C12 C4
-0.0133435484 -0.0071724593 -0.0056736653 0.0002059736 0.0103429302
C8 C11 C3
0.0141255517 0.0207298574 0.0479984888
I used for both analysis:
limits = c(-5, 5)
scaleTo = 10^4
What is the difference between the two observations and how can I fix that so that the single cell level matches the average trends?
Thank you
Any idea what might it be or how to look for clues? @archrdevs
Could you post your umaps of your first dim. red. and your subclustering dim. red.? Or email them to us if you arent comfortable?
Closing due to inactivity. Feel free to comment again here if you need additional help.
Hi all. I ran into a similar problem to the original poster's, and am hoping you'd be able to help resolve it. I have a human tissue dataset that clustered nicely.
I subsetted clusters that are CD45/PTPRC positive (C7, C8, C9, C10) to perform subclustering using the code below:
idxSample <- BiocGenerics::which(proj1$Clusters %in% c('C7','C8','C9','C10')) cellsSample <- proj1$cellNames[idxSample] subsetproj <- proj1[cellsSample, ]
The resulting subsetproj object contains 4500 cells. When I reperform dimensionality reduction, clustering and addumap, this is what I get:
I'd love to hear any input you may have. Thank you for your help!
I have been clustering a specific mouse tissue with ArchR successfully, but when I attempt to perform subclustering I get something which appears to be somewhat randomized (poorly clustered) cell to cluster mapping. I was wondering if you have an idea what could have gone wrong or how to test it. For the first clustering iteration which worked well I used the parameters:
Then after filtering a specific (large) compartment of cells that accounts for 40% of the data, with parameters:
I get the following gene activation patterns which looks random: