GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
385 stars 138 forks source link

Clustering issue #539

Closed asmagen closed 3 years ago

asmagen commented 3 years ago

I have been clustering a specific mouse tissue with ArchR successfully, but when I attempt to perform subclustering I get something which appears to be somewhat randomized (poorly clustered) cell to cluster mapping. I was wondering if you have an idea what could have gone wrong or how to test it. For the first clustering iteration which worked well I used the parameters:

IterativeLSI:
        iterations = 4, 
        clusterParams = list( #See Seurat::FindClusters
            resolution = c(0.1, 0.2, 0.4),
            sampleCells = 10000, 
            n.start = 10
        ), 
        varFeatures = 25000, 
        dimsToUse = 1:30
addClusters:
        reducedDims = "IterativeLSI",
        method = "Seurat",
        name = "Clusters",
        resolution = 1.2

Then after filtering a specific (large) compartment of cells that accounts for 40% of the data, with parameters:

name = "IterativeLSI2", 
        iterations = 2, 
        clusterParams = list( #See Seurat::FindClusters
            resolution = c(0.4), 
            sampleCells = 25000, 
            n.start = 10
        ), 
        sampleCellsPre = 25000,
        varFeatures = 15000, 
        dimsToUse = 1:30
reducedDims = "IterativeLSI2",
        method = "Seurat",
        name = "Clusters",
        resolution = 1.5,
        force = T

I get the following gene activation patterns which looks random:

Screen Shot 2021-02-18 at 8 32 20 AM

rcorces commented 3 years ago

The simplest explanation is that there is not biology to drive subclustering and thus cells are essentially randomly assigned to non-meaningful clusters. Perhaps you can post UMAPs of your first dim. red. and your subclustering dim. red.

jgranja24 commented 3 years ago

Also to add to @rcorces point, please supply code on how you subsetted and got to the plot you are showing.

asmagen commented 3 years ago

Thanks for following up @rcorces @jgranja24 @archrdevs -- I'll try to provide more info as much as I can, starting from a specific example of the failure. I now think it's just a normalization issue since I do know there is biological variability and I do get differential gene accessibility patterns. For example, I know that the gene C5AR2 is differentially accessible and is most highest in cluster 4. I can see that in the average gene accessibility assay (from differential accessibility analysis) and heatmaps and can confirm it with:

> mat <- as.matrix(SummarizedExperiment::assays(seMarker)[["Mean"]])
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
       C13         C9         C8        C12         C7         C2        C11 
0.09734134 0.12249700 0.12725155 0.13213156 0.13382265 0.14794171 0.18293850 
        C1        C10         C6         C5         C3         C4 
0.19114822 0.20206666 0.24080766 0.24368088 0.25087416 0.37334085 
> mat <- as.matrix(SummarizedExperiment::assays(seMarker)[["Mean"]])
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
       C13         C9         C8        C12         C7         C2        C11 
0.09734134 0.12249700 0.12725155 0.13213156 0.13382265 0.14794171 0.18293850 
        C1        C10         C6         C5         C3         C4 
0.19114822 0.20206666 0.24080766 0.24368088 0.25087416 0.37334085 
> mat <- log2(t(t(mat)/colSums(mat)) * scaleTo + 1)
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
      C13        C9        C8       C12        C7        C2       C11        C1 
0.1340124 0.1667116 0.1728095 0.1790416 0.1811950 0.1990494 0.2423751 0.2523530 
      C10        C6        C5        C3        C4 
0.2655169 0.3112795 0.3146163 0.3229367 0.4576897 
> mat <- sweep(mat - rowMeans(mat), 1, matrixStats::rowSds(mat), `/`)
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
        C13          C9          C8         C12          C7          C2 
-1.25862805 -0.89152180 -0.82306221 -0.75309565 -0.72892019 -0.52847326 
        C11          C1         C10          C6          C5          C3 
-0.04206635  0.06995285  0.21774130  0.73150682  0.76896876  0.86237898 
         C4 
 2.37521879 
> mat[mat > max(limits)] <- max(limits)
> mat[mat < min(limits)] <- min(limits)
> sort(mat[SummarizedExperiment::rowData(seMarker)$name == 'C5AR2',])
        C13          C9          C8         C12          C7          C2 
-1.25862805 -0.89152180 -0.82306221 -0.75309565 -0.72892019 -0.52847326 
        C11          C1         C10          C6          C5          C3 
-0.04206635  0.06995285  0.21774130  0.73150682  0.76896876  0.86237898 
         C4 
 2.37521879 

This is basically showing that C4 is consistently the highest for this gene regardless of the various normalization steps when the average values are used.

On the other hand, when I try to look at the individual cell level, as in the plot I sent in the previous message, with a similar normalization steps, I don't see that gene higher in C4:

GeneScoreMatrix = getMatrixFromProject(human_TN, "GeneScoreMatrix")
genes = rowData(GeneScoreMatrix)$name
GeneScoreMatrix_assay = as.matrix(SummarizedExperiment::assays(GeneScoreMatrix)[[1]])
cell_to_cluster_map = human_TN$Clusters
> mat <- GeneScoreMatrix_assay
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
       C2        C9        C7       C13        C6        C3       C12        C8 
0.1497407 0.1500423 0.1736846 0.1756204 0.1811533 0.1831115 0.1834043 0.1859645 
       C5       C11        C4        C1       C10 
0.1861856 0.1923411 0.1994413 0.2091466 0.2174349 
> mat <- log2(t(t(mat)/colSums(mat)) * scaleTo + 1)
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
       C2        C9        C7       C13        C1        C5        C6       C10 
0.1208632 0.1304796 0.1486245 0.1500083 0.1501611 0.1545504 0.1545591 0.1556349 
      C12        C4        C8       C11        C3 
0.1571651 0.1625396 0.1628328 0.1645879 0.1766497 
> mat <- sweep(mat - rowMeans(mat), 1, matrixStats::rowSds(mat), `/`)
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
           C2            C9            C7           C13            C1 
-0.0812505589 -0.0588219396 -0.0165020741 -0.0132747078 -0.0129182301 
           C5            C6           C10           C12            C4 
-0.0026809522 -0.0026606681 -0.0001515669  0.0034173110  0.0159523990 
           C8           C11            C3 
 0.0166362248  0.0207298574  0.0488618044 
> mat[mat > max(limits)] <- max(limits)
> mat[mat < min(limits)] <- min(limits)
> sort(sapply(split(mat[genes == 'C5AR2',],cell_to_cluster_map),mean))
           C2            C9            C1            C7           C13 
-0.0852700637 -0.0599201168 -0.0265761822 -0.0165020741 -0.0150040755 
          C10            C5            C6           C12            C4 
-0.0133435484 -0.0071724593 -0.0056736653  0.0002059736  0.0103429302 
           C8           C11            C3 
 0.0141255517  0.0207298574  0.0479984888 

I used for both analysis:

limits = c(-5, 5)
scaleTo = 10^4

What is the difference between the two observations and how can I fix that so that the single cell level matches the average trends?

Thank you

asmagen commented 3 years ago

Any idea what might it be or how to look for clues? @archrdevs

rcorces commented 3 years ago

Could you post your umaps of your first dim. red. and your subclustering dim. red.? Or email them to us if you arent comfortable?

rcorces commented 3 years ago

Closing due to inactivity. Feel free to comment again here if you need additional help.

khain2650 commented 2 years ago

Hi all. I ran into a similar problem to the original poster's, and am hoping you'd be able to help resolve it. I have a human tissue dataset that clustered nicely. Screen Shot 2022-08-25 at 5 13 45 PM I subsetted clusters that are CD45/PTPRC positive (C7, C8, C9, C10) to perform subclustering using the code below: idxSample <- BiocGenerics::which(proj1$Clusters %in% c('C7','C8','C9','C10')) cellsSample <- proj1$cellNames[idxSample] subsetproj <- proj1[cellsSample, ]

The resulting subsetproj object contains 4500 cells. When I reperform dimensionality reduction, clustering and addumap, this is what I get: Screen Shot 2022-08-25 at 5 18 01 PM

I'd love to hear any input you may have. Thank you for your help!