GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
384 stars 137 forks source link

Different group labels for the same predicted cell in integrated scRNA #1351

Open Zepeng-Mu opened 2 years ago

Zepeng-Mu commented 2 years ago

Hi, I have noticed strange behavior when adding GeneIntegrationMatrix. I have posted a Discussion several months ago but I think it might be useful to open an Issue as well.

Basically, when integrating scRNA into scATAC, sometime scRNA barcode is assigned to more than one scATAC cell, in this case I have noticed that the nameGroup from integration for the same cell can have DIFFERENT labels.

The Discussion post I created before that gave a detailed example is here: https://github.com/GreenleafLab/ArchR/discussions/1234 I'm using release_1.0.2.

Thanks!!

rcorces commented 2 years ago

Hi @Zepeng-Mu! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
Before we help you, you must respond to the following questions unless your original post already contained this information: 1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved? 2. Can you recapitulate your error using the tutorial code and dataset? If so, provide a reproducible example. 3. Did you post your log file? If not, add it now.

Zepeng-Mu commented 2 years ago

Hi @rcorces, I can confirm this behavior is reproducible using tutorial data.

library(ArchR)

inputFiles <- getTutorialData("Hematopoiesis")
inputFiles

addArchRGenome("hg19")
addArchRThreads(threads = 14) 

ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  minTSS = 4, #Dont set this too high because you can always increase later
  minFrags = 1000, 
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)

doubScores <- addDoubletScores(
  input = ArrowFiles,
  k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
  knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search with doublet projection.
  LSIMethod = 1
)

projHeme1 <- ArchRProject(
  ArrowFiles = ArrowFiles, 
  outputDirectory = "HemeTutorial",
  copyArrows = TRUE #This is recommened so that if you modify the Arrow files you have an original copy for later usage.
)

projHeme2 <- filterDoublets(projHeme1)

projHeme2 <- addIterativeLSI(
  ArchRProj = projHeme2,
  useMatrix = "TileMatrix", 
  name = "IterativeLSI", 
  iterations = 2, 
  clusterParams = list( #See Seurat::FindClusters
    resolution = c(0.2), 
    sampleCells = 10000, 
    n.start = 10
  ), 
  varFeatures = 25000, 
  dimsToUse = 1:30
)

projHeme2 <- addHarmony(
  ArchRProj = projHeme2,
  reducedDims = "IterativeLSI",
  name = "Harmony",
  groupBy = "Sample"
)

seRNA <- readRDS("scRNA-Hematopoiesis-Granja-2019.rds")

projHeme2 <- addGeneIntegrationMatrix(
  ArchRProj = projHeme2, 
  useMatrix = "GeneScoreMatrix",
  matrixName = "GeneIntegrationMatrix",
  reducedDims = "IterativeLSI",
  seRNA = seRNA,
  addToArrow = FALSE,
  groupRNA = "BioClassification",
  nameCell = "predictedCell_Un",
  nameGroup = "predictedGroup_Un",
  nameScore = "predictedScore_Un"
)

and then we can check:

sort(table(projHeme2$predictedCell_Un), decreasing=T)[1]

gives

CD34_32_R5:CGTAGCGAGTTCGCGC-1 
                          615

So this cell is mapped to 615 cells in scATAC dataset.

seRNA$BioClassification[colnames(seRNA) == "CD34_32_R5:CGTAGCGAGTTCGCGC-1"]
[1] "02_Early.Eryth"

and in seRNA this is annotated as "02_Early.Eryth". However,

> table(projHeme2$predictedGroup_Un[projHeme2$predictedCell_Un == "CD34_32_R5:CGTAGCGAGTTCGCGC-1"])

        01_HSC 02_Early.Eryth  03_Late.Eryth    08_GMP.Neut 
           354                       255                       1                          5

Basically for scATAC cells with predictedCell_Un "CD34_32_R5:CGTAGCGAGTTCGCGC-1", the corresponding predictedGroup_Un has four different annotated groups.

Here is the log:

          /   \     |   _  \      /      ||  |  |  | |   _  \
         /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |
        /  /_\  \   |      /     |  |     |   __   | |      /
       /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
      /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|

Logging With ArchR!

Start Time : 2022-03-24 10:11:07

------- ArchR Info

ArchRThreads = 14
ArchRGenome = Hg19

------- System Info

Computer OS = unix
Total Cores = 28

------- Session Info

R version 4.0.5 (2021-03-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /scratch/midway2/zepengmu/conda_envs/newbase/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] harmony_0.1.0                     Rcpp_1.0.8.3                      uwot_0.1.11                       gridExtra_2.3                     nabor_0.5.1
 [6] SeuratObject_4.0.4                Seurat_4.1.0                      BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.58.0                   rtracklayer_1.50.0
[11] Biostrings_2.58.0                 XVector_0.30.0                    ArchR_1.0.2                       magrittr_2.0.2                    rhdf5_2.34.0
[16] Matrix_1.4-0                      data.table_1.14.2                 SummarizedExperiment_1.20.0       Biobase_2.50.0                    GenomicRanges_1.42.0
[21] GenomeInfoDb_1.26.7               IRanges_2.24.1                    S4Vectors_0.28.1                  BiocGenerics_0.36.1               MatrixGenerics_1.2.1
[26] matrixStats_0.61.0                ggplot2_3.3.5

loaded via a namespace (and not attached):
  [1] plyr_1.8.6               igraph_1.2.11            lazyeval_0.2.2           splines_4.0.5            BiocParallel_1.24.1      listenv_0.8.0            scattermore_0.7
  [8] digest_0.6.29            htmltools_0.5.2          fansi_1.0.2              tensor_1.5               cluster_2.1.2            ROCR_1.0-11              globals_0.14.0
 [15] spatstat.sparse_2.0-0    colorspace_2.0-3         ggrepel_0.9.1            xfun_0.30                dplyr_1.0.8              crayon_1.5.0             RCurl_1.98-1.6
 [22] jsonlite_1.8.0           spatstat.data_2.1-0      survival_3.2-13          zoo_1.8-9                glue_1.6.2               polyclip_1.10-0          gtable_0.3.0
 [29] zlibbioc_1.36.0          leiden_0.3.9             DelayedArray_0.16.3      Rhdf5lib_1.12.1          future.apply_1.8.1       abind_1.4-5              scales_1.1.1
 [36] DBI_1.1.2                miniUI_0.1.1.1           viridisLite_0.4.0        xtable_1.8-4             reticulate_1.24          spatstat.core_2.3-1      htmlwidgets_1.5.4
 [43] httr_1.4.2               RColorBrewer_1.1-2       ellipsis_0.3.2           ica_1.0-2                pkgconfig_2.0.3          XML_3.99-0.9             farver_2.1.0
 [50] deldir_1.0-6             utf8_1.2.2               tidyselect_1.1.2         labeling_0.4.2           rlang_1.0.2              reshape2_1.4.4           later_1.3.0
 [57] munsell_0.5.0            tools_4.0.5              cli_3.2.0                generics_0.1.2           ggridges_0.5.3           stringr_1.4.0            fastmap_1.1.0
 [64] goftest_1.2-3            fitdistrplus_1.1-6       purrr_0.3.4              RANN_2.6.1               pbapply_1.5-0            future_1.24.0            nlme_3.1-155
 [71] mime_0.12                compiler_4.0.5           rstudioapi_0.13          plotly_4.10.0            png_0.1-7                spatstat.utils_2.2-0     tibble_3.1.6
 [78] stringi_1.7.6            RSpectra_0.16-0          lattice_0.20-45          vctrs_0.3.8              pillar_1.7.0             lifecycle_1.0.1          rhdf5filters_1.2.1
 [85] spatstat.geom_2.3-0      lmtest_0.9-39            RcppAnnoy_0.0.19         cowplot_1.1.1            bitops_1.0-7             irlba_2.3.5              httpuv_1.6.5
 [92] patchwork_1.1.1          R6_2.5.1                 promises_1.2.0.1         KernSmooth_2.23-20       parallelly_1.30.0        codetools_0.2-18         MASS_7.3-55
 [99] gtools_3.9.2             assertthat_0.2.1         withr_2.5.0              GenomicAlignments_1.26.0 sctransform_0.3.3        Rsamtools_2.6.0          GenomeInfoDbData_1.2.4
[106] mgcv_1.8-39              rpart_4.1-15             tidyr_1.2.0              Cairo_1.5-15             Rtsne_0.15               shiny_1.7.1              tinytex_0.37

------- Log Info

2022-03-24 10:11:07 : Running Seurat's Integration Stuart* et al 2019, 0.001 mins elapsed.

2022-03-24 10:11:07 : Input-Parameters, Class = list

Input-Parameters$: length = 1

1 function (name)
2 .Internal(args(name))

Input-Parameters$ArchRProj: length = 1

Input-Parameters$useMatrix: length = 1
[1] "GeneScoreMatrix"

Input-Parameters$matrixName: length = 1
[1] "GeneIntegrationMatrix"

Input-Parameters$reducedDims: length = 1
[1] "IterativeLSI"

Input-Parameters$seRNA: length = 20287
class: RangedSummarizedExperiment
dim: 6 35582
metadata(0):
assays(1): counts
rownames(6): FAM138A OR4F5 ... OR4F16 FAM87B
rowData names(3): gene_name gene_id exonLength
colnames(35582): CD34_32_R5:AAACCTGAGTATCGAA-1 CD34_32_R5:AAACCTGAGTCGTTTG-1 ... BMMC_10x_GREENLEAF_REP2:TTTGTTGCATGTGTCA-1 BMMC_10x_GREENLEAF_REP2:TTTGTTGCATTGAAAG-1
colData names(10): Group nUMI_pre ... BioClassification Barcode

Input-Parameters$groupATAC: length = 0
NULL

Input-Parameters$groupRNA: length = 1
[1] "BioClassification"

Input-Parameters$groupList: length = 0
NULL

Input-Parameters$sampleCellsATAC: length = 1
[1] 10000

Input-Parameters$sampleCellsRNA: length = 1
[1] 10000

Input-Parameters$embeddingATAC: length = 0
NULL

Input-Parameters$embeddingRNA: length = 0
NULL

Input-Parameters$dimsToUse: length = 30
[1] 1 2 3 4 5 6

Input-Parameters$scaleDims: length = 0
NULL

Input-Parameters$corCutOff: length = 1
[1] 0.75

Input-Parameters$plotUMAP: length = 1
[1] TRUE

Input-Parameters$nGenes: length = 1
[1] 2000

Input-Parameters$useImputation: length = 1
[1] TRUE

Input-Parameters$reduction: length = 1
[1] "cca"

Input-Parameters$addToArrow: length = 1
[1] FALSE

Input-Parameters$scaleTo: length = 1
[1] 10000

Input-Parameters$genesUse: length = 0
NULL

Input-Parameters$nameCell: length = 1
[1] "predictedCell_Un"

Input-Parameters$nameGroup: length = 1
[1] "predictedGroup_Un"

Input-Parameters$nameScore: length = 1
[1] "predictedScore_Un"

Input-Parameters$threads: length = 1
[1] 14

Input-Parameters$verbose: length = 1
[1] TRUE

Input-Parameters$force: length = 1
[1] FALSE

Input-Parameters$logFile: length = 1
[1] "ArchRLogs/ArchR-addGeneIntegrationMatrix-ac5dcbaebc-Date-2022-03-24_Time-10-11-07.log"

2022-03-24 10:11:07 : Checking ATAC Input, 0.006 mins elapsed.
2022-03-24 10:11:07 : Checking RNA Input, 0.011 mins elapsed.
2022-03-24 10:11:20 : Found 18601 overlapping gene names from gene scores and rna matrix!, 0.213 mins elapsed.
2022-03-24 10:11:20 : Creating Integration Blocks, 0.213 mins elapsed.
2022-03-24 10:11:20 : Prepping Interation Data, 0.218 mins elapsed.
2022-03-24 10:11:21 : Computing Integration in 1 Integration Blocks!, 0 mins elapsed.
2022-03-24 10:11:21 : Block (1 of 1) : Computing Integration, 0 mins elapsed.
2022-03-24 10:11:25 : Block (1 of 1) : Identifying Variable Genes, 0.063 mins elapsed.
2022-03-24 10:11:29 : Block (1 of 1) : Getting GeneScoreMatrix, 0.133 mins elapsed.

2022-03-24 10:11:37 : GeneScoreMat-Block-1, Class = dgCMatrix
GeneScoreMat-Block-1: nRows = 2000, nCols = 10250
GeneScoreMat-Block-1: NonZeroEntries = 3684343, EntryRange = [ 0.022 , 229.174 ]
5 x 5 sparse Matrix of class "dgCMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                                 .                                 2.246                                  .                                     .                             .
ISG15                                .                                 .                                      .                                     .                             3.644
TNFRSF18                             4.124                             .                                      1.913                                 .                             .
TNFRSF4                              .                                 .                                      2.249                                 .                             .
CALML6                               .                                 .                                      .                                     .                             .

2022-03-24 10:11:37 : Block (1 of 1) : Imputing GeneScoreMatrix, 0.266 mins elapsed.

2022-03-24 10:11:37 : addImputeWeights Input-Parameters, Class = list

addImputeWeights Input-Parameters$ArchRProj: length = 1

addImputeWeights Input-Parameters$reducedDims: length = 1
[1] "IterativeLSI"

addImputeWeights Input-Parameters$dimsToUse: length = 30
[1] 1 2 3 4 5 6

addImputeWeights Input-Parameters$scaleDims: length = 0
NULL

addImputeWeights Input-Parameters$corCutOff: length = 1
[1] 0.75

addImputeWeights Input-Parameters$td: length = 1
[1] 3

addImputeWeights Input-Parameters$ka: length = 1
[1] 4

addImputeWeights Input-Parameters$sampleCells: length = 1
[1] 5000

addImputeWeights Input-Parameters$nRep: length = 1
[1] 2

addImputeWeights Input-Parameters$k: length = 1
[1] 15

addImputeWeights Input-Parameters$epsilon: length = 1
[1] 1

addImputeWeights Input-Parameters$useHdf5: length = 1
[1] TRUE

addImputeWeights Input-Parameters$randomSuffix: length = 1
[1] TRUE

addImputeWeights Input-Parameters$threads: length = 1
[1] 1

addImputeWeights Input-Parameters$seed: length = 1
[1] 1

addImputeWeights Input-Parameters$verbose: length = 1
[1] TRUE

addImputeWeights Input-Parameters$logFile: length = 1
[1] "ArchRLogs/ArchR-addGeneIntegrationMatrix-ac5dcbaebc-Date-2022-03-24_Time-10-11-07.log"

2022-03-24 10:11:37 : Computing Impute Weights Using Magic (Cell 2018), 0 mins elapsed.
2022-03-24 10:11:37 : Computing Partial Diffusion Matrix with Magic (1 of 2), 0 mins elapsed.
2022-03-24 10:11:46 : Computing Partial Diffusion Matrix with Magic (2 of 2), 0.158 mins elapsed.
2022-03-24 10:11:56 : Completed Getting Magic Weights!, 0.32 mins elapsed.

2022-03-24 10:11:56 : imputeMatrix Input-Parameters, Class = list

imputeMatrix Input-Parameters$mat: nRows = 2000, nCols = 10250
imputeMatrix Input-Parameters$mat: NonZeroEntries = 3684343, EntryRange = [ 0.022 , 229.174 ]
5 x 5 sparse Matrix of class "dgCMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                                 .                                 2.246                                  .                                     .                             .
ISG15                                .                                 .                                      .                                     .                             3.644
TNFRSF18                             4.124                             .                                      1.913                                 .                             .
TNFRSF4                              .                                 .                                      2.249                                 .                             .
CALML6                               .                                 .                                      .                                     .                             .

imputeMatrix Input-Parameters$threads: length = 1
[1] 14

imputeMatrix Input-Parameters$verbose: length = 1
[1] FALSE

imputeMatrix Input-Parameters$logFile: length = 1
[1] "ArchRLogs/ArchR-addGeneIntegrationMatrix-ac5dcbaebc-Date-2022-03-24_Time-10-11-07.log"

2022-03-24 10:11:56 : mat, Class = dgCMatrix
mat: nRows = 2000, nCols = 10250
mat: NonZeroEntries = 3684343, EntryRange = [ 0.022 , 229.174 ]
5 x 5 sparse Matrix of class "dgCMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                                 .                                 2.246                                  .                                     .                             .
ISG15                                .                                 .                                      .                                     .                             3.644
TNFRSF18                             4.124                             .                                      1.913                                 .                             .
TNFRSF4                              .                                 .                                      2.249                                 .                             .
CALML6                               .                                 .                                      .                                     .                             .

2022-03-24 10:11:56 : weightList, Class = SimpleList

weightList$w1: length = 1
[1] "HemeTutorial/ImputeWeights/Impute-Weights-ac2e421c7e-Date-2022-03-24_Time-10-11-37-Rep-1"

weightList$w2: length = 1
[1] "HemeTutorial/ImputeWeights/Impute-Weights-ac2e421c7e-Date-2022-03-24_Time-10-11-37-Rep-2"

2022-03-24 10:11:56 : Imputing Matrix (1 of 2), 0 mins elapsed.

2022-03-24 10:11:56 :

2022-03-24 10:11:56 :

2022-03-24 10:11:56 :
2022-03-24 10:12:05 : Imputing Matrix (2 of 2), 0.15 mins elapsed.

2022-03-24 10:12:05 :

2022-03-24 10:12:05 :

2022-03-24 10:12:05 :
2022-03-24 10:12:14 : Finished Imputing Matrix, 0.304 mins elapsed.

2022-03-24 10:12:14 : GeneScoreMat-Block-Impute-1, Class = dgeMatrix
GeneScoreMat-Block-Impute-1: nRows = 2000, nCols = 10250
GeneScoreMat-Block-Impute-1: NonZeroEntries = 20500000, EntryRange = [ 0 , 9.2441243685058 ]
5 x 5 Matrix of class "dgeMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                             0.7374332                         0.8325161                              0.4844393                         1.1526103                         0.8610469
ISG15                            0.6732143                         0.7313002                              0.4624542                         0.5490451                         0.8563325
TNFRSF18                         1.0225179                         0.9803957                              0.9646273                         0.2038027                         0.7665383
TNFRSF4                          0.2278850                         0.6113133                              0.4263749                         0.2423220                         0.4514156
CALML6                           2.0809765                         1.0019240                              1.1559091                         1.0064192                         1.2641407

2022-03-24 10:12:21 : Block (1 of 1) : Seurat FindTransferAnchors, 1.004 mins elapsed.

2022-03-24 10:14:14 : transferAnchors-1, Class = character

transferAnchors-1: length = 1
[1] "An AnchorSet object containing 1387 anchors between the reference and query Seurat objects. \n This can be used as input to TransferData."

2022-03-24 10:14:14 : rDSub-1, Class = matrix

2022-03-24 10:14:14 : rDSub-1, Class = array
                                            LSI1       LSI2       LSI3       LSI4      LSI5
scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1      -4.129696 -0.5075478 -0.9600561  2.4942942 0.9356967
scATAC_PBMC_R1#CGTTCCACAGAGATGC-1      -5.034763 -1.0731695  0.5978558  0.7327498 0.3442506
scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 -4.044496 -0.2141475 -1.5874410  1.6677084 0.3994403
scATAC_PBMC_R1#GTCTACCAGGATCCTT-1      -4.638248  1.9784523  1.1754489 -0.1663717 0.4113733
scATAC_PBMC_R1#ATGTCTTTCCATAACG-1      -4.851607 -1.5353766  0.8343660  0.0959220 0.3402740

rDSub-1: nRows = 10250, nCols = 30

2022-03-24 10:14:14 : Block (1 of 1) : Seurat TransferData Cell Group Labels, 2.885 mins elapsed.
2022-03-24 10:14:16 : Block (1 of 1) : Seurat TransferData Cell Names Labels, 2.926 mins elapsed.
2022-03-24 10:14:45 : Block (1 of 1) : Saving TransferAnchors Joint CCA, 3.411 mins elapsed.
2022-03-24 10:14:47 : Block (1 of 1) : Completed Integration, 3.434 mins elapsed.
2022-03-24 10:14:48 : Block (1 of 1) : Plotting Joint UMAP, 3.446 mins elapsed.
2022-03-24 10:15:19 : Completed Integration with RNA Matrix, 3.967 mins elapsed.

------- Completed

End Time : 2022-03-24 10:15:19
Elapsed Time Minutes = 4.20759488344193
Elapsed Time Hours = 0.0701266454988056

-------

Thanks!

Zepeng-Mu commented 2 years ago

Hi, I'm wondering whether this is because of some stochasticity in Seurat TransferData and it is run twice to get cellGroup and cellName? https://github.com/GreenleafLab/ArchR/blob/968e4421ce7187a8ac7ea1cf6077412126876d5f/R/RNAIntegration.R#L472-L479

rcorces commented 2 years ago

That seems like a likely culprit. Thanks for pointing that out. We will take a look. Sorry its taking some time but we will get to this eventually.