Gene set scores correlate w/ cell content

dpcook commented 4 years ago

Hi there. Thanks for the work on this package! I love the incorporation of autocorrelation.

I've been looking into the scores produced by Vision (Seurat object as input) and have found that the scores correlate fairly well with total UMI counts.

vision <- Vision(seurat[[i]],
                   signatures = signatures,
                   sig_gene_threshold = 0.05,
                   projection_methods=NULL)

I noticed that my signatures were all correlating with each other, so I checked their correlation with total UMI per cell and found that many correlate well with it.
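For anyone wanting to reproduce this kind of check without the original data, here is a minimal sketch in base R. Everything in it is simulated: `total_umi` stands in for `seurat$nCount_RNA`, and `scores` is a made-up cells x signatures matrix that is deliberately built to track depth. It is not output from Vision.

```r
# Simulated stand-in data; none of this comes from the dataset in this thread.
set.seed(1)
n_cells <- 200
n_sigs  <- 5

# Hypothetical total UMI per cell (plays the role of seurat$nCount_RNA).
total_umi <- round(rlnorm(n_cells, meanlog = 9, sdlog = 0.4))

# Hypothetical cells x signatures score matrix in which every score
# partly tracks sequencing depth, mimicking the behaviour described above.
scores <- sapply(seq_len(n_sigs), function(i) {
  0.5 * as.numeric(scale(log(total_umi))) + rnorm(n_cells)
})
colnames(scores) <- paste0("SIG_", seq_len(n_sigs))

# Pearson correlation of each signature with total UMI (one value per signature).
umi_cor <- cor(scores, total_umi)[, 1]

# Pairwise correlation between signatures, to see whether they all co-vary.
sig_cor <- cor(scores)

print(round(umi_cor, 2))
```

With real data, `scores` would be replaced by the matrix returned from `getSignatureScores(vision)` and `total_umi` by the per-cell UMI totals.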

And this seems to be related to the total size of the signature:

I only mention this because I believe the default sig_norm_method is supposed to deal with this.
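To make concrete what "dealing with this" could look like, here is a rough sketch of one common approach: standardizing each cell across genes before averaging over the signature genes. Whether this matches what `sig_norm_method` does internally is an assumption on my part, and the data are simulated, so treat it as an illustration of the idea only.

```r
# Simulated data; this is a sketch of per-cell standardisation in general,
# not a reproduction of Vision's internal normalisation.
set.seed(2)
n_genes <- 500
n_cells <- 100

# Hypothetical genes x cells matrix where cells differ strongly in depth.
depth <- rlnorm(n_cells, meanlog = 0, sdlog = 0.5)
expr  <- matrix(rpois(n_genes * n_cells, lambda = rep(depth, each = n_genes) * 2),
                nrow = n_genes)
expr  <- log2(expr + 1)

# Standardise each cell (column) to mean 0 and sd 1 across genes.
# scale() works column-wise, and columns are cells here.
z_expr <- scale(expr)

# Signature score = mean over a hypothetical set of signature genes.
sig_genes <- sample(n_genes, 50)
raw_score <- colMeans(expr[sig_genes, ])
z_score   <- colMeans(z_expr[sig_genes, ])

# Compare how strongly each version of the score tracks depth.
depth_cor <- c(raw = cor(raw_score, depth), znorm = cor(z_score, depth))
print(round(depth_cor, 2))
```

In this idealized simulation all genes respond to depth in the same way, so the standardized score loses most of its depth dependence. Real data need not behave this way, which is what the observations in this issue suggest.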

deto commented 4 years ago

You're correct - the sig_norm_method was designed to deal with this and so I'm a bit surprised, and would like to understand what's going on here.

I see you're using a Seurat object as input - what preprocessing steps had you run so far on that object? Are all the genes present, or has it been filtered yet?

I think this might be related to the sig_gene_threshold input that was added. If you have a chance, can you try re-running with sig_gene_threshold = .001 (the default), and see how that changes the correlation?

dpcook commented 4 years ago

Hey David, sorry about the delay. Just had a chance to revisit this. Here are some details to help explore this further.

So yes, I started with a Seurat object comprising a "pure" population of cancer cells, with the goal of using Vision for scoring and calculating autocorrelation. Given the purity of the population, I reasoned that meaningful genes should be detected in at least 5% of cells, so I increased sig_gene_threshold thinking it might make for cleaner results. Prior to Vision, the data were processed with a straightforward pipeline (QC filtering > SCTransform > PCA > UMAP > Cluster > Subset cancer cells > re-normalize with SCTransform > PCA > UMAP). The object still contains all genes.
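The effect of a detection threshold like this can be sketched in base R as follows. The count matrix is simulated, `keep_genes` is a hypothetical helper, and the `>=` comparison is an assumption; Vision's own filtering may differ in detail.

```r
# Simulated counts; keep_genes() is a hypothetical helper, not a Vision function.
set.seed(3)
n_genes <- 1000
n_cells <- 400

# Hypothetical genes x cells count matrix with gene-specific expression rates.
rate   <- rexp(n_genes, rate = 20)
counts <- matrix(rpois(n_genes * n_cells, lambda = rep(rate, times = n_cells)),
                 nrow = n_genes)

# Fraction of cells in which each gene has at least one count.
frac_detected <- rowMeans(counts > 0)

# Keep genes detected in at least the given fraction of cells
# (using >= is an assumption about how the cutoff is applied).
keep_genes <- function(frac, threshold) which(frac >= threshold)

kept_loose  <- keep_genes(frac_detected, 0.001)  # 0.1% of cells
kept_strict <- keep_genes(frac_detected, 0.05)   # 5% of cells

print(c(loose = length(kept_loose), strict = length(kept_strict), total = n_genes))
```

Raising the threshold can only shrink the gene pool, so each signature is scored on a subset of the genes it would otherwise use.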

Previous run with sig_gene_threshold=0.05 on MSigDB Hallmark gene sets:

> vision <- Vision(seurat,
+                  signatures = "~/Data/GeneLists/hallmark.genesets.v6.1.symbols.gmt",
+                  sig_gene_threshold = 0.05,
+                  projection_methods=NULL)
Importing counts from obj[["RNA"]]@counts ...
Normalizing to counts per 10,000...
Importing Meta Data from obj@meta.data ...
Importing latent space from Embeddings(obj, "pca") using first 50 components
Loading data from ~/Data/GeneLists/hallmark.genesets.v6.1.symbols.gmt ...

Using 9419/21862 genes detected in 5.00% of cells for signature analysis.
See the `sig_gene_threshold` input to change this behavior.

Adding Visualization: Seurat_pca
Adding Visualization: Seurat_umap
> vision <- analyze(vision)
Beginning Analysis

Clustering cells...completed

Projecting data into 2 dimensions...

Evaluating signature scores on cells...

  |======================================================================================| 100%, Elapsed 00:00
Evaluating signature-gene importance...

  |======================================================================================| 100%, Elapsed 00:02
Creating 5 background signature groups with the following parameters:
  sigSize sigBalance
1      20  1.0000000
2      60  1.0000000
3     116  1.0000000
4     163  1.0000000
5     192  0.6053674
  signatures per group: 3000
Computing KNN Cell Graph in the Latent Space...

Evaluating local consistency of signatures in latent space...

  |======================================================================================| 100%, Elapsed 00:00
  |======================================================================================| 100%, Elapsed 01:07
  |======================================================================================| 100%, Elapsed 01:45
  |======================================================================================| 100%, Elapsed 00:01
Clustering signatures...

fitting ...
  |=====================================================================================================| 100%
Computing differential signature tests...

  |======================================================================================| 100%, Elapsed 00:00
  |======================================================================================| 100%, Elapsed 00:03
Computing correlations between signatures and latent space components...

  |======================================================================================| 100%, Elapsed 00:01
Analysis Complete!

> scores <- getSignatureScores(vision)
> hist(cor(scores, seurat$nCount_RNA), breaks=50)

Now re-running with default sig_gene_threshold:

> vision <- Vision(seurat,
+                  signatures = "~/Data/GeneLists/hallmark.genesets.v6.1.symbols.gmt",
+                  projection_methods=NULL)
Importing counts from obj[["RNA"]]@counts ...
Normalizing to counts per 10,000...
Importing Meta Data from obj@meta.data ...
Importing latent space from Embeddings(obj, "pca") using first 50 components
Loading data from ~/Data/GeneLists/hallmark.genesets.v6.1.symbols.gmt ...

Using 18828/21862 genes detected in 0.10% of cells for signature analysis.
See the `sig_gene_threshold` input to change this behavior.

Adding Visualization: Seurat_pca
Adding Visualization: Seurat_umap
> vision <- analyze(vision)
Beginning Analysis

Clustering cells...completed

Projecting data into 2 dimensions...

Evaluating signature scores on cells...

  |============================================================| 100%, Elapsed 00:00
Evaluating signature-gene importance...

  |============================================================| 100%, Elapsed 00:02
Creating 5 background signature groups with the following parameters:
  sigSize sigBalance
1      33  1.0000000
2      57  1.0000000
3      97  1.0000000
4     183  1.0000000
5     301  0.5428075
  signatures per group: 3000
Computing KNN Cell Graph in the Latent Space...

Evaluating local consistency of signatures in latent space...

  |============================================================| 100%, Elapsed 00:00
  |============================================================| 100%, Elapsed 00:36
  |============================================================| 100%, Elapsed 00:37
  |============================================================| 100%, Elapsed 00:00
Clustering signatures...

fitting ...
  |===========================================================================| 100%
Computing differential signature tests...

  |============================================================| 100%, Elapsed 00:00
  |============================================================| 100%, Elapsed 00:02
Computing correlations between signatures and latent space components...

  |============================================================| 100%, Elapsed 00:01
Analysis Complete!

> scores <- getSignatureScores(vision)
> hist(cor(scores, seurat$nCount_RNA), breaks=50)

Doesn't seem to improve the issue.

Looked at the distribution of the scores:

And then at the relationship between the mean score and how strongly the signature correlated with UMI (thinking that maybe it only happened when scores were low or something).
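A comparison of this kind can be put together as a per-signature summary table. The sketch below uses simulated scores and UMI totals; the names `summary_df`, `offsets`, and `slopes` are made up for illustration.

```r
# Simulated data; with real data, scores would come from getSignatureScores(vision)
# and total_umi from the per-cell UMI totals.
set.seed(4)
n_cells <- 300
n_sigs  <- 20

total_umi <- rlnorm(n_cells, meanlog = 9, sdlog = 0.4)

# Hypothetical cells x signatures scores with varying offsets and depth dependence.
offsets <- runif(n_sigs, -1, 1)
slopes  <- runif(n_sigs, 0, 1)
z_depth <- as.numeric(scale(log(total_umi)))
scores  <- sapply(seq_len(n_sigs), function(i) {
  offsets[i] + slopes[i] * z_depth + rnorm(n_cells)
})
colnames(scores) <- paste0("SIG_", seq_len(n_sigs))

# One row per signature: its mean score and its correlation with total UMI.
summary_df <- data.frame(
  signature  = colnames(scores),
  mean_score = colMeans(scores),
  umi_cor    = cor(scores, total_umi)[, 1]
)

print(head(summary_df))
```

Plotting `mean_score` against `umi_cor` then gives one point per signature, which makes it easy to see whether the depth correlation is confined to low-scoring signatures.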

In case you want to look at this specific example, I've uploaded this Seurat object and the hallmark gene set to a Google Drive you can access here

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.5.2  forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2     purrr_0.3.4     readr_1.3.1     tidyr_1.1.1     tibble_3.0.3    ggplot2_3.3.2  
[10] tidyverse_1.3.0 VISION_2.1.0    Seurat_3.2.0   

loaded via a namespace (and not attached):
  [1] Rtsne_0.15            colorspace_1.4-1      deldir_0.1-28         ellipsis_0.3.1        mclust_5.4.6          fs_1.5.0             
  [7] rstudioapi_0.11       spatstat.data_1.4-3   farver_2.0.3          leiden_0.3.3          listenv_0.8.0         ggrepel_0.8.2        
 [13] fansi_0.4.1           lubridate_1.7.9       xml2_1.3.2            codetools_0.2-16      splines_4.0.2         logging_0.10-108     
 [19] knitr_1.29            polyclip_1.10-0       jsonlite_1.7.0        broom_0.7.0           ica_1.0-2             cluster_2.1.0        
 [25] dbplyr_1.4.4          png_0.1-7             uwot_0.1.8            shiny_1.5.0           wordspace_0.2-6       sctransform_0.2.1    
 [31] plumber_0.4.6         compiler_4.0.2        httr_1.4.2            backports_1.1.8       assertthat_0.2.1      Matrix_1.2-18        
 [37] fastmap_1.0.1         lazyeval_0.2.2        cli_2.0.2             later_1.1.0.1         htmltools_0.5.0       tools_4.0.2          
 [43] rsvd_1.0.3            igraph_1.2.5          gtable_0.3.0          glue_1.4.1            RANN_2.6.1            reshape2_1.4.4       
 [49] Rcpp_1.0.5            spatstat_1.64-1       cellranger_1.1.0      vctrs_0.3.2           ape_5.4-1             nlme_3.1-148         
 [55] lmtest_0.9-37         xfun_0.16             globals_0.12.5        rvest_0.3.6           mime_0.9              miniUI_0.1.1.1       
 [61] lifecycle_0.2.0       irlba_2.3.3           goftest_1.2-2         future_1.18.0         MASS_7.3-52           zoo_1.8-8            
 [67] scales_1.1.1          loe_1.1               hms_0.5.3             promises_1.1.1        spatstat.utils_1.17-0 parallel_4.0.2       
 [73] RColorBrewer_1.1-2    reticulate_1.16       pbapply_1.4-3         gridExtra_2.3         rpart_4.1-15          fastICA_1.2-2        
 [79] stringi_1.4.6         permute_0.9-5         rlang_0.4.7           pkgconfig_2.0.3       matrixStats_0.56.0    lattice_0.20-41      
 [85] ROCR_1.0-11           tensor_1.5            labeling_0.3          patchwork_1.0.1       htmlwidgets_1.5.1     cowplot_1.0.0        
 [91] tidyselect_1.1.0      RcppAnnoy_0.0.16      plyr_1.8.6            magrittr_1.5          R6_2.4.1              generics_0.0.2       
 [97] DBI_1.1.0             withr_2.2.0           pillar_1.4.6          haven_2.3.1           mgcv_1.8-31           fitdistrplus_1.1-1   
[103] survival_3.2-3        abind_1.4-5           future.apply_1.6.0    modelr_0.1.8          crayon_1.3.4          utf8_1.1.4           
[109] KernSmooth_2.23-17    plotly_4.9.2.1        readxl_1.3.1          grid_4.0.2            data.table_1.13.0     blob_1.2.1           
[115] vegan_2.5-6           reprex_0.3.0          sparsesvd_0.2         digest_0.6.25         pbmcapply_1.5.0       xtable_1.8-4         
[121] httpuv_1.5.4          munsell_0.5.0         viridisLite_0.3.0     iotools_0.3-1        

YosefLab / VISION

Gene set scores correlate w/ cell content #86