NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html
53 stars 25 forks source link

background not appropriate when the `sct_data` was previously converted from another species. #80

Closed bschilder closed 1 year ago

bschilder commented 1 year ago

1. Bug description

I just found an issue that can occur when the CTD has previously been converted from one species to another (e.g. mouse genes to human orthologs) and then fed into bootstrap_enrichment_test. The user will correctly set sctSpecies to "human", as the CTD is already in human gene format. But this same argument gets passed to orthogene::create_background, which intreprets the CTD as having originally come from a human, which is didn't. To create the appropriate background, you would need to take into account that the CTD originally comes from mouse.

Note that this is not a problem when the CTD is passed in as the original mouse genes and sctSpecies="mouse", bc the function internally does the mouse --> human conversion (and resassigning the sctSpecies variable) after creating the background. This how we demonstrate to use EWCE in the function example.

Expected behaviour

Appropriate background is chosen regardless of current ortholog format of the CTD.

2. Reproducible example

Code

sct_data <- ewceData::ctd()
sct_data2 <- EWCE::standardise_ctd(ctd = sct_data, input_species = "mouse", output_species = "human")
hits <- ewceData::example_genelist()

full_results <- EWCE::bootstrap_enrichment_test(
    sct_data = sct_data,
    hits = hits,
    reps = 10,
    annotLevel = 1,
    sctSpecies = "human",
    genelistSpecies = "human")

Console output

Because the background is wrong, the number of remaining hit genes is 0, and an error occurs.

1 core(s) assigned as workers (11 reserved).
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
Returning 19,129 unique genes from entire human genome.
Standardising CellTypeDataset
Checking gene list inputs.
Returning 19,129 unique genes from the user-supplied bg.
Standardising sct_data.
Converting gene list input to standardised human genes.
Running without gene size control.
0 hit gene(s) remain after filtering.
Computing gene scores.
Using previously sampled genes.
Computing gene counts.
Error in (function (..., sorted = TRUE, unique = FALSE)  : 
  attempt to set an attribute on NULL

3. Session info

``` R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Ventura 13.2.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] EWCE_1.7.2 ewceData_1.7.1 ExperimentHub_2.6.0 AnnotationHub_3.6.0 BiocFileCache_2.6.1 [6] dbplyr_2.3.2 BiocGenerics_0.44.0 RNOmni_1.0.1 loaded via a namespace (and not attached): [1] colorspace_2.1-0 ggtree_3.6.2 grr_0.9.5 [4] ggsignif_0.6.4 ellipsis_0.3.2 XVector_0.38.0 [7] GenomicRanges_1.50.2 aplot_0.1.10 rstudioapi_0.14 [10] ggpubr_0.6.0 bit64_4.0.5 interactiveDisplayBase_1.36.0 [13] AnnotationDbi_1.60.2 fansi_1.0.4 orthogene_1.5.2 [16] diffobj_0.3.5 codetools_0.2-19 cachem_1.0.7 [19] knitr_1.42 pkgload_1.3.2 jsonlite_1.8.4 [22] broom_1.0.4 annotate_1.76.0 png_0.1-8 [25] graph_1.76.0 shiny_1.7.4 BiocManager_1.30.20 [28] compiler_4.2.1 httr_1.4.5 backports_1.4.1 [31] lazyeval_0.2.2 Matrix_1.5-3 fastmap_1.1.1 [34] limma_3.54.2 cli_3.6.1 later_1.3.0 [37] htmltools_0.5.4 tools_4.2.1 gtable_0.3.3 [40] glue_1.6.2 GenomeInfoDbData_1.2.9 reshape2_1.4.4 [43] dplyr_1.1.1 rappdirs_0.3.3 Rcpp_1.0.10 [46] carData_3.0-5 Biobase_2.58.0 vctrs_0.6.1 [49] Biostrings_2.66.0 babelgene_22.9 ape_5.7-1 [52] nlme_3.1-162 xfun_0.37 stringr_1.5.0 [55] mime_0.12 lifecycle_1.0.3 rstatix_0.7.2 [58] XML_3.99-0.14 zlibbioc_1.44.0 scales_1.2.1 [61] MAST_1.24.1 promises_1.2.0.1 MatrixGenerics_1.10.0 [64] parallel_4.2.1 SummarizedExperiment_1.28.0 gprofiler2_0.2.1 [67] SingleCellExperiment_1.20.1 yaml_2.3.7 curl_5.0.0 [70] memoise_2.0.1 ggplot2_3.4.1 ggfun_0.0.9 [73] yulab.utils_0.0.6 stringi_1.7.12 RSQLite_2.3.0 [76] BiocVersion_3.16.0 S4Vectors_0.36.2 tidytree_0.4.2 [79] filelock_1.0.2 BiocParallel_1.32.6 GenomeInfoDb_1.34.9 [82] rlang_1.1.0 pkgconfig_2.0.3 matrixStats_0.63.0 [85] bitops_1.0-7 evaluate_0.20 lattice_0.20-45 [88] purrr_1.0.1 htmlwidgets_1.6.2 treeio_1.23.1 [91] patchwork_1.1.2 bit_4.0.5 tidyselect_1.2.0 [94] plyr_1.8.8 magrittr_2.0.3 R6_2.5.1 [97] IRanges_2.32.0 generics_0.1.3 DelayedArray_0.24.0 [100] DBI_1.1.3 withr_2.5.0 pillar_1.9.0 [103] KEGGREST_1.38.0 abind_1.4-5 RCurl_1.98-1.10 [106] tibble_3.2.1 homologene_1.4.68.19.3.27 crayon_1.5.2 [109] car_3.1-1 utf8_1.2.3 plotly_4.10.1 [112] rmarkdown_2.20.1 grid_4.2.1 data.table_1.14.8 [115] blob_1.2.4 digest_0.6.31 xtable_1.8-4 [118] HGNChelper_0.8.1 tidyr_1.3.0 httpuv_1.6.9 [121] gridGraphics_0.5-1 stats4_4.2.1 munsell_0.5.0 [124] viridisLite_0.4.1 ggplotify_0.1.0 ```
bschilder commented 1 year ago

Solution

I've added a new argument called sctSpecies_origin, which allows users to independently specify the current gene format of the CTD as well as the original species that the CTD came from.

This chooses the correct background (mouse + human) and keeps 17 gene hits.

Example

sct_data <- ewceData::ctd()
sct_data2 <- EWCE::standardise_ctd(ctd = sct_data, input_species = "mouse", output_species = "human")
hits <- ewceData::example_genelist()

full_results <- EWCE::bootstrap_enrichment_test(
    sct_data = sct_data2,
    hits = hits,
    reps = 10,
    annotLevel = 1,
    sctSpecies = "human",
    sctSpecies_origin = "mouse",
    genelistSpecies = "human")
1 core(s) assigned as workers (11 reserved).
Generating gene background for mouse x human ==> human
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- mouse
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Gene table with 21,207 rows retrieved.
Returning all 21,207 genes from mouse.
--
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
21,207 genes extracted.
Converting mouse ==> human orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Checking for genes without orthologs in human.
Extracting genes from input_gene.
17,355 genes extracted.
Extracting genes from ortholog_gene.
17,355 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   4,725 / 21,207 (22%)
Total genes remaining after convert_orthologs :
   16,482 / 21,207 (78%)
--

=========== REPORT SUMMARY ===========

16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- human
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
--

=========== REPORT SUMMARY ===========

19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
16,482 intersect background genes used.
Standardising CellTypeDataset
Checking gene list inputs.
Running without gene size control.
17 hit gene(s) remain after filtering.
Computing gene scores.
Using previously sampled genes.
Computing gene counts.
Testing for enrichment in 7 cell types...
Sorting results by p-value.
Computing BH-corrected q-values.
2 significant cell type enrichment results @ q<0.05 : 

              CellType annotLevel p fold_change sd_from_mean q
1            microglia          1 0    1.892713     4.874007 0
2 astrocytes_ependymal          1 0    1.212996     1.074574 0
bschilder commented 1 year ago

Just to confirm, this is the same results you get if you run the mapping internally:

sct_data <- ewceData::ctd() 
hits <- ewceData::example_genelist()

full_results <- EWCE::bootstrap_enrichment_test(
    sct_data = sct_data,
    hits = hits,
    reps = 10,
    annotLevel = 1,
    sctSpecies = "mouse",
    genelistSpecies = "human")
1 core(s) assigned as workers (11 reserved).
Generating gene background for mouse x human ==> human
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- mouse
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Gene table with 21,207 rows retrieved.
Returning all 21,207 genes from mouse.
--
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
21,207 genes extracted.
Converting mouse ==> human orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Checking for genes without orthologs in human.
Extracting genes from input_gene.
17,355 genes extracted.
Extracting genes from ortholog_gene.
17,355 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   4,725 / 21,207 (22%)
Total genes remaining after convert_orthologs :
   16,482 / 21,207 (78%)
--

=========== REPORT SUMMARY ===========

16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- human
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
--

=========== REPORT SUMMARY ===========

19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
16,482 intersect background genes used.
Standardising CellTypeDataset
Checking gene list inputs.
Running without gene size control.
17 hit gene(s) remain after filtering.
Computing gene scores.
Using previously sampled genes.
Computing gene counts.
Testing for enrichment in 7 cell types...
Sorting results by p-value.
Computing BH-corrected q-values.
2 significant cell type enrichment results @ q<0.05 : 

              CellType annotLevel p fold_change sd_from_mean q
1            microglia          1 0    1.977958     3.654319 0
2 astrocytes_ependymal          1 0    1.371805     2.678270 0

Indeed, the results are the same (in that the same celltypes are enriched). Increasing the bootstrap iterations would likely make the specific numbers more convergent.

bschilder commented 1 year ago

Note that this may have affected some of your analyses @KittyMurphy, depending on how you ran them