NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html
53 stars 25 forks source link

`genelistSpecies` not inherited by `prepare_genesize_control_network()` in `bootstrap_enrichment_test()` #69

Closed RHReynolds closed 2 years ago

RHReynolds commented 2 years ago

1. Bug description

When using bootstrap_enrichment_test() with the argument geneSizeControl = TRUE a warning is given that genelistSpecieshas not been provided and the default human will be used. I believe this is because the genelistSpecies argument has not been specified/inherited in prepare_genesize_control_network() (lines 152-7 in https://github.com/NathanSkene/EWCE/blob/master/R/bootstrap_enrichment_test.R).

Console output

1 core(s) assigned as workers (111reserved).
Generating gene background for mouse x human ==> human
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Gene table with 21,207 rows retrieved.
Returning all 21,207 genes from mouse.
--
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
21,207 genes extracted.
Converting mouse ==> human orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Checking for genes without orthologs in human.
Extracting genes from input_gene.
17,355 genes extracted.
Extracting genes from ortholog_gene.
17,355 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 131 genes that have multiple input_gene per ortholog_gene.
Dropping 498 genes that have multiple ortholog_gene per input_gene.
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   4,725 / 21,207 (22%)
Total genes remaining after convert_orthologs :
   16,482 / 21,207 (78%)
--

=========== REPORT SUMMARY ===========

16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
--

=========== REPORT SUMMARY ===========

19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
16,482 intersect background genes used.
Standardising CellTypeDataset
Converting to sparse matrix.
Converting to sparse matrix.
Checking gene list inputs.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
Standardising sct_data.
Converting gene list input to standardised human genes.
Running with gene size control.
Warning: genelistSpecies not provided. Setting to 'human' by default.
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 61,461 rows retrieved.
Returning all 61,461 genes from human.
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 61,461 rows retrieved.
Returning all 61,461 genes from human.
61,461 human Ensembl IDs and 39,813 human Gene Symbols imported.
see ?ewceData and browseVignettes('ewceData') for documentation
loading from cache
Controlled bootstrapping network generated.
20 hit genes remain after filtering.
Computing summed proportions.
Testing for enrichment in 7 cell types...
Sorting results by p-value.
Computing BH-corrected q-values.
5 significant cell type enrichment results @ q<0.05 : 

              CellType annotLevel p fold_change sd_from_mean q
1 astrocytes_ependymal          1 0    2.043578     4.493685 0
2            microglia          1 0    3.107579     3.577835 0
3        pyramidal_CA1          1 0    1.479996     3.270410 0
4         pyramidal_SS          1 0    1.294647     3.073443 0
5     oligodendrocytes          1 0    1.447644     1.530268 0

Expected behaviour

I would have expected that the control network uses the same species as specified for my hit list.

2. Reproducible example

Code

I used the example provided in the manual page for bootstrap_enrichment_test(), but added in the argument to control for GC/length.

ctd <- ewceData::ctd()
reps <- 3
example_genelist <- ewceData::example_genelist()

# Bootstrap significance test, WITH control for transcript length or GC content
full_results <- EWCE::bootstrap_enrichment_test(
  sct_data = ctd,
  hits = example_genelist,
  reps = reps,
  annotLevel = 1,
  sctSpecies = "mouse",
  genelistSpecies = "human",
  geneSizeControl = TRUE
)

3. Session info

(Add output of the R function utils::sessionInfo() below. This helps us assess version/OS conflicts which could be causing bugs.)

``` R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.4 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] forcats_0.5.1 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 [7] tidyverse_1.3.1 ggplot2_3.3.6 readxl_1.4.0 stringr_1.4.0 here_1.0.1 ewceData_1.5.0 [13] ExperimentHub_2.5.0 AnnotationHub_3.5.0 BiocFileCache_2.5.0 dbplyr_2.1.1 BiocGenerics_0.43.0 EWCE_1.5.2 [19] RNOmni_1.0.0 loaded via a namespace (and not attached): [1] backports_1.4.1 plyr_1.8.7 lazyeval_0.2.2 orthogene_1.3.0 [5] BiocParallel_1.31.6 GenomeInfoDb_1.33.3 digest_0.6.29 yulab.utils_0.0.4 [9] htmltools_0.5.2 fansi_1.0.3 magrittr_2.0.3 memoise_2.0.1 [13] tzdb_0.3.0 limma_3.53.2 Biostrings_2.65.0 modelr_0.1.8 [17] matrixStats_0.62.0 vroom_1.5.7 colorspace_2.0-3 blob_1.2.3 [21] rvest_1.0.2 rappdirs_0.3.3 haven_2.5.0 xfun_0.31 [25] crayon_1.5.1 RCurl_1.98-1.6 jsonlite_1.8.0 ape_5.6-2 [29] glue_1.6.2 gtable_0.3.0 zlibbioc_1.43.0 XVector_0.37.0 [33] HGNChelper_0.8.1 DelayedArray_0.23.0 car_3.0-13 SingleCellExperiment_1.19.0 [37] abind_1.4-5 scales_1.2.0 DBI_1.1.2 rstatix_0.7.0 [41] Rcpp_1.0.8.3 viridisLite_0.4.0 xtable_1.8-4 gridGraphics_0.5-1 [45] tidytree_0.3.9 bit_4.0.4 stats4_4.2.0 htmlwidgets_1.5.4 [49] httr_1.4.3 ellipsis_0.3.2 farver_2.1.0 pkgconfig_2.0.3 [53] utf8_1.2.2 labeling_0.4.2 ggplotify_0.1.0 tidyselect_1.1.2 [57] rlang_1.0.2 reshape2_1.4.4 later_1.3.0 AnnotationDbi_1.59.1 [61] munsell_0.5.0 BiocVersion_3.16.0 cellranger_1.1.0 tools_4.2.0 [65] cachem_1.0.6 cli_3.3.0 generics_0.1.2 RSQLite_2.2.14 [69] broom_0.8.0 evaluate_0.15 fastmap_1.1.0 yaml_2.3.5 [73] ggtree_3.5.0 babelgene_22.3 knitr_1.39 bit64_4.0.5 [77] fs_1.5.2 KEGGREST_1.37.0 gprofiler2_0.2.1 nlme_3.1-157 [81] mime_0.12 aplot_0.1.4 xml2_1.3.3 compiler_4.2.0 [85] rstudioapi_0.13 plotly_4.10.0 filelock_1.0.2 curl_4.3.2 [89] png_0.1-7 interactiveDisplayBase_1.35.0 ggsignif_0.6.3 reprex_2.0.1 [93] treeio_1.21.0 homologene_1.4.68.19.3.27 stringi_1.7.6 lattice_0.20-45 [97] Matrix_1.4-1 vctrs_0.4.1 pillar_1.7.0 lifecycle_1.0.1 [101] BiocManager_1.30.18 data.table_1.14.2 bitops_1.0-7 httpuv_1.6.5 [105] patchwork_1.1.1 GenomicRanges_1.49.0 R6_2.5.1 promises_1.2.0.1 [109] IRanges_2.31.0 assertthat_0.2.1 SummarizedExperiment_1.27.1 rprojroot_2.0.3 [113] withr_2.5.0 S4Vectors_0.35.0 GenomeInfoDbData_1.2.8 parallel_4.2.0 [117] hms_1.1.1 grid_4.2.0 ggfun_0.0.6 rmarkdown_2.14 [121] MatrixGenerics_1.9.0 carData_3.0-5 ggpubr_0.4.0 Biobase_2.57.1 [125] shiny_1.7.1 lubridate_1.8.0 ```
Al-Murphy commented 2 years ago

Hey! Thanks for your note on this, I believe your are correct and I'll just confirm before pushing a fix on the dev branch of bioconductor (master branch on github)

@bschilder do you agree this is just an oversight? There is no reason not to pass genelistSpecies to prepare_genesize_control_network from bootstrap_enrichment_test? - I know you worked on this last so just want to confirm

bschilder commented 2 years ago

Hi @RHReynolds I believe you are indeed correct on this, thanks for reporting it!

@Al-Murphy that would be great if you could go ahead and push the fix.

Al-Murphy commented 2 years ago

Updated in v 1.5.3, @RHReynolds you can get this now from the github master branch or alternatively, the bioconductor devel version in the next few days (usually 3 days)