NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html
53 stars 25 forks source link

generate_bootstrap_plots gives error "in check_ewce_genelist_inputs" #85

Closed SuMi300 closed 8 months ago

SuMi300 commented 1 year ago

1. Bug description

Since you made some of the latest changes, the function 'generate_bootstrap_plots' keep giving me this error: "Error in check_ewce_genelist_inputs(sct_data = sct_data, hits = hits, : At least four genes which are present in the single cell dataset & background gene set are required to test for enrichment."

When I use the sct_data, my background list and my list of hits as input for the bootstrap_enrichment_test it works, no error. But with the exact same parameters as input for generate_bootstrap_plots, it gives the error.

I can't seem to figure out where the issue is, but I guess somehow either the background or the ctd_data get modified during processing and don't have comparable gene names anymore. Maybe by this part of what happens in the package?: _Standardising sct_data. Converting to sparse matrix. Aligning celltype names with standardisectd format.

Console output

> # Perform detailed analysis of EWCE enrichment in case of significant enrichment
> # Generate plots that show which proteins drive the cell type enrichment the most
> # Exact same input: hits list + bs list
> # And additional input: results from the bootstrap_enrichment_test
> generate_bootstrap_plots(sct_data=ctd, hits=hits$Genes_ONLYFIRST_MGI, bg=bg$Genes_ONLYFIRST_MGI, annotLevel=1,
+                          genelistSpecies="mouse", sctSpecies="mouse", output_species = "mouse", reps=20000,
+                          full_results=res.L1, listFileName="VignetteGraphs_L1")
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Gene table with 21,207 rows retrieved.
Returning all 21,207 genes from mouse.
Standardising sct_data.
Converting to sparse matrix.
Aligning celltype names with standardise_ctd format.
Error in check_ewce_genelist_inputs(sct_data = sct_data, hits = hits,  : 
  At least four genes which are present in the single cell dataset & background gene set are required to test for enrichment.

Expected behaviour

I would have expected the single cell dataset and background list to have identical gene names (mouse) so all background list proteins would be findable in the single cell datase and this function would work and give me graphs; I don't understand how the same input DOES work for when I do the bootstrap_enrichment_test just before.

2. Reproducible example

### Code 

# packages
library(readr)
library(readxl)
library(plyr)
library(data.table)
library(gplots)
library(tidyverse)
library(ggplot2)
library(RColorBrewer)

# EWCE packages
library(EWCE)
library(ewceData)
library(cowplot)
library(limma)

# Load scRNAseq cell type data, will appear as 'ctd'
load("CellTypeData_Hippocampus_Linnarsson_L1.rda")

# Load experimental data
# a background list of measured genes
# and a hits list of significantly (up)regulated genes
bg <- read_excel("data_in/DIA-NN_output_EVs_proteins_list_MGI.xlsx", sheet="R_input")
hits <- read_excel("data_in/new_significant_UP_chABC_TG(comparedtoWT)_MGI.xlsx", sheet="R_input")

# Perform EWCE (level1) on hits list + bg list
# make sure all species set to mouse; 20k reps for publication worthy
# will give cell type enrichment results @ q<0.05
res.L1 <- bootstrap_enrichment_test(sct_data=ctd, hits=hits$Genes_ONLYFIRST_MGI, annotLevel=1,
                                    bg=bg$Genes_ONLYFIRST_MGI, genelistSpecies="mouse",
                                    sctSpecies="mouse", output_species = "mouse", reps=20000)

# Perform detailed analysis of EWCE enrichment in case of significant enrichment
# Generate plots that show which proteins drive the cell type enrichment the most
# Exact same input: hits list + bs list
# And additional input: results from the bootstrap_enrichment_test
generate_bootstrap_plots(sct_data=ctd, hits=hits$Genes_ONLYFIRST_MGI, bg=bg$Genes_ONLYFIRST_MGI, annotLevel=1,
                         genelistSpecies="mouse", sctSpecies="mouse", output_species = "mouse", reps=20000,
                         full_results=res.L1, listFileName="VignetteGraphs_L1")

Data

ctd: list of 1, see screenshot image

bg: 5553 objects of 1 variable (1 excel column with mouse gene names(Chr type)) image

hits: 92 objects of 2 variables (2 excel columns, 1st with mouse gene names (Chr type), 2nd with fold change(num type)) image

3. Session info

(Add output of the R function utils::sessionInfo() below. This helps us assess version/OS conflicts which could be causing bugs.)

``` R version 4.2.3 (2023-03-15 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044) Matrix products: default locale: [1] LC_COLLATE=Dutch_Netherlands.utf8 LC_CTYPE=Dutch_Netherlands.utf8 LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C [5] LC_TIME=Dutch_Netherlands.utf8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.52.4 cowplot_1.1.1 ewceData_1.4.0 ExperimentHub_2.4.0 AnnotationHub_3.4.0 BiocFileCache_2.4.0 dbplyr_2.3.2 BiocGenerics_0.42.0 [9] EWCE_1.4.0 RNOmni_1.0.1 RColorBrewer_1.1-3 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1 [17] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0 gplots_3.1.3 data.table_1.14.8 plyr_1.8.8 readxl_1.4.2 [25] readr_2.1.4 loaded via a namespace (and not attached): [1] backports_1.4.1 lazyeval_0.2.2 orthogene_1.2.1 BiocParallel_1.30.4 GenomeInfoDb_1.32.4 [6] digest_0.6.31 yulab.utils_0.0.6 htmltools_0.5.5 fansi_1.0.4 magrittr_2.0.3 [11] memoise_2.0.1 tzdb_0.3.0 Biostrings_2.64.1 matrixStats_0.63.0 R.utils_2.12.2 [16] timechange_0.2.0 colorspace_2.1-0 blob_1.2.4 rappdirs_0.3.3 xfun_0.38 [21] callr_3.7.3 crayon_1.5.2 RCurl_1.98-1.12 jsonlite_1.8.4 ape_5.7-1 [26] glue_1.6.2 gtable_0.3.3 zlibbioc_1.42.0 XVector_0.36.0 HGNChelper_0.8.1 [31] DelayedArray_0.22.0 clipr_0.8.0 R.cache_0.16.0 car_3.1-2 SingleCellExperiment_1.18.1 [36] abind_1.4-5 scales_1.2.1 DBI_1.1.3 rstatix_0.7.2 Rcpp_1.0.10 [41] viridisLite_0.4.1 xtable_1.8-4 gridGraphics_0.5-1 tidytree_0.4.2 bit_4.0.5 [46] stats4_4.2.3 htmlwidgets_1.6.2 httr_1.4.5 ellipsis_0.3.2 pkgconfig_2.0.3 [51] R.methodsS3_1.8.2 utf8_1.2.3 ggplotify_0.1.0 tidyselect_1.2.0 rlang_1.1.0 [56] reshape2_1.4.4 later_1.3.0 AnnotationDbi_1.58.0 munsell_0.5.0 BiocVersion_3.15.2 [61] cellranger_1.1.0 tools_4.2.3 cachem_1.0.7 cli_3.6.1 generics_0.1.3 [66] RSQLite_2.3.1 broom_1.0.4 evaluate_0.20 fastmap_1.1.1 yaml_2.3.7 [71] ggtree_3.4.4 processx_3.8.0 knitr_1.42 babelgene_22.9 bit64_4.0.5 [76] fs_1.6.1 caTools_1.18.2 KEGGREST_1.36.3 gprofiler2_0.2.1 nlme_3.1-162 [81] mime_0.12 R.oo_1.25.0 aplot_0.1.10 compiler_4.2.3 rstudioapi_0.14 [86] plotly_4.10.1 filelock_1.0.2 curl_5.0.0 png_0.1-8 interactiveDisplayBase_1.34.0 [91] ggsignif_0.6.4 reprex_2.0.2 treeio_1.20.2 homologene_1.4.68.19.3.27 stringi_1.7.12 [96] ps_1.7.4 lattice_0.20-45 Matrix_1.5-4 styler_1.9.1 vctrs_0.6.1 [101] pillar_1.9.0 lifecycle_1.0.3 BiocManager_1.30.20 bitops_1.0-7 httpuv_1.6.9 [106] patchwork_1.1.2 GenomicRanges_1.48.0 R6_2.5.1 promises_1.2.0.1 KernSmooth_2.23-20 [111] IRanges_2.30.1 codetools_0.2-19 gtools_3.9.4 SummarizedExperiment_1.26.1 withr_2.5.0 [116] S4Vectors_0.34.0 GenomeInfoDbData_1.2.8 parallel_4.2.3 hms_1.1.3 grid_4.2.3 [121] ggfun_0.0.9 rmarkdown_2.21 MatrixGenerics_1.8.1 carData_3.0-5 ggpubr_0.6.0 [126] Biobase_2.56.0 shiny_1.7.4 ```
Al-Murphy commented 1 year ago

Hey! That seems odd, can you attach the rda dataset and excel you use in the example so I can try to reproduce it?

SuMi300 commented 1 year ago

Github_data.zip

Al-Murphy commented 1 year ago

Thanks for this! I can replicate the issue and have added a fix in live version v1.8.2 and dev version v1.9.2. The code will run with both now. The live version should be available in bioconductor after a few days but for now, feel free to install the dev version (master branch) directly from github. The issue was that we didn't pass all parameters to check_ewce_genelist_inputs in the generate_bootstrap_plots so some default values were used whereas in bootstrap_enrichment_test all parameters were passed.

Give it a try and let me know if there are any issues?

Cheers, Alan.

SuMi300 commented 1 year ago

Dear Alan,

A (very) belated thank you for looking at the issue. I have gotten back to it myself last week, and after updating many things in R (as always this takes time), I was very happy to see the above issue was fixed!

I have one remark about the code still: as I understand it, it's supposed to only plot (bootstrap)graphs for significant cell types? However, when I run it, it does still plot all cell types graphs in the end output. Even though it does tell me while running only 2 cell types are significant and remain.

Could you still have a look at that?

Thanks in advance, Cheers,

Suzanne

bschilder commented 11 months ago

I have one remark about the code still: as I understand it, it's supposed to only plot (bootstrap)graphs for significant cell types? However, when I run it, it does still plot all cell types graphs in the end output. Even though it does tell me while running only 2 cell types are significant and remain.

Hi @SuMi300 you are correct about this. This feature stopped working when I refactored the code recently. But I've just added it back in and seems to working now. You can install the latest version on GitHub. Apologies for that, and thanks for bringing it to our attention.