NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html
53 stars 25 forks source link

Errors when running a conditional analysis on level 2 data #93

Closed JoshHarveyGit closed 3 months ago

JoshHarveyGit commented 3 months ago

I'm running into errors when attemping to run a conditional enrichment test at level 2 controlling for a particular cell type. This appears to be an issue even when using the example data in EWCE. An example of the code and errors using the example data is as follows

testCtd <- ewceData::ctd()
exampleGeneSet <- rownames(testCtd[[2]]$specificity)[1:50]

# This section runs fine
conditional_results_Level1 <- EWCE::bootstrap_enrichment_test(
  sct_data = testCtd,
  hits = exampleGeneSet,
  sctSpecies = "mouse",
  genelistSpecies = "mouse",
  reps = 100,
  annotLevel = 1, no_cores = 20,controlledCT = "microglia")

#But testing at level 2 

conditional_results_Level2 <- EWCE::bootstrap_enrichment_test(
  sct_data = testCtd,
  hits = exampleGeneSet,
  sctSpecies = "mouse",
  genelistSpecies = "mouse",
  reps = 100,
  annotLevel = 2, no_cores = 20,controlledCT = "Oligo4")
#Returns the error: 

# Error in check_bootstrap_args(sct_data = sct_data, hits = hits, annotLevel = annotLevel,  :
#                                 invalid celltype name passed in controlledCT. This argument is optional. Leave empty if you do not wish to control for a celltypes expression.

`

Interestingly if you control for a cell type present in level 1 but specify annotLevel = 2 it will run for longer but get stuck on the "Generating controlled bootstrap gene sets" step (code below). Appears to potentially be an indexing issue?

conditional_results_Level1_2 <- EWCE::bootstrap_enrichment_test(
  sct_data = testCtd,
  hits = exampleGeneSet,
  sctSpecies = "mouse",
  genelistSpecies = "mouse",
  reps = 100,
  annotLevel = 2, no_cores = 20,controlledCT = "microglia")
#Returns the error: 

#   =========== REPORT SUMMARY ===========
#   
#   16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
# 16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
# 16,482 intersect background genes used.
# Standardising CellTypeDataset
# Converting to sparse matrix.
# Converting to sparse matrix.
# Checking gene list inputs.
# Returning 16,482 unique genes from the user-supplied bg.
# Standardising sct_data.
# Converting gene list input to standardised human genes.
# Running without gene size control.
# 45 hit genes remain after filtering.
# Generating controlled bootstrap gene sets.
# Error in check_generate_controlled_bootstrap_geneset(controlledCT = controlledCT,  :
# ERROR: not all controlledCT are in colnames(sct_data[[annotLevel]]$specificity)
`

For now I'm working around this by remaking my ctd dataset with the level 2 annotations I'm interested in as level 1, but just flagging up incase.

Al-Murphy commented 3 months ago

Thanks for this @JoshHarveyGit, it is definitely a bug resulting from card coding level one in the check_bootstrap_args() function - not sure how this wasn't captured in any unit tests/previous checks. I've fixed this i the push for v1.11.5, you can get this from github master branch now or it will be up on bioc devel in a few days. Feel free to reopen if this persists but its working for me now:

> conditional_results_Level2 <- EWCE::bootstrap_enrichment_test(
+   sct_data = testCtd,
+   hits = exampleGeneSet,
+   sctSpecies = "mouse",
+   genelistSpecies = "mouse",
+   reps = 100,
+   annotLevel = 2, no_cores = 20,controlledCT = "Oligo4")
20 core(s) assigned as workers (-8 reserved).
Generating gene background for mouse x mouse ==> human
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- mouse
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Gene table with 21,207 rows retrieved.
Returning all 21,207 genes from mouse.
--
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
21,207 genes extracted.
Converting mouse ==> human orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Checking for genes without orthologs in human.
Extracting genes from input_gene.
17,355 genes extracted.
Extracting genes from ortholog_gene.
17,355 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   4,725 / 21,207 (22%)
Total genes remaining after convert_orthologs :
   16,482 / 21,207 (78%)
--

=========== REPORT SUMMARY ===========

16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
16,482 intersect background genes used.
Standardising CellTypeDataset
Checking gene list inputs.
Converting gene list input to standardised human genes.
Running without gene size control.
45 hit gene(s) remain after filtering.
Generating controlled bootstrap gene sets.
Computing gene scores.
Using previously sampled genes.
Computing gene counts.
Testing for enrichment in 48 cell types...
Sorting results by p-value.
Computing BH-corrected q-values.
6 significant cell type enrichment results @ q<0.05 : 

  CellType annotLevel p fold_change sd_from_mean q
1    Int12          2 0    4.021952    15.311858 0
2    Int16          2 0    2.427964     8.845397 0
3    Int11          2 0    1.965309     5.812144 0
4     Int7          2 0    2.200507     5.682438 0
5    Int15          2 0    1.635682     4.218604 0
6    Int10          2 0    1.768792     3.994792 0