aertslab / RcisTarget

RcisTarget: Transcription factor binding motif enrichment
34 stars 9 forks source link

No difference in enrichedGenes when using background #25

Closed gdagstn closed 2 years ago

gdagstn commented 2 years ago

Hi, first of all congratulations for the great tool!

I have run into what looks like a bug: when I supply a background for enrichment, the resulting nEnrGenes rankAtMax and enrichedGenes fields never change in the results table.

I tried it on my data but I see that even using the example in the vignette the problem is the same.

Here is a small reprex taken directly from the vignette:

library(RcisTarget)

featherFilePath <- "hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather"

txtFile <- paste(file.path(system.file('examples', package='RcisTarget')),"hypoxiaGeneSet.txt", sep="/")
geneSets <- list(hypoxia=read.table(txtFile, stringsAsFactors=FALSE)[,1])

txtFile <- paste(file.path(system.file('examples', package='RcisTarget')),"randomGeneSet.txt", sep="/") # for the toy example we will use a few random genes
background <- read.table(txtFile, stringsAsFactors=FALSE)[,1]

background <- unique(c(geneSets$hypoxia, background))

rankingsDb <- importRankings(featherFilePath, columns=background)
bgRanking <- reRank(rankingsDb) 

test <- cisTarget(geneSets, bgRanking, aucMaxRank=0.03*getNumColsInDB(bgRanking))

length(unique(test$nEnrGenes)) == 1

# TRUE

length(unique(test$enrichedGenes)) == 1

# TRUE

When not using a background it works fine:

rankingsDb <- importRankings(featherFilePath)

test <- cisTarget(geneSets, rankingsDb, aucMaxRank=0.03*getNumColsInDB(bgRanking))

length(unique(test$nEnrGenes))==1

#FALSE

length(unique(test$enrichedGenes))==1

#FALSE

Is this the intended behaviour, and if it is, why?

Session Info:

R version 4.1.2 (2021-11-01)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/giuseppe.ad/.conda/envs/singlecell/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RcisTarget_1.14.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8                  lattice_0.20-45            
 [3] zoo_1.8-9                   arrow_7.0.0                
 [5] png_0.1-7                   Biostrings_2.62.0          
 [7] assertthat_0.2.1            digest_0.6.29              
 [9] utf8_1.2.2                  mime_0.12                  
[11] IRdisplay_1.1.0.9000        R6_2.5.1                   
[13] GenomeInfoDb_1.30.1         repr_1.1.4.9000            
[15] stats4_4.1.2                RSQLite_2.2.10             
[17] evaluate_0.15               httr_1.4.2                 
[19] pillar_1.7.0                zlibbioc_1.40.0            
[21] rlang_1.0.1                 uuid_1.0-3                 
[23] data.table_1.14.2           annotate_1.72.0            
[25] blob_1.2.2                  S4Vectors_0.32.3           
[27] R.utils_2.11.0              R.oo_1.24.0                
[29] Matrix_1.4-0                AUCell_1.16.0              
[31] RCurl_1.98-1.6              bit_4.0.4                  
[33] DelayedArray_0.20.0         shiny_1.7.1                
[35] compiler_4.1.2              httpuv_1.6.5               
[37] pkgconfig_2.0.3             BiocGenerics_0.40.0        
[39] base64enc_0.1-3             htmltools_0.5.2            
[41] tidyselect_1.1.2            KEGGREST_1.34.0            
[43] SummarizedExperiment_1.24.0 tibble_3.1.6               
[45] GenomeInfoDbData_1.2.7      matrixStats_0.61.0         
[47] IRanges_2.28.0              XML_3.99-0.9               
[49] fansi_1.0.2                 crayon_1.5.0               
[51] dplyr_1.0.8                 later_1.3.0                
[53] bitops_1.0-7                R.methodsS3_1.8.1          
[55] grid_4.1.2                  jsonlite_1.8.0             
[57] xtable_1.8-4                GSEABase_1.56.0            
[59] lifecycle_1.0.1             DBI_1.1.2                  
[61] magrittr_2.0.2              graph_1.72.0               
[63] cli_3.2.0                   cachem_1.0.6               
[65] XVector_0.34.0              promises_1.2.0.1           
[67] ellipsis_0.3.2              generics_0.1.2             
[69] vctrs_0.3.8                 IRkernel_1.3.0.9000        
[71] tools_4.1.2                 bit64_4.0.5                
[73] Biobase_2.54.0              glue_1.6.2                 
[75] purrr_0.3.4                 MatrixGenerics_1.6.0       
[77] fastmap_1.1.0               AnnotationDbi_1.56.2       
[79] GenomicRanges_1.46.1        memoise_2.0.1              
[81] pbdZMQ_0.3-7               
s-aibar commented 2 years ago

Thank you for the report. Indeed, that is not the expected behavior.

That happens because the geneErnMaxRank argument is set to 5000 by default (which is too big for this database, which contains ~600 genes) and the enrichment method to "aprox" (which takes the 50-running average).

I have now added some extra check for these parameters in the adequate functions, and updated the vignette accordingly. This new version will be available with the next Bioconductor release (probably by the end of the month).

Regards, Sara