results from enrichment analysis differs each run

zhezhenwang commented 2 years ago

Hi, I'm running a motif enrichment analysis against the background, using the code in the vignette, and it seems calcBinnedMotifEnrR gives me a different result each time I run it. Is this expected?

machlabd commented 2 years ago

Hello @zhezhenwang! Assuming none of the other parameters have changed between runs, you should get the same results unless you specify background = 'genome'. Is that the case? In that mode, background regions with GC composition similar to the foreground are randomly sampled from the genome, so the results vary from run to run. You can make the analysis reproducible by setting a seed as shown here https://fmicompbio.github.io/monaLisa/articles/monaLisa.html#vsgenome. The seed is set in the BiocParallel parameter object that is passed to calcBinnedMotifEnrR. Let me know if you still have problems!
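
To illustrate why the seed matters, here is a minimal, language-agnostic sketch in Python of drawing a GC-matched random background. It is not monaLisa's implementation: the function names, the tolerance `tol`, and the toy sequences are all assumptions made for illustration. The point is only that the same seed gives the same background set, while a different seed gives a different one.

```python
import random


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def sample_background(candidates, foreground, n, seed=None, tol=0.1):
    """Randomly pick n candidates whose GC fraction lies within tol
    of the mean foreground GC fraction (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG: the seed fully determines the draw
    target = sum(gc_fraction(s) for s in foreground) / len(foreground)
    pool = [s for s in candidates if abs(gc_fraction(s) - target) <= tol]
    return rng.sample(pool, n)


# Toy data (assumption): uniform random "genome" tiles, GC-rich foreground.
gen = random.Random(0)
candidates = ["".join(gen.choices("ACGT", k=50)) for _ in range(2000)]
foreground = ["".join(gen.choices("ACGT", weights=[1, 2, 2, 1], k=50))
              for _ in range(30)]

run_a = sample_background(candidates, foreground, n=20, seed=42)
run_b = sample_background(candidates, foreground, n=20, seed=42)
run_c = sample_background(candidates, foreground, n=20, seed=7)

print(run_a == run_b)  # same seed, same background set
print(run_a == run_c)  # different seed, different background set
```

Any enrichment statistic computed against `run_a` and `run_c` can differ simply because the background sequences differ, which is the run-to-run variation described above.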

zhezhenwang commented 2 years ago

Hi @machlabd, thanks for the prompt reply! Yes, I'm using background = 'genome'. I'm wondering if there is a way to measure the enrichment in a more stable manner. I ran the analysis several times, without setting a seed, and saw very different results. We expect the KLF family to show up: in some runs the results are full of KLF motifs, and in others there are none.

So that makes me wonder whether I can trust my results. I know this is also the case for HOMER, which uses random sequences as the background to test against, so it is probably not an easy problem to solve, but I wanted to ask in case you have any thoughts on this.

Aside from that, the tool is working great and is easy to implement! Thanks for the hard work!

machlabd commented 2 years ago

Hello! Yes, having a reasonable set of background regions is important, and the results can indeed differ a lot if the background set is not a good match to the foreground in terms of nucleotide composition. Unfortunately there is no single easy answer other than trying to define a reasonable background set manually. I would pay attention to the following:

1) When using background = 'genome', do you see a warning message saying that the selected background regions do not match the foreground well in GC composition?
2) Is it possible to limit the random sampling of background to parts of the genome that you think make more sense in terms of nucleotide composition similarity to your foreground set? If so, you can specify this in the genome.regions argument of calcBinnedMotifEnrR.
3) Make some diagnostic plots to look at overall dinucleotide composition differences between foreground and background regions. In monaLisa we provide the plotBinDiagnostics function. Since you are using background = 'genome', you can specify the bins argument as a factor of the same length as seqs with two levels: foreground and background. I would compare this plot to the motif enrichments you get. For example, if the enriched motifs are GC rich and your background regions are GC poor, it is worth making sure your background sequences are similar in GC composition to the foreground, and then checking whether GC-rich motifs are still enriched after accounting for this.
4) Try manually defining background regions using the nullranges package mentioned in the vignette https://fmicompbio.github.io/monaLisa/articles/monaLisa.html#vsgenome and use them as a second bin in calcBinnedMotifEnrR, which will give you more control over the background that is used.
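
The composition check in point 3 can be sketched outside of monaLisa as well. The Python snippet below is only a rough illustration of the idea behind such a diagnostic, not a substitute for plotBinDiagnostics: the helper names and the toy sequences are assumptions. It computes dinucleotide frequencies for a foreground and a background set and reports where they differ most.

```python
from collections import Counter


def dinucleotide_freqs(seqs):
    """Relative frequency of each overlapping dinucleotide across seqs."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    total = sum(counts.values())
    return {dinuc: n / total for dinuc, n in counts.items()}


def composition_gap(foreground, background):
    """Foreground minus background frequency for every observed dinucleotide
    (hypothetical helper; positive means over-represented in the foreground)."""
    fg = dinucleotide_freqs(foreground)
    bg = dinucleotide_freqs(background)
    keys = sorted(set(fg) | set(bg))
    return {d: fg.get(d, 0.0) - bg.get(d, 0.0) for d in keys}


# Toy sequences (assumption): a GC-rich foreground and a poorly matched background.
foreground = ["GCGCGGCC", "CCGGGCGC"]
background = ["ATATTAAT", "GCGCATAT"]

gap = composition_gap(foreground, background)
# Show the three dinucleotides with the largest absolute difference.
for dinuc, diff in sorted(gap.items(), key=lambda kv: -abs(kv[1]))[:3]:
    print(dinuc, round(diff, 3))
```

Large gaps for GC-containing dinucleotides would suggest that GC-rich motifs may appear enriched because of the background choice rather than the biology.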

I hope this helps in some way. I'm glad the tool is helpful for you and thanks for your feedback!

machlabd commented 2 years ago

Hi @zhezhenwang, I will close this issue for now. Feel free to re-open it if you have more questions.

fmicompbio / monaLisa

results from enrichment analysis differs each run #55