fmicompbio / monaLisa

binned motif enrichment analysis and visualisation
https://fmicompbio.github.io/monaLisa/
GNU General Public License v3.0

Impact of highly correlated motifs on the binned approach #56

Closed lengfei5 closed 2 years ago

lengfei5 commented 2 years ago

Hi there,

This is a really nice package and I like it a lot. In the tutorial you mentioned:

"If the user is interested in working with all correlated motifs, the binned approach is preferable as the motifs are independently tested for significance"

I assume that the significance is calculated by comparing the sample vs. the background. I am just wondering whether highly correlated motifs will also compromise the significance calculation. Thanks.

Best, Jingkui

mbstadler commented 2 years ago

Hi Jingkui

In the binned approach, each motif is analyzed separately and the resulting enrichments and raw p values are indeed independent of any other motif as mentioned in the vignette.

The presence of other motifs will affect the calculation of the FDR values from raw p values (multiple testing correction). This is however dependent on the number and overall distribution of p values and not directly related to the correlation between motifs.
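To illustrate the point about multiple testing correction, here is a minimal sketch in plain Python. It assumes the Benjamini-Hochberg procedure as the correction method; the helper `bh_adjust` and the p values are made up for illustration and are not taken from monaLisa. It shows that the adjusted value of one motif changes when another p value is added, even though its raw p value stays the same.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values (hypothetical helper, not monaLisa code)."""
    m = len(pvals)
    # Walk from the largest to the smallest raw p value.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # rank 1 = smallest raw p value
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p values for two motifs, then the same two plus a third motif.
two_motifs = bh_adjust([0.01, 0.04])
three_motifs = bh_adjust([0.01, 0.04, 0.5])

# The raw p value of the first motif is 0.01 in both cases,
# but its adjusted value differs because the number of tests changed.
print(two_motifs[0], three_motifs[0])
```

The change depends only on how many p values there are and how they are distributed, not on whether the added motif resembles the first one.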

Finally, correlated motifs will likely produce similar results, for example they will have similar enrichments across bins. In order to correctly interpret such results it is therefore helpful to look at the motifs that give rise to similar results, for example by using plotMotifHeatmaps(..., show_seqlogo = TRUE).

I hope that clarifies your issue.

lengfei5 commented 2 years ago

Dear Michael @mbstadler,

Great to hear from you and greetings from IMP in Vienna. Thank you for clarifying all those details; I do see your point regarding the dependency of the FDR calculation on other motifs. Sorry that I am still a bit puzzled: the raw p-value calculation probably also depends on other motifs, including the highly correlated ones, am I right?

If I understood correctly, the p-value of a specific motif is calculated roughly by looking at the counts in the sample and the background:

- n: occurrences of the specific motif in the sample
- N: total occurrences across all motifs in the sample
- nb: occurrences of the same motif in the background
- Nb: total occurrences across all motifs in the background

For an enriched motif, (n/N)/(nb/Nb) > 1, and the p-value will be calculated accordingly. Now imagine an extreme case in which we add an identical motif: the above becomes (n/(N+n))/(nb/(Nb+nb)), because the new motif is identical and affects the total counts in both the sample and the background.

It turns out that (n/(N+n))/(nb/(Nb+nb)) < (n/N)/(nb/Nb), i.e. the ratio is smaller than it was without the identical motif.
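The inequality can be checked with a small numeric sketch. The counts below are made up for illustration, and the check only verifies the arithmetic of the ratio described above; it says nothing about how monaLisa actually computes its p values.

```python
from fractions import Fraction

# Made-up counts: the motif is enriched because n/N > nb/Nb.
n, N = 50, 1000    # motif occurrences and total occurrences in the sample
nb, Nb = 20, 1000  # the same quantities in the background

# Ratio without the duplicated motif: (n/N) / (nb/Nb)
ratio_before = Fraction(n, N) / Fraction(nb, Nb)

# Ratio after adding an identical motif: (n/(N+n)) / (nb/(Nb+nb))
ratio_after = Fraction(n, N + n) / Fraction(nb, Nb + nb)

print(float(ratio_before), float(ratio_after))
```

Algebraically, ratio_after / ratio_before = N(Nb+nb) / (Nb(N+n)), which is below 1 exactly when n/N > nb/Nb, so the ratio shrinks for any enriched motif.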

So I feel that highly correlated motifs also affect the p-value, although each motif is considered independently. Sorry, I am just sharing some rather premature thoughts that I have been mulling over for a couple of weeks, hoping that someone could clarify them. Then I came across monaLisa, which deals with a similar issue, and I just couldn't help asking.

Best, Jingkui

mbstadler commented 2 years ago

Dear Jingkui @lengfei5

Thank you for the good wishes from Vienna!

I think that your assumption is based on a misunderstanding of how the p values are calculated. Let's take the (default) Fisher's exact test as a basis (see https://github.com/fmicompbio/monaLisa/blob/master/R/utils_enrichment.R#L29-L59):

The p value is calculated using a contingency table:

              withHit  noHit
   foreground    x       y
   background    z       w

These four numbers (x, y, z and w) are the (weighted) number of sequences with or without predicted motif hit in the foreground and background sets of sequences. They only depend on the sequences (which are given and constant) and the single motif that is currently being analyzed, so as mentioned, the presence of other motifs has no impact.
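As a rough illustration of this independence, the following Python sketch builds the contingency table for one motif from per-sequence hit indicators and computes a Fisher's exact test p value. It is not the monaLisa implementation: it assumes a one-sided test for enrichment, uses unweighted integer counts instead of weighted ones, and the helper names (`contingency`, `fisher_greater`) and the data are hypothetical.

```python
from math import comb


def contingency(hits, is_foreground):
    """Count sequences with/without a hit in foreground and background."""
    x = sum(1 for h, f in zip(hits, is_foreground) if h and f)
    y = sum(1 for h, f in zip(hits, is_foreground) if not h and f)
    z = sum(1 for h, f in zip(hits, is_foreground) if h and not f)
    w = sum(1 for h, f in zip(hits, is_foreground) if not h and not f)
    return x, y, z, w


def fisher_greater(x, y, z, w):
    """One-sided Fisher's exact test: P(at least x foreground sequences with a hit)."""
    n_fg = x + y   # number of foreground sequences
    n_hit = x + z  # number of sequences with a hit
    total = x + y + z + w
    upper = min(n_fg, n_hit)
    tail = sum(comb(n_hit, k) * comb(total - n_hit, n_fg - k)
               for k in range(x, upper + 1))
    return tail / comb(total, n_fg)


# Made-up data: 10 foreground and 10 background sequences.
is_foreground = [True] * 10 + [False] * 10
motif_hits = {"motifA": [True] * 8 + [False] * 2 + [True] * 2 + [False] * 8}

p_before = fisher_greater(*contingency(motif_hits["motifA"], is_foreground))

# Add an identical motif; the table for motifA uses only motifA's own hits.
motif_hits["motifA_copy"] = list(motif_hits["motifA"])
p_after = fisher_greater(*contingency(motif_hits["motifA"], is_foreground))

print(p_before, p_after)
```

Because the table for `motifA` is computed from the sequences and that motif's hits alone, adding `motifA_copy` cannot change its raw p value.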

Let me know if you still have any questions. Best, Michael

lengfei5 commented 2 years ago

Hi Michael,

Thanks for clarifying this; I get it now. You are using the number of sequences with/without predicted motif hits in the foreground and background, which is what I had missed before. I appreciate the discussion with you.

Best, Jingkui

mbstadler commented 2 years ago

Yes, exactly. I will close the issue for now - feel free to reopen it in case you have further questions. Best, Michael