MaayanLab / enrichr_issues

5 stars 3 forks source link

[BUG] #7

Closed imichalop closed 3 years ago

imichalop commented 3 years ago

Dear Prof. Ma’ayan,

We decided to export the output of our gene coexpression analysis tool to Enrichr which is an amazing tool that provides numerous enrichment categories, an easy way to import data and comprehensible visualisation of the results. However, while examining the results, we also noticed some probable discrepancies, which we will try to explain:

1) Adjusted p-value calculation. We have concerns that the False Discovery Rate (FDR) adjusted p-values (visible in the table view of each enrichment category) are incorrect. If we sort the terms with the regular p-values, then the adj p-value of the first ranked term should also be the lowest one. We believe that this is due to the fact that the FDR algorithm wasn't implemented correctly: If a lowest ranked term produces a higher adj p-value than the previous one, then its adj p-value should be equal to the adj p-value of the previous term. In order to test this supposition, we used this list of genes as input to Enrichr:

ZPBP YBX2 PHF7 REXO5 SOX30

We turned our attention to the ARCHS4 TFs Coexp enrichment table. (IMG_1) The first 8 enrichment terms possess the same p-value, yet their adj p-value is different, which shouldn't be the case. The pseudocode of FDR algorithm is explained at:

https://riffyn.com/riffyn-blog/2017/10/29/false-discovery-rate

The step which seems to be missing in Enrichr is the following:

"If FDRtemp < FDR Then FDR = FDRtemp"

In order to test that the method of FDR is correct you can use the following commands in R:

p <- c(.0015,.002,.005, 0.1) p.adjust(p, method = 'fdr') [1] 0.004000000 0.004000000 0.006666667 0.100000000

As we can see, the adj p-value results of the two lowest p-value terms (0.0015,0.002) is the same, although the lowest term (0.0015) would produce a higher adj p-value otherwise (pvalue n/rank, 0.00154/1=0.006)

We can only suggest corrections on the adj p-value calculation, since it is the only ranking method we can fully understand.

2) Incorrect table header descriptions when mouse hovering. We have noticed that the description of the table headers is not configured correctly. For example, when mouse hovering over the p-value header in table view, the description for the calculation of the Combined Score is shown instead (IMG_1). We suppose this is due to the fact that the header descriptions were not updated from the previous version of the Enrichr website to the new one.

Furthermore, since we are determined to utilise Enrichr for our research, we would be grateful if a cutoff for each category (P-value, adj P-value, Odds Ratio, Combined Score) could be introduced in the form, for example "Show all results with an adj p-value < 0.05 (user selected number)". We believe it would be a useful addition, helping researchers to ignore non-statistically-significant results.

Thank you in advance.

Yours Sincerely,

Dr Ioannis Michalopoulos - Vasileios Zogopoulos

lachmann12 commented 3 years ago

Fixed, still needs to be rebuilt for Docker deployment and pushed to the cloud.

imichalop commented 3 years ago
I am afraid it is not fixed: If you try your 375 gene set example > Transcription > ENCODE and ChEA Consensus TFs from ChIP-X > Table > Sort by p-value (ASC): Index Name P-value Adjusted p-value Odds Ratio Combined score
1 ERG CHEA 0.001607 0.1671 3.33 21.44
2 NR2C2 ENCODE 0.002660 0.1383 2.29 13.55
3 GABPA ENCODE 0.01964 0.6808 1.33 5.24
4 HNF4A ENCODE 0.03835 0.9972 1.38 4.50
5 ZMIZ1 ENCODE 0.03860 0.8029 1.46 4.75
6 ELF1 ENCODE 0.08120 1.000 1.20 3.02
7 TAF1 ENCODE 0.08819 1.000 1.16 2.83
8 ZNF384 ENCODE 0.09471 1.000 1.39 3.27
9 TRIM28 CHEA 0.1012 1.000 1.78 4.07
10 RCOR1 ENCODE 0.1126 1.000 1.37 2.99

The second adjusted p-value is lower than the first one and the fifth one lower than the fourth one. Instead, the first adjusted p-value should get the value of the second one and the fourth one that of the fifth one.

imichalop commented 3 years ago

My proposed fix will probably stop the fact that most adjusted p-values are equal to 1. The highest adjusted p-value is equal to the highest original p-value which in most of the cases is lower than 1.

375 gene set example > Transcription > ARCHS4 TFs Coexp > Table > Sort by p-value (DESC):

Index Name P-value Adjusted p-value Odds Ratio Combined score
1 ETV1 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
2 TFAP2B human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
3 NFIX human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
4 ETV7 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
5 POU2F2 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
6 NR2E3 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
7 SOX30 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
8 NFE2L3 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
9 MXD1 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00
10 DLX3 human tf ARCHS4 coexpression 0.9967 1.000 0.18 0.00

The adjusted p-value should be 0.9967 or less, not 1.

imichalop commented 3 years ago

Furthermore, to include hits with adjusted p-values higher than a set cut-off may be misleading to non-experienced users. In the case of the aforementioned example 375 gene set, none of the transcription factors is statistically significant, as the adjusted p-value of all exceeds the 0.05 or even 0.1 cut-off. Nevertheless, a list of TFs is produced.

lachmann12 commented 3 years ago

Sorry for the confusion. The code was updated but the service still needs to be redeployed to the servers. Only then will the changes be visible.

imichalop commented 3 years ago

Thank you for your prompt response. Happy New Year!

AviMaayan commented 3 years ago

Dear @imichalop Happy New Year! Can you please take a look at it to see if it corrected... Many thanks, Avi

imichalop commented 3 years ago

Dear @AviMaayan Thank you for your prompt response. I just checked it. It is perfect now! Well done! Happy New Year! Ioannis