geneontology / panther-enrichment

One of the main uses of the GO is to perform enrichment analysis on gene sets. For example, given a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for that gene set.

Issues with the enrichment tool #10

Open ValWood opened 5 years ago

ValWood commented 5 years ago

Love the new front page. I thought I should check the enrichment tool again. I tried an enrichment with the complete set of 456 fission yeast proteins annotated to "mitotic cell cycle".

This was the result:

mitotic cell cycle

This is due to the non-inclusion of the regulates relationship. ...

This is a really big problem. It makes much more sense to include this relationship for enrichment, whether "regulation" is correctly or incorrectly annotated. Indirect, causally upstream annotations to "regulates" are a pervasive problem, but not a large enough one to subvert the output of an enrichment (and they would usually affect it in a positive way, because the genes are intimately connected to the enriched process by affecting it, albeit indirectly).

The larger problem is that often, when you "regulate a sub-process", you are part of the process. We don't instantiate this in GO, but it has never really worried me because the majority of tools include the "regulates" relationship by default.

I could provide many thousands of examples. Here is one I was looking at earlier:

Translation elongation factor eIF-5A is annotated to positive regulation of translational elongation (IGI) and positive regulation of translational termination (ISA).

This means that eIF-5A would not enrich to translation. Performing enrichment like this is detrimental for every list that I have tested. Try it!
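The eIF-5A case can be sketched in a few lines of Python. This is a minimal illustration, not the PANTHER implementation: the `EDGES` graph is a hand-written toy fragment (an assumption, not an export of GO), and `closure` and `counts_toward` are hypothetical helper names.

```python
# Toy ontology edges: child term -> list of (relation, parent term).
# These edges are illustrative assumptions, not a dump of the real GO graph.
EDGES = {
    "positive regulation of translational elongation": [
        ("positively_regulates", "translational elongation"),
    ],
    "positive regulation of translational termination": [
        ("positively_regulates", "translational termination"),
    ],
    "translational elongation": [("part_of", "translation")],
    "translational termination": [("part_of", "translation")],
}

# Direct annotations for the eIF-5A example given in the comment above.
ANNOTATIONS = {
    "eIF-5A": {
        "positive regulation of translational elongation",
        "positive regulation of translational termination",
    },
}

STRICT = frozenset({"is_a", "part_of"})
WITH_REGULATES = STRICT | {"regulates", "positively_regulates", "negatively_regulates"}


def closure(terms, relations):
    """Return the given terms plus every ancestor reachable over `relations`."""
    seen = set(terms)
    stack = list(terms)
    while stack:
        term = stack.pop()
        for relation, parent in EDGES.get(term, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def counts_toward(gene, term, relations):
    """True if `gene` would be counted for `term` under the chosen closure."""
    return term in closure(ANNOTATIONS[gene], relations)


print(counts_toward("eIF-5A", "translation", STRICT))          # False
print(counts_toward("eIF-5A", "translation", WITH_REGULATES))  # True
```

With only is_a/part_of, the walk never leaves the two "positive regulation of ..." terms, so the gene is not counted for "translation". Following the regulates edges first is what connects it.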

ValWood commented 5 years ago

Test list for mitotic cell cycle: gene_list.txt

ValWood commented 5 years ago

Another example: I use the list of genes involved in "DNA replication" (replication.txt), but only 106 of the 118 genes in my list enrich to replication.

ValWood commented 5 years ago

I would expect all of these to be annotated to "DNA replication"

Systematic ID  Name    Product description
SPAC27E2.05    cdc1    DNA polymerase delta small subunit Cdc1
SPAC20G8.01    cdc17   ATP-dependent DNA replication ligase Cdc17
SPBC14C8.07c   cdc18   MCM loader
SPBC11B10.09   cdc2    cyclin-dependent protein kinase Cdk1/Cdc2
SPBC25H2.13c   cdc20   DNA polymerase epsilon catalytic subunit Pol2
SPBC1347.10    cdc23   MCM-associated protein Mcm10
SPAC8F11.07c   cdc24   DNA replication protein Cdc24
SPAC24H6.05    cdc25   M phase inducer tyrosine phosphatase Cdc25
SPBC1734.02c   cdc27   DNA polymerase delta subunit Cdc27
SPAC17D4.02    cdc45   DNA replication pre-initiation complex subunit Cdc45
SPBC336.04     cdc6    DNA polymerase delta catalytic subunit Cdc6
SPBC12D12.02c  cdm1    DNA polymerase delta subunit Cdm1
SPCC18B5.11c   cds1    replication checkpoint kinase Cds1
SPBC428.18     cdt1    replication licensing factor Cdt1
SPAC17H9.19c   cdt2    WD repeat protein Cdt2
SPAC3G6.11     chl1    ATP-dependent DNA helicase Chl1 (predicted)
SPBC902.02c    ctf18   Ctf18 RFC-like complex subunit Ctf18
SPCC338.08     ctp1    CtIP-related endonuclease
SPAC17G6.12    cul1    cullin 1
SPBC17D11.08   dca7    WD repeat protein, DDB1 and CUL4-associated factor Dca7 (predicted)
SPCC550.13     dfp1    Hsk1-Dfp1 kinase complex regulatory subunit Dfp1
SPBC16D10.04c  dna2    DNA replication endonuclease-helicase Dna2
SPBP8B7.14c    dpb2    DNA polymerase epsilon catalytic subunit B, Dpb2
SPCC16C4.22    dpb3    DNA polymerase epsilon Dpb3
SPBC3D6.09     dpb4    DNA polymerase epsilon subunit Dpb4
SPAC6B12.11    drc1    replication preinitiation complex assembly protein
SPBC947.11c    elg1    DNA replication factor C complex subunit Elg1
SPAPB1E7.06c   eme1    Holliday junction resolvase subunit Eme1
SPBC29A10.05   exo1    exonuclease I Exo1
SPBC336.01     fbh1    DNA helicase I, ubiquitin ligase F-box adaptor Fbh1
SPAC9.05       fml1    ATP-dependent 3' to 5' DNA helicase, FANCM ortholog Fml1
SPAC20H4.04    fml2    ATP-dependent 3' to 5' DNA helicase (predicted)
SPBC776.12c    hsk1    Dbf4(Dfp1)-dependent protein kinase Hsk1
SPBC365.09c    kin17   human KIN ortholog (predicted)
SPBC146.09c    lsd1    histone demethylase SWIRM1
SPAC1687.04    mcb1    MCM binding protein homolog Mcb1
SPAPB1E7.02c   mcl1    DNA polymerase alpha accessory factor Mcl1
SPBC4.04c      mcm2    MCM complex subunit Mcm2
SPCC1682.02c   mcm3    MCM complex subunit Mcm3
SPCC16A11.17   mcm4    MCM complex subunit Mcm4/Cdc21
SPAC1B2.05     mcm5    MCM complex subunit Mcm5
SPBC211.04c    mcm6    MCM complex subunit Mcm6
SPBC25D12.03c  mcm7    MCM complex subunit Mcm7
SPAC27D7.03c   mei2    RNA-binding protein involved in meiosis Mei2
SPAC14C4.03    mek1    Cds1/Rad53/Chk2 family protein kinase Mek1
SPAC26H5.02c   mgs1    DNA replication ATPase Mgs1 (predicted)
SPBC2D10.16    mhf1    CENP-S ortholog, FANCM-MHF complex subunit Mhf1
SPCC576.12c    mhf2    CENP-X ortholog, FANCM-MHF complex subunit Mhf2
SPAC3H8.05c    mms1    Cul8-RING ubiquitin ligase complex subunit Mms1 (predicted)
SPAC694.06c    mrc1    claspin, Mrc1
SPAC8F11.03    msh3    MutS protein homolog 3
SPAC637.12c    mst1    KAT5 family histone acetyltransferase Mst1
SPAC6B12.02c   mus7    DNA repair protein Mus7/Mms22
SPCC4G3.05c    mus81   Holliday junction resolvase subunit Mus81
SPBC651.10     nse5    Smc5-6 complex non-SMC subunit Nse5
SPAC11E3.08c   nse6    Smc5-6 complex non-SMC subunit Nse6
SPBC29A10.15   orc1    origin recognition complex subunit Orc1
SPBC685.09     orc2    origin recognition complex subunit Orc2
SPAC3H1.01c    orc3    origin recognition complex subunit Orc3
SPBP23A10.13   orc4    origin recognition complex subunit Orc4
SPBC646.14c    orc5    origin recognition complex subunit Orc5
SPBC2A9.12     orc6    origin recognition complex subunit Orc6
SPBC29A10.03c  pcf1    CAF assembly factor (CAF-1) complex large subunit Pcf1
SPAC26H5.03    pcf2    CAF assembly factor (CAF-1) complex subunit B, Pcf2
SPAC25H1.06    pcf3    CAF assembly factor (CAF-1) complex subunit C, Pcf3
SPBC16D10.09   pcn1    PCNA
SPBC887.14c    pfh1    5' to 3' DNA helicase Pif1/Pfh1
SPCC126.02c    pku70   Ku domain protein Pku70
SPCC338.16     pof3    F-box protein Pof3
SPCC24B10.22   pog1    mitochondrial DNA polymerase gamma Pog1
SPAC3H5.06c    pol1    DNA polymerase alpha catalytic subunit
SPAC4D7.03     pop2    F-box/WD repeat protein Pop2
SPBP23A10.09   psf1    GINS complex subunit Psf1
SPBC725.13c    psf2    GINS complex subunit Psf2
SPAC227.16c    psf3    GINS complex subunit Psf3
SPAC3G6.06c    rad2    FEN-1 endonuclease Rad2
SPAC9E9.08     rad26   ATRIP, ATR checkpoint kinase regulatory subunit Rad26
SPBC216.05     rad3    ATR checkpoint kinase Rad3
SPAC23C4.18c   rad4    BRCT domain protein Rad4
SPAC1556.01c   rad50   DNA repair protein Rad50
SPAC644.14c    rad51   RecA family recombinase Rad51/Rhp51
SPAC30D11.10   rad52   DNA recombination protein, Rad51 mediator Rad52 (previously Rad22)
SPBC1921.02    rad60   DNA repair protein, SUMO-related Rad60
SPBC1198.11c   reb1    RNA polymerase I transcription termination factor/ RNA polymerase II transcription factor Reb1
SPBC16E9.17c   rem1    meiosis-specific cyclin Rem1
SPBC23E6.07c   rfc1    DNA replication factor C complex subunit Rfc1
SPAC23D3.02    rfc2    DNA replication factor C complex subunit Rfc2
SPAC27E2.10c   rfc3    DNA replication factor C complex subunit Rfc3
SPAC1687.03c   rfc4    DNA replication factor C complex subunit Rfc4
SPBC83.14c     rfc5    DNA replication factor C complex subunit Rfc5
SPAC6F6.17     rif1    telomere length regulator protein Rif1
SPAC2F3.04c    rim1    mitochondrial single-stranded DNA binding protein Rim1
SPBC336.06c    rnh1    ribonuclease H Rnh1 (predicted)
SPAC4G9.02     rnh201  ribonuclease H2 complex subunit Rnh201
SPAC2G11.12    rqh1    RecQ type DNA helicase Rqh1
SPAC17A2.12    rrp1    ATP-dependent DNA helicase/ ubiquitin-protein ligase E3 (predicted)
SPBC23E6.02    rrp2    ATP-dependent DNA helicase/ ubiquitin-protein ligase E3 (predicted)
SPAC22F8.07c   rtf1    replication termination factor Rtf1
SPAC1D4.09c    rtf2    replication termination factor Rtf2
SPBC32F12.09   rum1    CDK inhibitor Rum1
SPCC1672.02c   sap1    switch-activating protein Sap1
SPAC24H6.06    sld3    DNA replication pre-initiation complex subunit Sld3
SPBP4H10.21c   sld5    GINS complex subunit Sld5
SPAP27G11.15   slx1    structure-specific endonuclease catalytic subunit Slx1
SPAC688.06c    slx4    structure-specific endonuclease subunit Slx4
SPCC553.09c    spb70   DNA polymerase alpha B-subunit
SPAC6B12.10c   spp1    DNA primase catalytic subunit Spp1
SPBC17D11.06   spp2    DNA primase large subunit Spp2
SPAC4H3.05     srs2    ATP-dependent DNA helicase, UvrD subfamily
SPBC660.13c    ssb1    DNA replication factor A subunit Ssb1
SPCC1753.01c   ssb2    single-stranded DNA binding protein Ssb2
SPCC23B6.05c   ssb3    DNA replication factor A subunit Ssb3
SPBC216.06c    swi1    replication fork protection complex subunit Swi1
SPBC30D10.04   swi3    replication fork protection complex subunit Swi3
SPAC16A10.07c  taz1    shelterin complex subunit Taz1
SPCC23B6.03c   tel1    ATM checkpoint kinase
SPBC16G5.12c   top3    DNA topoisomerase III
SPAC6F6.16c    tpz1    shelterin complex subunit Tpz1
SPAC12B10.15c          ribonuclease H2 complex subunit (predicted)
SPBC1347.08c           ribonuclease H2 complex subunit (predicted)
SPCC737.07c            DNA polymerase alpha-associated DNA helicase A (predicted)

ValWood commented 5 years ago

For example, you could be annotated only to "positive regulation of DNA primase activity (GO:1903934)" (a process), which is clearly part of DNA replication?

ukemi commented 5 years ago

If the part_of relationship between a regulation process and another process is always true, it should be asserted in the ontology. Sometimes it is difficult to make the decision, but lots of times it is not.

13926

ValWood commented 5 years ago

I agree... but blanket exclusion of "regulates" annotations definitely makes enrichments worse.

hattrill commented 5 years ago

I think that it is good to think about what users will be enriching on: an RNAi screen, a proteomics experiment. The candidates they have will include regulators, so we should include an option to expand the net via relations.

ValWood commented 5 years ago

Also, the enrichment options from the GO home page seem to be "sticky".

The first time I visited, I got choices to do different types of analyses. Now, whatever I do, it always thinks I want the "statistical overrepresentation" option, and I don't see any other choices.

lpalbou commented 5 years ago

Hi @ValWood , I am unable to reproduce the "sticky" problem.

  1. Are you referring to the main GO page, where you cannot change, for instance, BP/MF/CC or the species? Or are you referring to the page on pantherdb.org?
  2. What browser are you using, and do you have a set of steps to reproduce it?
  3. At this time, the GO site performs only the overrepresentation test (single small gene set); we do not allow enrichment analysis (single large gene set with associated expression values). To perform the enrichment test, you need to go to http://pantherdb.org/ directly.

lpalbou commented 5 years ago

@ValWood I tried on 2 different machines but wasn't able to see it. Maybe someone was working on something at that time, but the expected behavior when clicking on "Launch" from the GO site is to arrive on this page, where the "type" of analysis is already set to overrepresentation test:

[Screenshot, 2019-02-16 at 4:44:06 pm]

Concerning the closures used for enrichment, I have passed on the message and we'll see what can be done. For the next iteration of the PANTHER site, it sounds like a good idea to have an option for that.
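To show where the closure enters the statistics, here is a minimal sketch of an overrepresentation test, assuming a one-sided hypergeometric tail. The thread does not state which statistic PANTHER applies, so the function name `overrepresentation_p`, its parameters and the example counts are illustrative assumptions, not values from the tool.

```python
from math import comb


def overrepresentation_p(study_hits, study_size, background_hits, background_size):
    """P(X >= study_hits) when `study_size` genes are drawn without replacement
    from a background in which `background_hits` genes carry the term."""
    max_hits = min(study_size, background_hits)
    total = comb(background_size, study_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Made-up counts, for illustration only: a 5000-gene background with
# 120 genes counted for the term, and a 100-gene study list with 10 hits.
p = overrepresentation_p(study_hits=10, study_size=100,
                         background_hits=120, background_size=5000)
print(f"p = {p:.3g}")
```

Changing the closure (for example, adding regulates) changes which genes are counted for a term, and therefore changes both `study_hits` and `background_hits` before the test is run.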

ValWood commented 5 years ago

Yeah I have no idea what I was doing. It would be good to be able to toggle "regulates" transitivity/closure though. val

huaiyumi commented 5 years ago

Regulates relation was once in PANTHER many years ago. It caused tremendous confusion to the users and the results were hard to interpret. That is why we pulled it. It is possible that the ontology is better now, and we can certainly test it again. We will put it on the todo list.

lpalbou commented 5 years ago

I personally like the idea of a toggle to enable an enrichment on terms directly related to the genes and on the terms the genes regulate. We have been discussing it from time to time, too.

ValWood commented 5 years ago

It is a major problem because many genes that are parts of processes are also involved in regulation. Our current annotation conventions require that we annotate these to "regulates". Therefore much gets lost from enrichments and slims.

see https://github.com/pombase/curation/issues/2245 for examples

lpalbou commented 5 years ago

Just to confirm an action item: would a toggle to enrich either on is_a/part_of (current behavior) OR on regulates only be an acceptable solution?

ValWood commented 5 years ago

A toggle would be a good start. Personally, I think "regulates" should be the default, and I can demo this (I have never seen a dataset where this would not be the case), but it's a consortium decision. That said, it would match what other tools do.

cmungall commented 5 years ago

> Just to confirm an action item: would a toggle to enrich either on is_a/part_of (current behavior) OR on regulates only be an acceptable solution?

I believe the request is to include genes that are involved in regulation of the process, as well as genes involved in the process.

lpalbou commented 5 years ago

@cmungall yes, but it seems this needs more discussion, and at least for the moment we should have both:

> Regulates relation was once in PANTHER many years ago. It caused tremendous confusion to the users and the results were hard to interpret. That is why we pulled it. It is possible that the ontology is better now, and we can certainly test it again. We will put it on the todo list.

ValWood commented 5 years ago

Yes if toggle

is_a/part_of OR is_a/part_of + regulates
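The toggle variants raised in this thread can be compared on a toy example. This is a sketch under stated assumptions: the `Mode` names, the `EDGES` fragment and the genes `gene_a` and `gene_b` are all hypothetical, and "regulates only" is interpreted here as "counted with regulates but not counted without it", which is only one possible reading.

```python
from enum import Enum

BASE = frozenset({"is_a", "part_of"})
REGULATES = frozenset({"regulates", "positively_regulates", "negatively_regulates"})


class Mode(Enum):
    """Hypothetical toggle values; the names are illustrative only."""
    BASE_ONLY = "is_a/part_of"
    BASE_PLUS_REGULATES = "is_a/part_of + regulates"
    REGULATES_ONLY = "regulates only"


# Toy graph and annotations (assumptions for illustration, not real GO data).
EDGES = {
    "regulation of DNA replication initiation": [
        ("regulates", "DNA replication initiation"),
    ],
    "DNA replication initiation": [("part_of", "DNA replication")],
}
ANNOTATIONS = {
    "gene_a": {"DNA replication initiation"},
    "gene_b": {"regulation of DNA replication initiation"},
}


def ancestors(term, relations):
    """Return `term` plus every ancestor reachable over `relations`."""
    found, stack = {term}, [term]
    while stack:
        for relation, parent in EDGES.get(stack.pop(), []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def genes_for(term, relations):
    """Genes with at least one direct annotation whose closure contains `term`."""
    return {
        gene
        for gene, direct in ANNOTATIONS.items()
        if any(term in ancestors(t, relations) for t in direct)
    }


def genes_by_mode(term, mode):
    base = genes_for(term, BASE)
    if mode is Mode.BASE_ONLY:
        return base
    combined = genes_for(term, BASE | REGULATES)
    if mode is Mode.BASE_PLUS_REGULATES:
        return combined
    return combined - base  # REGULATES_ONLY: reached only via a regulates edge


for mode in Mode:
    print(mode.value, sorted(genes_by_mode("DNA replication", mode)))
```

On this toy data the combined mode counts both genes for "DNA replication", the current behavior counts only `gene_a`, and the "regulates only" reading counts only `gene_b`.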

lpalbou commented 5 years ago

OK, so you don't think there is value in highlighting only the functions that are regulated by genes? We could still have:

ValWood commented 5 years ago

Personally, I don't think option 3 would be useful.