geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

Proposal : remove pre-composed MF involved in BP terms #14138

Closed ValWood closed 1 year ago

ValWood commented 7 years ago

Is this a good place to always post-compose?

http://www.ebi.ac.uk/QuickGO/searchterms/activity%20involved%20in?ontologyType=GO&aspect=Function

not sustainable......

pgaudet commented 7 years ago

Hi,

We discussed this on the editors call on Sept 1, and the conclusion was that although we don't want to create many more of these terms, some may be useful, so we won't obsolete them all. However, we can obsolete those that have not been much used.

@dosumis @cmungall @vanaukenk does this correctly summarize the discussion?

Thanks, Pascale

dosumis commented 7 years ago

Re: "However we can obsolete those that have not been much used."

Where usage includes grouping - i.e. subclass or subpart terms have been used and the 'involved in' term provides a useful way of grouping them.

ValWood commented 7 years ago

So, the reason I suggested it is that it is pretty confusing for curation to have these terms if the more "modern" GO-CAM way to do this is by post-composing MF involved in BP. It makes a nice point where a curator and an ontology editor would know immediately whether a term belongs in the ontology. Obviously it isn't practical to precompose every MF involved in BP term; there would be, quite literally, hundreds of thousands (i.e. every single function annotation which can't have an MF part_of BP link instantiated in the ontology, which is MOST).

So, for the terms which are considered useful, why are they more useful than others? (The group already included as precomposed function terms must be pretty arbitrary.) I'm not suggesting we go ahead and immediately obsolete, but it would be nice if a long-term goal was to post-compose all of these. It would just be easier to explain to curators, and semantically identical.

If the reason is for enrichments, the tools should handle this eventually (because there will be many GO-CAM annotations of this type which might also be useful to enrich over).

ValWood commented 7 years ago

Re: "Where usage includes grouping - i.e. subclass or subpart terms have been used and the involved in term provides a useful way of grouping them."

Can you provide a specific example of this so that I can see why they are useful?

tberardini commented 7 years ago

I think this was the example we spoke of:

http://amigo.geneontology.org/amigo/term/GO:0004197#display-lineage-tab

See the different children with %involved in% of cysteine-type endopeptidase activity

ValWood commented 7 years ago

Ah right. In this case, the suggestion isn't that we can't make these annotations (we can make them all the time; we make hundreds of such annotations, and people make them as part of Noctua models).

Instead it would be "cysteine-type peptidase" part_of "apoptotic process" or whatever. It's semantically the same, and it might not seem important, but annotations are displayed on gene pages, and we need to always be mindful of what they look like in a single gene context. With the current situation, terms can be arbitrarily pre-composed or post-composed at this point. This can end up looking very, very messy on gene pages once you get lots of F-P links.

These are all post-composed F-P links:
http://preview.pombase.org/gene/SPAC24B11.06c
http://preview.pombase.org/gene/SPBC11B10.09

For cdc2 (2nd link), you can imagine how messy this would look if the annotations for all of these substrates, which regulate individual processes at different times, were a mixture of pre-composed and post-composed MF involved in BP terms.

Terms which are precomposed would look like this:

cyclin-dependent protein serine/threonine kinase activity involved in negative regulation of conjugation with cellular fusion
cyclin-dependent protein serine/threonine kinase activity involved in correction of merotelic kinetochore attachment, mitotic has substrate fkh2
cyclin-dependent protein serine/threonine kinase activity involved in negative regulation of mitotic spindle elongation has substrate mde4 during mitotic anaphase A

Instead we can display the function term once, which keeps it much cleaner.

The post-composition also allows you to manage your display and group flexibly by similar process (or by phase, or whatever makes most sense biologically).
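A minimal sketch of that kind of display, assuming each post-composed annotation is held as a function term plus a dictionary of extensions. The tuple layout and the helper names `group_by_function` and `render` are hypothetical; the term names are transcribed from the cdc2 example above.

```python
from collections import defaultdict

# Post-composed cdc2 annotations, transcribed from the term names above.
# The (function, extensions) layout is an assumption made for illustration.
KINASE = "cyclin-dependent protein serine/threonine kinase activity"
annotations = [
    (KINASE, {"part_of": "negative regulation of conjugation with cellular fusion"}),
    (KINASE, {"part_of": "correction of merotelic kinetochore attachment, mitotic",
              "has_substrate": "fkh2"}),
    (KINASE, {"part_of": "negative regulation of mitotic spindle elongation",
              "has_substrate": "mde4",
              "during": "mitotic anaphase A"}),
]


def group_by_function(rows):
    """Collect the extensions of each annotation under its function term."""
    grouped = defaultdict(list)
    for function, extensions in rows:
        grouped[function].append(extensions)
    return dict(grouped)


def render(grouped):
    """Return display lines: the function term once, then one indented line per annotation."""
    lines = []
    for function, ext_list in grouped.items():
        lines.append(function)
        for ext in ext_list:
            lines.append("    " + ", ".join(f"{rel} {val}" for rel, val in ext.items()))
    return lines


display_lines = render(group_by_function(annotations))
print("\n".join(display_lines))
```

Grouping by the `part_of` value instead of the function term would give the by-process view; the same rows support either layout.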

If it still doesn't make sense, look at the cdc2 page in AmiGO http://amigo.geneontology.org/amigo/gene_product/PomBase:SPBC11B10.09 (288 independent annotations, so you need to scroll through 3 pages even if you opt for the maximum of 100). Incidentally, AmiGO isn't even showing part_of BP extensions yet; I'll open a ticket for that.

ValWood commented 7 years ago

This is more of a proposal for the MF ontology, so I added MF refactoring and removed documentation.

ValWood commented 7 years ago

I still think it would be good to obsolete the unused terms right away, with a recommendation to make MF involved in BP annotations instead.

I think some are terms PomBase requested when we were going down this route. I think we have now changed most things to MF involved in BP for consistency.

ValWood commented 6 years ago

There are only 196 such MF terms, and many of these are already obsolete (I can't filter out the obsolete ones in QuickGO, so it isn't possible to get the actual number). Many are not used, or have a very small number of annotations.

https://www.ebi.ac.uk/QuickGO/searchterms/involved%20in?aspect=Function

It would be great to get rid of the unused ones.....

RLovering commented 6 years ago

Hi All

I have brought this issue to the SynGO group, because I am aware that their annotations are not released yet and many of the terms you have listed were created for this project. Their response is below:

There is a large batch of annotations underway atm, Dustin Egbert is currently working to make sure everything lands well in GO-CAM, so that would add content to a few of the terms you listed.

I would advocate that we keep all ontology terms assigned to the SynGO goslim. We made the synapse ontology 'models' with correctness and consistency in mind, in a concerted effort with many leading scientists in our field. If one were to saturate the annotation of all synapse biology, these terms should be used at some point. And the long-term goal for SynGO would be to saturate synaptic annotations as far as published data would allow us to, so these terms should be needed at some point anyway. Perhaps the team leading these GO cleanup efforts could set a separate timeline for re-evaluation of SynGO-created terms, say December 2019, to test whether there are any terms that proved to be unsuited for annotation (although recognized as actual processes in the synapse at the time of creation)?

RLovering commented 6 years ago

FYI this is the list of terms that I identified as relevant to SynGO and this discussion (but I may have missed some):
GO:0099520 ion antiporter activity involved in regulation of presynaptic membrane potential
GO:0098697 ryanodine-sensitive calcium-release channel activity involved in regulation of postsynaptic cytosolic calcium levels
GO:0098695 inositol 1,4,5-trisphosphate receptor activity involved in regulation of postsynaptic cytosolic calcium levels
GO:1905057 voltage-gated calcium channel activity involved in regulation of postsynaptic cytosolic calcium levels
GO:1905056 calcium-transporting ATPase activity involved in regulation of presynaptic cytosolic calcium ion concentration
GO:1905055 calcium:cation antiporter activity involved in regulation of presynaptic cytosolic calcium ion concentration
GO:1905054 calcium-induced calcium release activity involved in regulation of presynaptic cytosolic calcium ion concentration
GO:1905059 calcium-transporting ATPase activity involved in regulation of postsynaptic cytosolic calcium ion concentration
GO:1905058 calcium-induced calcium release activity involved in regulation of postsynaptic cytosolic calcium ion concentration
GO:1905060 calcium:cation antiporter activity involved in regulation of postsynaptic cytosolic calcium ion concentration
GO:0099521 ATPase coupled ion transmembrane transporter activity involved in regulation of presynaptic membrane potential
GO:0099507 ligand-gated ion channel activity involved in regulation of presynaptic membrane potential
GO:0099508 voltage-gated ion channel activity involved in regulation of presynaptic membrane potential
GO:0099626 voltage-gated calcium channel activity involved in regulation of presynaptic cytosolic calcium levels
GO:0099635 voltage-gated calcium channel activity involved in positive regulation of presynaptic cytosolic calcium levels
GO:0099581 ATPase coupled ion transmembrane transporter activity involved in regulation of postsynaptic membrane potential
GO:0099582 neurotransmitter receptor activity involved in regulation of presynaptic cytosolic calcium ion concentration
GO:0099579 G-protein coupled neurotransmitter receptor activity involved in regulation of postsynaptic membrane potential
GO:0099102 G-protein gated potassium channel activity involved in regulation of postsynaptic membrane potential
GO:0098872 G-protein coupled neurotransmitter receptor activity involved in regulation of postsynaptic cytosolic calcium ion concentration

RLovering commented 6 years ago

I have also identified 8 terms which were created as part of the cardiac electrophysiology project (cardiac set):

GO:0097364 stretch-activated, cation-selective, calcium channel activity involved in regulation of action potential
GO:0097365 stretch-activated, cation-selective, calcium channel activity involved in regulation of cardiac muscle cell action potential
GO:0086058 voltage-gated calcium channel activity involved in Purkinje myocyte cell action potential
GO:0086084 cell adhesive protein binding involved in Purkinje myocyte-ventricular cardiac muscle cell communication
GO:0086085 cell adhesive protein binding involved in SA cardiac muscle cell-atrial cardiac muscle cell communication
GO:0086088 voltage-gated potassium channel activity involved in Purkinje myocyte action potential repolarization
GO:0086081 cell adhesive protein binding involved in atrial cardiac muscle cell-AV node cell communication
GO:0086086 voltage-gated potassium channel activity involved in AV node cell action potential repolarization

As there is now a cell adhesion activity term, I am happy for the cell adhesive terms to be removed. The channel activities associated with specific cell types were very hard to find direct evidence for; I think most of this evidence is from IEP. Therefore, if you want to delete these, that is OK.

I have not had the chance to capture stretch-activated regulation of action potential, so if you want to delete these, that is OK.

Ruth

ValWood commented 6 years ago

Hi Ruth,

Just a thought but you can easily precompose these MF-BP annotations using GO:0099520 ion antiporter activity part_of regulation of presynaptic membrane potential

And I think there is a pipeline which instantiates the BP "regulation of presynaptic membrane potential" from the annotation extension for slimming? @cmungall could confirm (but I'm sure I have seen them)

I assume that process terms such as "regulation of presynaptic membrane potential" are the terms in your slims, not MF terms such as

GO:0097364 stretch-activated, cation-selective, calcium channel activity involved in regulation of action potential

which would not make much sense as a slim term, as it will only be used at most a handful of times (human only seems to have 3 stretch-activated, cation-selective, calcium channel activity total FAM155A,FAM155B and TRPV4, unless many are unannotated?)

This is likely to be the case for all of the MF terms linked to processes here (these by their position are necessarily leaf nodes)

ValWood commented 6 years ago

Second example: the ryanodine-sensitive calcium-release channel activity term I looked at has no annotations, so this would not slim either.

RLovering commented 6 years ago

Hi Val

There are so many points to address in your comments.

  1. I have not been involved in the SynGO slim. I don't think this has anything to do with your idea of a slim. I think the SynGO slim is a set of terms included in their annotation tool, which limits the terms that the expert scientists can use for annotation, i.e. rather than their experts having to find the terms they want to apply amongst the 42,000 GO terms, they only have the option to apply 200 terms (don't quote me on the number).

  2. SynGO has created >2000 annotations which have not been put in the GO annotation database; therefore we have no idea how many annotations will be associated with their terms until the data arrives in the GO annotation database, so there is no point saying there are no annotations.

  3. When the annotation extension paper was written reviewers questioned whether data would be lost by this process. The response was no. However, there are no analysis tools which can create the precomposed terms and if there are (for example) 20 human proteins which could be associated with the term GO:0099508 voltage-gated ion channel activity involved in regulation of presynaptic membrane potential then this might be a useful grouping term. If this term is removed then instead in the 'conventional' GO annotation files this data will only be captured as : 'voltage-gated ion channel activity' AND 'regulation of presynaptic membrane potential'. In human there will be 100s of proteins annotated to 'voltage-gated ion channel activity' and many 100s of proteins annotated to 'regulation of presynaptic membrane potential'. But maybe these GO term groupings will now be too large to be useful.

I have yet to see a system which will recreate the equivalent GO term GO:0099508 voltage-gated ion channel activity involved in regulation of presynaptic membrane potential from a GO term with an annotation extension, i.e. from 'voltage-gated ion channel activity' with the annotation extension part_of 'regulation of presynaptic membrane potential'. Therefore all the proteins that would be grouped within the specific voltage-gated ion channel activity involved in regulation of presynaptic membrane potential term will no longer be grouped in this way.
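For illustration only, the regrouping described above could look something like the following toy sketch. It is not an existing tool: the row layout, the placeholder protein names, and the `regroup` helper are all hypothetical, and it handles only a single part_of extension per annotation.

```python
from collections import defaultdict

MF = "voltage-gated ion channel activity"
BP = "regulation of presynaptic membrane potential"

# (protein, MF term, part_of extension or None); protein names are placeholders.
rows = [
    ("protein1", MF, BP),
    ("protein2", MF, BP),
    ("protein3", MF, None),  # same function, no process extension
    ("protein4", "ligand-gated ion channel activity", BP),
]


def regroup(rows):
    """Group proteins under an 'MF involved in BP' label built from term + extension."""
    groups = defaultdict(set)
    for protein, mf, bp in rows:
        if bp is not None:
            groups[f"{mf} involved in {bp}"].add(protein)
    return dict(groups)


groups = regroup(rows)
for label, proteins in groups.items():
    print(label, sorted(proteins))
```

Whether a given analysis tool performs this step is a separate question from whether the information needed for it is present in the annotation.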

I would therefore appreciate it if the removal of all these terms could be delayed until a system is in place which effectively prevents this data from being unavailable for conventional GO users (whoever they are???)

It seems to me that this is what Judy was trying to say in https://github.com/geneontology/go-ontology/issues/12427#issuecomment-347197649 but I may be wrong. @judyblake

  4. With regard to your comment: (human only seems to have 3 stretch-activated, cation-selective, calcium channel activity total FAM155A,FAM155B and TRPV4, unless many are unannotated?). I have been trying to explain for years that there are a lot of annotatable human papers which have not been annotated. Admittedly, many of these will have duplicative information. However, I have no idea how many stretch-activated, cation-selective, calcium channels there are. TRPV4 is Transient receptor potential cation channel subfamily V member 4, and HGNC lists around 30 members in this family; however, I am not sure that all are stretch-activated, and probably not all are calcium channels. It seems very odd that 2 FAM proteins have been annotated to this very specific term, but I am not planning to follow this up. I do agree, though, that if there are only 3 gene products with stretch-activated, cation-selective, calcium channel activity per genome then this is not a useful grouping term. Is there going to be a new rule about how many gps are required in order for a GO term to exist?

There are probably more of your points to address, but that is it for now.

Ruth

ValWood commented 6 years ago

Hi Ruth,

Points 1 and 2 are not directly relevant in this case.

However, for point 3:

Re: "When the annotation extension paper was written reviewers questioned whether data would be lost by this process. The response was no."

The answer is still no, no information is lost.

Re: "However, there are no analysis tools which can create the precomposed terms and if there are (for example) 20 human proteins which could be associated with the term GO:0099508 voltage-gated ion channel activity involved in regulation of presynaptic membrane potential then this might be a useful grouping term. If this term is removed then instead in the 'conventional' GO annotation files this data will only be captured as : 'voltage-gated ion channel activity' AND 'regulation of presynaptic membrane potential'. In human there will be 100s of proteins annotated to 'voltage-gated ion channel activity' and many 100s of proteins annotated to 'regulation of presynaptic membrane potential'. But maybe these GO term groupings will now be too large to be useful."

But functions are "small processes". This means that it is rarely useful to slim over MF terms (because they are heterogeneous with respect to process). Sometimes you observe an enrichment over function terms, but this is usually a result of an enriched process. For example, you might see enrichment for protein kinases for a gene set involved in "regulation of G2/M". This observation isn't telling you anything about your results set; it is only confirming known biology, that "regulation of G2/M" is largely performed by protein kinases. So, I would discourage people from using MF terms for enrichment.

In fact, using MF terms for enrichment could be positively misleading if there is an annotation bias towards pre-composed function terms. For example, imagine you have a set of 20 differentially expressed genes, and they include 4 or 5 with 'voltage-gated ion channel activity' which are involved in 'regulation of presynaptic membrane potential' but are also involved in other multicellular processes. If these 4 or 5 happen to be annotated to 'voltage-gated ion channel activity involved in regulation of presynaptic membrane potential' and a GO user enriches over the MF ontology (not understanding the precomposed terms), it is quite likely that they could see an enrichment to 'voltage-gated ion channel activity involved in regulation of presynaptic membrane potential' and erroneously conclude that their genes were enriched for the process of 'regulation of presynaptic membrane potential', even though this is due to a functional overlap between voltage-gated ion channels and the multiple processes they are involved in.

If they then enrich over biological process, and they have no other genes involved in the pathways related to "regulation of presynaptic membrane potential" (which will presumably include many more genes involved in this regulatory process when the annotation is provided), the enrichment to 'regulation of presynaptic membrane potential' would disappear, when compared correctly to the total set of genes which are involved in "regulation of presynaptic membrane potential" .

So, we should be careful recommending that people enrich over function terms (because it isn't usually useful and will not tell you as much as a valid process enrichment), and more especially since enriched "function-involved-in-process" precomposed terms could be biologically misleading.
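A rough numeric sketch of that concern, assuming a one-sided hypergeometric test (a common choice for term enrichment). The study-set size (20) and hit count (5) echo the example above; the genome size (20,000) and the term sizes (20 genes for the precomposed term, 400 for the process term) are invented for illustration, and `hypergeom_upper_tail` is a hypothetical helper rather than any tool's API.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


GENOME = 20_000   # assumed number of annotated genes in the background
STUDY_SET = 20    # differentially expressed genes, as in the example above
HITS = 5          # the voltage-gated ion channels in the study set

# Same five genes, scored against a small precomposed MF term...
p_precomposed = hypergeom_upper_tail(GENOME, annotated=20, sample=STUDY_SET, hits=HITS)
# ...and against a larger process term (size assumed).
p_process = hypergeom_upper_tail(GENOME, annotated=400, sample=STUDY_SET, hits=HITS)

print(f"precomposed MF term: p = {p_precomposed:.3g}")
print(f"process term:        p = {p_process:.3g}")
```

With these made-up numbers, the same five hits score a smaller p-value against the small precomposed term than against the larger process term. The apparent strength of the signal therefore depends heavily on which term the annotations happen to sit on.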

Remember that the practice of "F involved in P" was established before we implemented annotation extensions, and is not practical or scalable. It is more confusing to have some terms like this exist (even more so as they could provide potentially misleading enrichment results).

I don't understand what you mean by "too large to be useful". I can't see how "large GO term grouping" is a valid reason to include these terms, if you are only using them to group a particular function by process in the MF ontology (note that, if the purpose is to sub-group a particular process into functions, this can be done easily by taking the list of genes annotated to a particular process and slimming over the functions).

But I agree that if the annotations are imminent, it seems sensible to wait until they are submitted and then see if the precomposed terms provide any substantial advantages. At the moment this is all moot because as yet there is no annotation to the process 'regulation of presynaptic membrane potential', so there are no large GO grouping terms to discuss or investigate.

Val

ValWood commented 6 years ago

The useful example from the discussion above was http://amigo.geneontology.org/amigo/term/GO:0004197#display-lineage-tab
cysteine-type endopeptidase activity involved in plant-type hypersensitive response

plant-type hypersensitive response has 86 annotations.

What makes the unannotated "cysteine-type endopeptidase activity involved in plant-type hypersensitive response" especially useful?

Why not (other activities involved in this process):
phosphatidic acid phosphatase involved in plant-type hypersensitive response
voltage dependent anion channel involved in plant-type hypersensitive response

mah11 commented 6 years ago

I'm confident that Ruth will correct me if I'm wrong, but I didn't think she was suggesting to slim over MF terms. Rather, she's worried that currently available tools won't give the same results for BP slimming with MF terms + part_of(BP) extensions as with precomposed MF-part_of-BP terms, even though they are semantically equivalent.

At present there are oodles of tools that can follow the part_of link from a precomposed term to the BP term, so BP slim totals include gene products annotated to the precomposed MF term. So far, there are very few tools that can use the information in a part_of(BP) extension to get an annotation to an MF term counted towards the slim total for the BP term in the extension.
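The difference can be sketched as follows. This is a toy model, not any real tool's behaviour: the gene names are placeholders, the ontology is reduced to a single part_of hop, and `bp_slim_members` with its `use_extensions` flag is hypothetical. Only the GO term names come from this thread.

```python
MF = "voltage-gated ion channel activity"
BP = "regulation of presynaptic membrane potential"
PRECOMPOSED = (
    "voltage-gated ion channel activity involved in "
    "regulation of presynaptic membrane potential"
)

# Ontology link a tool can follow from the precomposed term (one hop, simplified).
PART_OF = {PRECOMPOSED: BP}

# (gene product, term, extensions); geneA..geneC are placeholders.
annotations = [
    ("geneA", BP, {}),               # direct BP annotation
    ("geneB", PRECOMPOSED, {}),      # precomposed MF-involved-in-BP term
    ("geneC", MF, {"part_of": BP}),  # MF term with a part_of(BP) extension
]


def bp_slim_members(rows, slim_term, use_extensions):
    """Gene products counted towards `slim_term`, optionally reading extensions."""
    members = set()
    for gene, term, ext in rows:
        if term == slim_term or PART_OF.get(term) == slim_term:
            members.add(gene)
        elif use_extensions and ext.get("part_of") == slim_term:
            members.add(gene)
    return members


without_ext = bp_slim_members(annotations, BP, use_extensions=False)
with_ext = bp_slim_members(annotations, BP, use_extensions=True)
print(sorted(without_ext))
print(sorted(with_ext))
```

A tool that only follows ontology links misses geneC, even though its annotation is semantically equivalent to geneB's.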

pgaudet commented 6 years ago

This slimming argument has been misleading.

There are two possible ways "A involved in B" can be captured:
1) using the precomposed term, with an ontology structure like this:

2) annotating A with extension B, but also annotating B (or somehow having B exported from the extension).

In both cases the statistical power should be the same. The advantage of 2 is that it simplifies the ontology, so annotations are likely to be more consistent.

It seems to me we are somewhere in the middle in that the expected annotations are not exported from the extensions, which leads people to request precomposed terms. However a better approach would probably be to make both annotations manually (and add the extension, to allow users to see the connections, while keeping the ontology simple).

Pascale

mah11 commented 6 years ago

Fair enough. But I think the point about what currently available tools can do is relevant generally, not just to slimming. For any annotation usage, there are more tools available at present that can make sensible use of part_of links than extensions.

I'm planning to stay out of the debate about whether that means it's premature to obsolete existing precomposed terms now. But it will be a very good day when more tools can do more with extensions, not just those equivalent to precomposed MF-part_of-BP.

ValWood commented 6 years ago

I am sure I have seen the annotations for the BP extracted from "MF involved in BP" in the inference files, though (because I complained about them being duplicates of the BP terms we had already made manually, and it took me a while to figure out that they came from unpacking the extension!). So I think that this already happens, but @cmungall can confirm....

ValWood commented 6 years ago

I was referring to enrichment, not slimming. I was pointing out a possible problem with pre-composed terms for the interpretation of enrichments (users do enrich over MF, because they don't understand GO fully). This is an additional potential problem arising from pre-composed MF involved in BP terms.

ValWood commented 4 years ago

This issue can probably close but it would be useful for GO curators to know the current status of MF involved in BP terms

Will new MF involved in BP terms be added? Will older MF involved in BP terms be removed over time?

I ask this because some terms have many annotations. Keeping these terms will result in the accumulation of more annotations, which makes it more difficult to fix later.

Basically, I just need to know: should we ever be using MF involved in BP terms in modern annotation? I guess that makes this a "documentation" issue?

ValWood commented 1 year ago

This is in progress on an ad hoc basis, but there is no specific action in this ticket (except for "transporter involved in BP" terms, which were done eons ago).