geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

General pattern needed for compound functions #16972

Closed pgaudet closed 3 years ago

pgaudet commented 5 years ago

From @dosumis on December 12, 2016 17:25

We have many compound functions in GO. Sometimes this is reflected in multiple axes of classification. For example: 'ATPase activity, coupled to transmembrane movement of substances' is classified both under transmembrane transporter activity and under 'ATPase activity'. In other cases one component of the compound function is used for classification while another has a has_part relationship to the compound function.

For automated classification, using multiple has_part relationships would work well. has_part also works well for LEGO templates, as it exposes the individual components so that regulation edges can be linked directly to them. Unfortunately has_part is useless for grouping of annotations (although see this proposal: ). Also, going over to entirely using has_part would break some of the existing hierarchy in places where it feels intuitively right. For example, receptor tyrosine kinases are classified as kinases as well as receptors.

Is there a logical way we can get around this conundrum?

Would this GCI be crazy?

'molecular function that has_component some X' SubClassOf X

?

Could be added programmatically for MFs that are components of other MFs. has_component potentially less damaging as it is non-transitive.

CC @cmungall

Copied from original issue: geneontology/molecular_function_refactoring#25

pgaudet commented 5 years ago

From @cmungall on December 14, 2016 3:27

I'm not loving it...

I'll look at the proposal for annotation grouping later (offline right now).

pgaudet commented 5 years ago

From @dosumis on December 14, 2016 10:35

Not too surprised. One big negative is that it screws with our ability to add disjoints.

pgaudet commented 5 years ago

From @dosumis on May 11, 2017 18:3

A new general pattern suggestion

(The proposal at the top of this ticket should be ignored. It's dumb.)

Background:

MF design patterns need to cope with the compound nature of many molecular functions while (a) keeping classification that is intuitive to biologists; (b) supporting the curation of LEGO (GO-CAM) models with unbroken chains of causal relations; (c) being easy for LEGO (GO-CAM) curators to use; (d) keeping LEGO (GO-CAM) models as simple as possible: curators should be able to choose whether or not to include subfunctions with as little loss of information as possible.

Earlier attempts at defining compound functions used has_part (or some subProperty of it) for all components of a compound function, including the effector function. This approach is particularly bad at supporting unbroken causal chains (see this comment for an illustration: https://github.com/geneontology/molecular_function_refactoring/issues/31#issuecomment-282778313). These attempts also suffered from somewhat unintuitive classification. Biologists typically expect classification under an effector function: an RTK is_a kinase, PKA is_a kinase (not a transducer with parts cAMP sensor with kinase activity), an ATPase-coupled K+ transporter is_a K+ transporter, and a transcription factor is a transcription regulator.

Proposal

(In reading this proposal, please bear in mind that all inverses are assumed to be automatically inferred)

With this in place, we still get continuous chains of causal relations (and resulting inference) whether the regulatory edge points to the effector function or a subfunction that is causally related to it. We can also eliminate additional regulatory edges internal to compound functions in GO-CAM.

Any additional classification under Paul's new upper level classes (e.g. molecular transducer) should be inferable based on these design patterns.

Sketch:

"molecular transducer activity" EquivalentTo: molecular_function that has_regulatory_component some molecular_function ?

"phosphorylation sensing molecular transducer activity": EquivalentTo: molecular_function that has_regulatory_component some 'phosphorylation sensor activity'

(whether we want such high level classes is something we can decide separately, this just illustrates how we could get inference to them).

A sketch of potential object properties:

image

Some examples of how this could work (ontology design patterns):

image

image

image

We might want an explicit has_component_function relation, with domain and range restricted to MFs.

We could flesh this out with more relations (has_energy_source ? e.g. for ATPase-coupled transporter example?)

GO-CAM examples:

TF activity - modified from one of Astrid's test models: Note that we get complete regulatory chains whether going via the DNA binding component or the effector (genus).

image

(All inverses are inferred in GO-CAM models, so we could flip has_necessary_component -> necessary_component_of in the opposite direction if that is clearer)

Inferred classic GO annotations:

Can we also get these ? (depends on #49)

Inferred upper level MFs:

(* Assuming we want this class.)

Use of these patterns would, of course, be eased by the definition of GO-CAM template patterns to go along with the ontology design patterns. In this case a design pattern would drive a table something like this allowing input of DP components:

| DNA binding | transcription type | regulatory effect |
| --- | --- | --- |
| { Sequence specific DNA binding } | { transcription, DNA templated } | { directly_regulates } |
| RNA pol II regulatory region sequence specific DNA binding | transcription from RNA pol II promoter | directly_positively_regulates |

First row shows range class/relation; second row shows example fillers.

Possible extensions:

There has been some discussion of how we might represent logic gates in LEGO. This may be beyond the expressiveness of OWL, but we could, at some point in the future, add support for internal component nodes representing logic gates that sit between the regulatory component and the effector function.

@cmungall @thomaspd @ukemi Comments please.

pgaudet commented 5 years ago

From @cmungall on May 12, 2017 2:59

Add to agenda for tomorrow?

pgaudet commented 5 years ago

From @ukemi on May 12, 2017 14:19

@vanaukenk

pgaudet commented 5 years ago

From @dosumis on May 24, 2017 14:6

In absence of objections, I'm starting implementation.

pgaudet commented 5 years ago

Whatever decision we make can then be applied to

pgaudet commented 5 years ago

Hello,

I'd like to discuss how to handle compound functions. See proposal here: http://noctua.geneontology.org/workbench/annpreview/?model_id=gomodel:5c4605cc00001715

Essentially the design pattern would be:

(for example:

What we need is for 'has part' to be transitive for MF (annotations using 'enables' as a relation), but not for any other relations.

Is this possible? Does anyone see any problem with this suggestion? (It is essentially the same as the one DOS made; the difference is that removing the 'regulates o has part' inference now makes it possible.)

Thanks, Pascale

ValWood commented 5 years ago

~Hi Pascale. What do you mean here: "What we need is for 'has part' to be transitive for MF (annotations using 'enables' as a relation), but not for any other relations"? Do you mean as a "qualifier"? I don't think you can have transitivity behave differently depending on the qualifier; that seems like a no-no. So can we document the issues as we work through this, so we can figure out what this would mean? Current problem: if you say something regulates GO:0042626 'ATPase activity, coupled to transmembrane movement of substances', it becomes annotated to "regulation of ATPase", and this might not be correct. I.e. "activity x directly_regulates GO:0042626 ATPase activity, coupled to transmembrane movement of substances" would also mean "activity x directly_regulates GO:0016887 ATPase activity". In this case, why not just make the annotation to the parent which is true (i.e. use the parent term in the transport branch)? Problem with the change: currently, GO:0042626 'ATPase activity, coupled to transmembrane movement of substances' is_a ATPase. With GO:0042626 'ATPase activity, coupled to transmembrane movement of substances' has_part ATPase, it would no longer be annotated to ATPase. I would expect all "ATPase activity, coupled to transmembrane movement of substances" to be annotated to "ATPase activity".~

I think I get it now

cmungall commented 5 years ago

What we need is for 'has part' to be transitive for MF (annotations using 'enables' as a relation), but not for any other relations.

has part is always transitive, what we want here is for gene products to propagate from MFs to their parts. i.e. enables o has_part -> enables

yes, this is easily done. Do we have a set of competency questions for this?
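The chain enables o has_part -> enables can be sketched in a few lines of Python. This is only an illustration of the rule, not how the GO pipeline implements it: `HAS_PART`, `has_part_closure` and `propagate_enables` are hypothetical names, the single has_part edge is hand-written for the example, and a real run would use an OWL reasoner over the ontology files.

```python
# Hypothetical has_part edges between molecular functions (illustrative only).
HAS_PART = {
    # neuropeptide receptor activity has_part neuropeptide binding
    "GO:0008188": {"GO:0042923"},
}


def has_part_closure(term, has_part):
    """Return every term reachable from `term` over has_part (has_part is transitive)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for part in has_part.get(current, ()):
            if part not in seen:
                seen.add(part)
                stack.append(part)
    return seen


def propagate_enables(annotations, has_part):
    """Apply enables o has_part -> enables.

    `annotations` is an iterable of (gene_product, relation, term) triples.
    Only 'enables' annotations are propagated; other relations are kept as-is.
    """
    annotations = list(annotations)
    inferred = set(annotations)
    for gene_product, relation, term in annotations:
        if relation != "enables":
            continue
        for part in has_part_closure(term, has_part):
            inferred.add((gene_product, "enables", part))
    return inferred


asserted = [
    ("aex-2", "enables", "GO:0008188"),
    ("geneX", "part_of", "GO:0008188"),  # hypothetical non-enables annotation
]
for triple in sorted(propagate_enables(asserted, HAS_PART)):
    print(triple)
```

Because the rule is keyed on the `enables` relation, annotations made with any other relation gain nothing from has_part, which is the behaviour asked for above.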

pgaudet commented 5 years ago

@cmungall The behavior we want is:

MF + ENABLES:

MF + REGULATION

CC or BP + (any relation)

Does that answer your question?

Thanks, Pascale

dosumis commented 5 years ago

This is quite close to my proposal, the distinctions being:

  1. I had has_part for both components. I think it's probably fine to choose the effector as genus though, but it would be interesting to see a set of cases.
  2. I dealt with regulatory chains inside components by having special has_part relations for component MFs that control the activity of other component MFs. These sit under both has_part and regulates (see above).

In the case of ATPase-coupled transporters, regulating ATPase activity regulates the transporter activity but not vice versa. This would allow a GO-CAM model in which the ATPase component had its own node and regulatory edges coming to it => inference of regulation of transporter activity.

More detail than I can now remember and probably more than you can be bothered to read can be found here.

https://docs.google.com/document/d/18VQaVHvTmNvJuRczlnmJnjLtd2WxvfGafXoU1LFEQ2c/edit#heading=h.hmy4b7ry6jqu

ukemi commented 5 years ago

We also need to think about this in the context of the Rhea alignment. If we take the definition of RHEA:13065 and reason over it as the reasoning currently stands, any compound function that has ATP + H2O as inputs and ADP + H(+) + phosphate as outputs will be classified as an ATPase. This will violate the has_part assertion proposed above. We would need to define a logic that results in the has_part relationship. I think it is doable, but needs to be considered.
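To make the concern concrete, here is a minimal sketch of classification by reaction participants. It uses plain set containment as a stand-in for OWL reasoning, which is an assumption; the `Function` dataclass, the `is_classified_under` helper and the transporter's extra participants are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:
    name: str
    inputs: frozenset
    outputs: frozenset


# Participants of RHEA:13065 as given in the comment above.
ATPASE = Function(
    "ATPase activity",
    frozenset({"ATP", "H2O"}),
    frozenset({"ADP", "H+", "phosphate"}),
)

# Hypothetical compound function: ATP hydrolysis plus movement of a substrate.
TRANSPORTER = Function(
    "ATPase-coupled transporter activity (hypothetical)",
    frozenset({"ATP", "H2O", "substrate (side 1)"}),
    frozenset({"ADP", "H+", "phosphate", "substrate (side 2)"}),
)


def is_classified_under(candidate, parent):
    """True if `candidate` has every input and every output that defines `parent`."""
    return parent.inputs <= candidate.inputs and parent.outputs <= candidate.outputs


print(is_classified_under(TRANSPORTER, ATPASE))
```

Under this kind of definition the transporter is classified as an ATPase, which is the is_a classification that a has_part-only pattern would conflict with.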

ukemi commented 5 years ago

Need to consider any proposal with respect to: https://github.com/geneontology/go-ontology/projects/10

pgaudet commented 5 years ago

Hello,

Can we use the relations 'has primary input' and 'has primary output' to make that distinction ? For example: 'amide-transporting ATPase activity' could be defined as

Could that work ??

Thanks, Pascale

cmungall commented 5 years ago

I'm not sure this is necessary. I think we can keep it simple.

David H, I don't think there is any concern if the reaction is connected via has-part (input and output would not propagate over has-part).

I would need to see a broad range of proposed CFs (compound functions) to be completely convinced though (and to see the difference with David OS's original proposal). Does such a list exist?

pgaudet commented 5 years ago

Some examples: (note that these are not necessarily logical definitions; I am not sure all are necessary and sufficient)

- cAMP-dependent protein kinase activity 'protein serine/threonine kinase activity' and ('has positive regulatory component activity' some 'cAMP binding') -> would be changed to 'protein serine/threonine kinase activity' and ('has part' some 'cAMP binding')

- signaling receptor activity 'molecular transducer activity' and 'has part' some 'enzyme regulator activity' and 'has part' some 'receptor ligand binding' (?? this term would have to be created)

- DNA binding transcription factor activity 'transcription regulator activity' and 'has part' some 'DNA binding'

- nuclear receptor activity 'ligand-activated transcription factor activity' and 'has part' some DNA binding and 'has part' some 'receptor ligand binding' (again, this term would have to be created)

This is also based on @dosumis 's examples here https://github.com/geneontology/molecular_function_refactoring/issues

Do these look OK ?

cmungall commented 5 years ago

Haven't had time to think through these but they seem fine. It looks like @dosumis scheme would buy us additional inference over regulates?

pgaudet commented 5 years ago

It looks like @dosumis scheme would buy us additional inference over regulates?

I don't think so - it's just that since the rules were different, he had to create extra relations (I hope I understood your question).

pgaudet commented 5 years ago

In other words, I think @dosumis would have done something very similar to what I proposed if the inference rules had been different at the time.

dosumis commented 5 years ago

In other words, I think @dosumis would have done something very similar to what I proposed if the inference rules had been different at the time.

cAMP-dependent protein kinase activity 'protein serine/threonine kinase activity' and ('has positive regulatory component activity' some 'cAMP binding') -> would be changed to 'protein serine/threonine kinase activity' and ('has part' some 'cAMP binding')

Using 'has positive regulatory component activity' allows (+vely) regulates inference to propagate from 'cAMP binding' to the protein kinase activity. It would not be safe to allow any similar regulation of inference over has_part. What if the 'cAMP binding' inhibits protein kinase activity or has some other purpose? For the same reason, using has_part here also doesn't => a necessary and sufficient definition (it would apply to a cAMP inhibited protein kinase activity too).

With my proposal, this GOCAM model:

(MF X enabled_by Y)-[positively_regulates]->(cAMP binding enabled_by Z)<-[has positive regulatory component activity]-(protein kinase activity enabled_by Z)

=> inferences:

Z enables cAMP-dependent protein kinase activity

Y enables_positive_regulation_of* cAMP-dependent protein kinase activity

(* Think I've got the relationship name wrong here, but hopefully you get the idea)

Note - none of this needs additional rules as inference depends on property hierarchy: 'has positive regulatory component activity' sits under 'has_part' and 'positively regulated by'
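A rough sketch of why no extra rules are needed: if every asserted edge also holds under each of its super-properties, then a single 'has positive regulatory component activity' edge yields both a has_part edge and a 'positively regulated by' edge. The dictionary below encodes only the hierarchy stated in this comment; `super_properties` and `entailed_edges` are hypothetical helper names, and a real reasoner would work over the full RO property hierarchy.

```python
# Sub-property -> direct super-properties, as described in the comment above.
SUPER_PROPERTIES = {
    "has positive regulatory component activity": {
        "has_part",
        "positively regulated by",
    },
}


def super_properties(prop, hierarchy):
    """Return all direct and indirect super-properties of `prop`."""
    found = set()
    stack = [prop]
    while stack:
        for parent in hierarchy.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def entailed_edges(edges, hierarchy):
    """For each (subject, property, object) edge, add the same edge under every super-property."""
    edges = list(edges)
    entailed = set(edges)
    for subject, prop, obj in edges:
        for parent in super_properties(prop, hierarchy):
            entailed.add((subject, parent, obj))
    return entailed


edges = [
    (
        "protein kinase activity",
        "has positive regulatory component activity",
        "cAMP binding",
    )
]
for edge in sorted(entailed_edges(edges, SUPER_PROPERTIES)):
    print(edge)
```

A plain has_part edge would produce no 'positively regulated by' edge here, which matches the point that regulation should not be inferred over has_part alone.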

pgaudet commented 5 years ago

@dougli1sqrd has run the inferences that would be created by the axiom enables o has_part -> enables

http://skyhook.berkeleybop.org/special_inferences/products/annotations/

The inferences look ok to me - but I would like for @vanaukenk @ukemi and @ValWood to have a look.

Questions for editors call:

  1. Chris said that has part should not be used because it is reciprocal with part of, and therefore inference rules for part_of would apply (please correct or improve explanation!). I am OK switching to has necessary component activity and has regulatory component activity - is this what we want to do?
  2. Regardless of whether we use has part or has necessary component activity and has regulatory component activity, we would like these relations to be shown by default to users; what are the options? In browsers that can be done, but what about for files we provide? Do we need to change the relations in go.obo and go-basic.obo (and can we do that)?

Thanks, Pascale

ukemi commented 5 years ago

We should check the files for each of our respective resources.

vanaukenk commented 5 years ago

@dougli1sqrd @pgaudet

I'm looking at the wb.inferred.gaf and just have a quick question about the lines of inferred annotation I see.

Using the gene aex-2 (WBGene00000085) as an example:

aex-2 has an IC annotation to 'neuropeptide receptor activity' (GO:0008188).

In the wb.inferred.gaf, I see two additional annotations to 'peptide binding' (GO:0042277), but they look to be the same:

WB WBGene00000085 aex-2 GO:0042277 PMID:23583549|WB_REF:WBPaper00042242 IC GO:0007218 F T14B1.2 gene taxon:6239 20140818 WB

WB WBGene00000085 aex-2 GO:0042277 PMID:23583549|WB_REF:WBPaper00042242 IC GO:0007218 F T14B1.2 gene taxon:6239 20140818 WB

In the wb-inferences.log file I see two different inferences for 'neuropeptide receptor activity' (GO:0008188), one for 'peptide binding' and the other for 'neuropeptide binding':

For GO:0008188 "neuropeptide receptor activity": Has Parent --> GO:0008188 "neuropeptide receptor activity" has_part --> GO:0042923 "neuropeptide binding"

For GO:0008188 "neuropeptide receptor activity": Has Parent --> GO:0001653 "peptide receptor activity" has_part --> GO:0042277 "peptide binding"

So, I'm not seeing the 'neuropeptide binding' annotation in the wb gaf.

Is that expected?

dougli1sqrd commented 5 years ago

Oh, you're saying you would expect one to be with GO:0042923, and one with GO:0042277? But instead we see just two with the same GO ID. That looks like it could be a bug. I'll look into it.

vanaukenk commented 5 years ago

Hi @dougli1sqrd

Yes, exactly; I'd expect to see one annotation for each has_part GO term. In this case that would be one to GO:0042923 and one to GO:0042277.

Thx.

ukemi commented 5 years ago

Do we really want to see one with each has_part, or just with the most specific has_part? It looks like in the inferences log, the inferences are generated up the subclass hierarchy. Don't we just want the asserted annotations and then the additional annotation/s to the most specific inference based on the specified has_part?
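One way to read the "most specific inference" option is to drop any inferred term that is an is_a ancestor of another inferred term. A minimal sketch, assuming 'neuropeptide binding' is_a 'peptide binding' (an assumption, not stated in the inference log above) and using hypothetical helper names `ancestors` and `most_specific`:

```python
# is_a edges between the binding terms; assumed here for illustration.
IS_A = {
    # neuropeptide binding is_a peptide binding (assumption)
    "GO:0042923": {"GO:0042277"},
}


def ancestors(term, is_a):
    """Return all is_a ancestors of `term`, excluding the term itself."""
    found = set()
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def most_specific(terms, is_a):
    """Drop any term that is an ancestor of another term in the set."""
    terms = set(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, is_a) & terms
    return terms - redundant


# The two has_part targets inferred for GO:0008188 in the log above.
inferred = {"GO:0042923", "GO:0042277"}
print(most_specific(inferred, IS_A))
```

With this filter only the GO:0042923 annotation would be written out; the GO:0042277 annotation would still be recoverable by ordinary is_a propagation.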

pgaudet commented 5 years ago

Discussing this with @ValWood
'has component' definition: w 'has component' p if w 'has part' p and w is such that it can be directly disassembled into n parts p, p2, p3, ..., pn, where these parts are of similar type.

So, this cannot apply to compound MFs, since these cannot be disassembled or uncoupled.

pgaudet commented 5 years ago

@ValWood Inferences look OK. This proposal works, i.e. generating the annotations over the has_part relation. If we don't propagate, we have the same problem as before in that we need to co-annotate.

pgaudet commented 5 years ago

Looking at examples:

cAMP-dependent PK activity (gene1,gene2) [gene1 is the PK] [gene2 binds cAMP]
is_a PK -> A, B has_part cAMP binding

gene 1: enables cAMP-dependent PK activity -> ATP binding

gene 2: contributes cAMP-dependent PK activity (or PK regulatory activity)

The ontology is incorrect here:

cAMP-dependent protein kinase activity 'protein serine/threonine kinase activity' and ('has positive regulatory component activity' some 'cAMP binding')

PKR cAMP-dependent protein kinase inhibitor activity Should have 'protein serine/threonine kinase inhibitor activity' and 'has part' some 'PKA binding' (PKR binding to cAMP should be a separate annotation?)

image

======== DNA binding Tx factor activity

Dimer A, B (B does not bind DNA)

Pascale

ukemi commented 5 years ago

My main concern is that it seems like a bit of a kluge in that we need to restrict the inference computation on MFs and it breaks down with BPs. This will be even worse when we make MF a BP. Would it be better to create new relations similar to the 'has component' ones (these need to be tidied up in RO)?

has_integral_process -- P1 has_integral_process P2 iff P1 and P2 are enabled_by C1 and P2 is an occurent_part_of P1.
is_a occurent_part_of
inverse_of is_integral_process_of

is_integral_process_of -- P1 is_integral_process_of P2 iff P1 and P2 are enabled_by C1 and P1 is an occurent_part_of P2.

ValWood commented 5 years ago

Different relations might be OK but if so they could not be a child of has_part (has_component is)

It isn't clear to me what breaks down with process. The inference on MF only uses "enabled_by" and this is not in BP so the inference would not be made. In this respect they are different.

ukemi commented 5 years ago

Enabled_by is a good point. We need to check to be sure that there are no chains that would create annotations to, for example, processes that have has_part relationships to functions or other processes.

cmungall commented 5 years ago

It isn't clear to me what breaks down with process. The inference on MF only uses "enabled_by" and this is not in BP so the inference would not be made. In this respect they are different.

Exactly.

And to me this is quite intuitive too. If I am carrying out a whole task by myself (putting up a bookshelf), and I conceive of that task as being split into parts, I am carrying out those tasks too. But if I am carrying out a task that is part of a larger whole (e.g. moving house), with other tasks carried out by others, there is no inference about my relationship to those other tasks.

ukemi commented 5 years ago

That does make intuitive sense, but we need to be careful that we haven't overlooked anything. The test gafs should help with that.

ukemi commented 5 years ago

Actually I think we are good. When a function is part of a process we don't resolve to enables for the enabler. Somewhere in the back of my mind I think Jim and I looked at this, but I had forgotten.

pgaudet commented 5 years ago

Other considerations (via email with @krchristie ):

Hi Pascale,

It was some time ago, and I don't think it was just Val, but my recollection is that she was one person who was concerned about these types of things.

Propagating annotations across has_part is clearly fine when the object annotated to the complex function is a gene product that acts on its own and thus that single gene product possesses ALL of the various individual functions that the complex function has a has_part relationship to.

However, when the object that enables the complex function is a complex, it is not straightforward to propagate the individual functions that are parts of the complex function to individual gene products. There have been several examples mentioned over the years with this, including translation initiation complexes where only a single subunit contains the ATPase activity, numerous enzymes where only a single subunit possesses the catalytic activity, and DNA-binding transcription factor complexes. For example, if you have a DNA binding transcription factor complex that is made up of two subunits, only one of which had DNA binding activity, then it would be incorrect to propagate an annotation to "DNA binding" to both subunits via a relationship such as "DNA-binding transcription factor activity" has_part "DNA binding". When I was still at SGD, Val was definitely one of the people who did not want to see erroneous MF annotations to subunits that did not possess an activity that was actually possessed by some other subunit.

I should add that I no longer know what terms actually have has_part relationships since it's not simple to see them and it looks like some of the ones that used to be present have been removed. But if we still have has_part relationships on complex functions that are enabled by complexes, it seems that inappropriate propagation of the individual functions to all subunits would still be a problem.

-Karen

I think the examples @ValWood and I discussed in https://github.com/geneontology/go-ontology/issues/16972#issuecomment-509669915 address those points.

As far as I can tell we are good to go with this relation and inference chain. Let's discuss this once again at the next ontology call to make sure we're all on the same page.

Thanks, Pascale

dougli1sqrd commented 5 years ago

Hello @vanaukenk and others, I have updated the inference code and fixed the bug you pointed out. The files are in the same location as stated above: http://skyhook.berkeleybop.org/special_inferences/products/annotations/. Let me know if that works!

vanaukenk commented 3 years ago

Adding to the agenda for the next ontology editors call. We need to reconcile the current proposal with GO-CAM curation guidelines.

deustp01 commented 3 years ago

An assignment from the ontology discussion on November 23 was to identify gene products with compound functions.

Two examples come from metabolism:

- Fatty acid biosynthesis: a fatty acid chain is elongated 2 carbons at a time in a sequence of five reactions. In bacteria and archaea, seven(?) different genes encode the enzymes that enable these reactions; in eukaryotes, one gene encodes a 7(?)-domain enzyme that does it, one domain for each reaction. The eukaryotic gene product has a clever feature that efficiently moves the substrate from one active site to the next (PMID: 12689621).
- UMP biosynthesis: a six-step process. The first three steps are catalyzed by active sites on CAD gene product and the last two by active sites on UMPS gene product (PMID: 6105839).

In the other direction, ABC transporters can be seen as both moving a small molecule across a membrane and also hydrolyzing ATP to ADP + phosphate, but in fact the ATP hydrolysis drives a conformational change in the transporter that enables it to trap a substrate on one side of the membrane and release it on the other - there might be mutant gene products that only hydrolyze ATP, but the physiological function of the transporter involves both transport and coupled hydrolysis (see Alberts et al. "Molecular Biology of the Cell", 5th edition, Figures 11-14 and 11-15). So this looks like a single complex reaction mechanism, and not a compound function.

pgaudet commented 3 years ago

Decisions on ontology call: http://wiki.geneontology.org/index.php/Ontology_meeting_2021-02-01#.28Compound.29_Molecular_Functions_--_BLOCKING_FOR_GO-CAM

Molecules with ATPase activity
    RNA helicase (not DEAD-box, they seem to be different)
        => Resolved: is_a ATPase, being implemented, including for transporters (ABC and P-types)
        OK 
DNA-binding transcription factors
    p53/p21
        => Resolved: has_part 'transcription regulatory region sequence-specific DNA binding', DONE, see Ticket 16214
        OK 
Signaling adaptors and sequestering activities
    ced-4 activity in core apoptotic program
        => Sequestering activity: Resolved: protein sequestering activity: is_a 'molecular sequestering activity' and ('has part' some 'protein binding')
        => Signaling adaptor activity: Proposed: is_a 'protein-macromolecule adaptor activity' and (has_part some 'protein binding': REDUNDANT) and 'part of' some 'BP signaling' (not yet implemented)
        OK 
Receptors
    TM receptor S/T-PK activity, example: BMP receptor activity
        => Proposal: transmembrane receptor protein serine/threonine kinase activity: is_a 'signaling receptor activity' and has part some 'protein serine-threonine kinase activity' and has_part receptor ligand binding
        TO DO: Check with Ruth
    G protein-coupled receptors (coupled to inhibitory G proteins (Gi))
        => Proposal: 'G protein-coupled receptor activity' is_a 'signaling receptor activity' and other MF = has_part
    Different ways of modeling ire-1, an unfolded protein receptor with two other activities
        => To complete - Kimberly: sorted on GO-CAM call: we would create a new term with all 3 activities

We'll start implementing that and open new tickets if needed.

ValWood commented 3 years ago

RNA helicase (not DEAD-box, they seem to be different)

Why do DEAD box seem to be different? I don't have an example of any RNA helicase (including DEAD box) that isn't currently annotated as an ATPase. In fact, I don't have any helicase annotated that isn't an ATPase. That might be incorrect, but an example of a non-ATPase helicase would be a good starting point to check.