geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Represent enzymatic complex enablers according to GO-CAM spec #302

Open thomaspd opened 1 year ago

thomaspd commented 1 year ago

For Reactome complexes that are controllers of enzymatic reactions, these should be aligned with GO-CAM to specify gene product IDs and not PRO IDs. To do this, we will need to handle two different cases:

1) If Reactome specifies the catalytic subunit, the enabler of the reaction should be the catalytic subunit and not the complex. The rest of the complex should be ignored for now.

2) If Reactome does not specify the catalytic subunit, the enabler of the reaction should be a GO protein-containing complex instance (GO:0032991) with has_part links to each protein subunit. Small molecule components of complexes should be ignored for now. So if a Reactome complex is composed of just one gene product and one or more small molecules, then it should be treated the same as case 1 above and connect the activity directly to the gene product without a protein-containing complex individual.
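The two cases can be sketched as a small decision function. This is only an illustration of the rule as stated in this comment, not the pathways2GO implementation: the `Component` and `Enabler` types, the `choose_enabler` name, and the assumption that each gene product appears once in the component list are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GO_PROTEIN_CONTAINING_COMPLEX = "GO:0032991"


@dataclass
class Component:
    id: str                # gene product or chemical identifier (placeholder values)
    is_gene_product: bool  # False for small molecules, which are ignored for now


@dataclass
class Enabler:
    id: str                                            # gene product ID, or GO:0032991
    has_part: List[str] = field(default_factory=list)  # only filled for complexes


def choose_enabler(components: List[Component],
                   active_unit: Optional[str] = None) -> Optional[Enabler]:
    """Pick the enabler for a reaction controlled by a complex."""
    # Case 1: Reactome names the catalytic subunit, so it is the enabler.
    if active_unit is not None:
        return Enabler(active_unit)

    # Case 2: no catalytic subunit given; drop small molecules first.
    gene_products = [c.id for c in components if c.is_gene_product]
    if not gene_products:
        return None  # nothing to enable the activity; left to the caller
    if len(gene_products) == 1:
        # One gene product plus small molecules: treat like case 1.
        return Enabler(gene_products[0])
    # Several protein subunits: a GO:0032991 instance with has_part links.
    return Enabler(GO_PROTEIN_CONTAINING_COMPLEX, has_part=gene_products)
```

How repeated copies of the same gene product (for example a homodimer) should be handled is not settled by the text above, so the sketch leaves that open.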

deustp01 commented 1 year ago

Yes! We have treated case 2 as obviously true for a long time, but I'm not sure the code to enable it has ever been implemented in the GO-CAM conversion process. And when it is implemented, it should also generate a report of all the catalystActivity instances that it fixed in this way, to be fed back to Reactome to patch the Reactome annotations that are the single source of truth here. @dustine32 ?

dustine32 commented 1 year ago

@deustp01 Yes, we can log out the number 2 cases where the resulting "complex" only has a single has_part GP.

@thomaspd If you can find an example complex with the active unit specified that would help. I'll keep looking too.
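A minimal sketch of what that log could look like, assuming the converter can hand over one record per Catalysis with the complex and its gene product subunits (small molecules already removed). The record shape, column names, and function name are hypothetical; the real report format would need to be agreed with Reactome.

```python
import csv
import io
from typing import Iterable, List, Tuple

# (catalysis ID, complex ID, gene product IDs left after dropping small molecules)
Row = Tuple[str, str, List[str]]


def single_subunit_report(rows: Iterable[Row]) -> str:
    """Return a TSV listing case-2 complexes that reduce to a single gene product."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["catalysis_id", "complex_id", "gene_product"])
    for catalysis_id, complex_id, gene_products in rows:
        if len(gene_products) == 1:
            writer.writerow([catalysis_id, complex_id, gene_products[0]])
    return out.getvalue()
```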

dustine32 commented 1 year ago

Also, asking @huaiyumi for any examples of active subunit annotation in Reactome that I can look for in the BioPAX.

deustp01 commented 1 year ago

There are 1784 catalystActivity instances in our central database whose activeUnit slot is not null. Let me figure out who to ask here to get you a table of the subset of these instances that has actually been released. We should be able to generate a table of the dbID of each instance, its physicalEntity (the complex), its activeUnit (the individual EWAS gene product), and the dbID and name of the reaction in which it occurs. Are there other attributes you'd want in the table?

deustp01 commented 1 year ago

@dustine32 but meanwhile, here is a short arbitrary list of catalystActivity instances whose physicalEntity is a heteromeric complex and whose activeUnit is a protein monomer, as a starting point to begin to explore the BioPAX to see what can be done on point 2, above, and what a useful format would be for bulk processing.

https://reactome.org/content/schema/instance/browser/1806156
https://reactome.org/content/schema/instance/browser/5676051
https://reactome.org/content/schema/instance/browser/6798176
https://reactome.org/content/schema/instance/browser/1806283
https://reactome.org/content/schema/instance/browser/8868073
https://reactome.org/content/schema/instance/browser/5358378
https://reactome.org/content/schema/instance/browser/109879
https://reactome.org/content/schema/instance/browser/9836928

Each URL points to a page that lists the names and dbIDs of the heteromeric complex, the protein monomer activeUnit, and the reaction that the catalystActivity mediates.

I can also make a list of sample catalystActivity instances where the physicalEntity is a set of heteromeric complexes and the activeUnit is a set of monomers or a set of subcomplexes, and of cases where the heteromeric complex involves both protein and non-protein (RNA, DNA, or small-molecule) subunits, if any of those are of interest.

I hope, from this test material, we can figure out what you need in a comprehensive list.

ukemi commented 1 year ago

@dustine32 Does the catalyst activity here help? R-HSA-21271

dustine32 commented 1 year ago

@deustp01 @ukemi Thank you for these examples! I don't really need the full list of all activeUnits as these few helped me find where in the BioPAX I can expect to find them. An example for reaction R-HSA-1675883:

  <bp:Catalysis rdf:ID="Catalysis1698">
    <bp:controller rdf:resource="#Complex3671" />
    <bp:controlled rdf:resource="#BiochemicalReaction3397" />
    <bp:controlType rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ACTIVATION</bp:controlType>
    <bp:xref rdf:resource="#RelationshipXref3080" />
    <bp:xref rdf:resource="#RelationshipXref3090" />
    <bp:dataSource rdf:resource="#Provenance1" />
    <bp:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">activeUnit: #Protein9680</bp:comment>
  </bp:Catalysis>

Here, the activeUnit, which eventually points to PI4KB [Golgi membrane] (Homo sapiens) for complex ARF1/3:GTP:PI4KB, is embedded in a comment field. Not the greatest feeling about this placement but it'll definitely do for now!
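For illustration, here is one way the activeUnit could be pulled out of that comment with Python's standard library. It is a sketch under assumptions: the namespace URIs are the standard BioPAX Level 3 and RDF ones (the snippet above does not show its declarations), each comment is assumed to carry at most one activeUnit reference, and the function name is made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed namespaces: standard BioPAX Level 3 and RDF URIs.
BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Matches comment text such as "activeUnit: #Protein9680".
ACTIVE_UNIT = re.compile(r"^activeUnit:\s*#?(\S+)$")


def active_unit_of(catalysis: ET.Element) -> Optional[str]:
    """Return the local ID named in an activeUnit comment, or None if absent."""
    for comment in catalysis.findall(f"{{{BP}}}comment"):
        match = ACTIVE_UNIT.match((comment.text or "").strip())
        if match:
            return match.group(1)
    return None


# Trimmed copy of the Catalysis element above, wrapped so it parses on its own.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:Catalysis rdf:ID="Catalysis1698">
    <bp:controller rdf:resource="#Complex3671" />
    <bp:controlled rdf:resource="#BiochemicalReaction3397" />
    <bp:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">activeUnit: #Protein9680</bp:comment>
  </bp:Catalysis>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)
for catalysis in root.iter(f"{{{BP}}}Catalysis"):
    print(catalysis.get(f"{{{RDF}}}ID"), active_unit_of(catalysis))
```

The returned ID would still need to be resolved to the Protein element and from there to a gene product identifier.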

deustp01 commented 1 year ago

is embedded in a comment field

If I understand what you're saying correctly, yes, if you look at the instancebrowser view for the EWAS PI4KB [Golgi membrane] (Homo sapiens) its role as the activeUnit of a complex involved in catalysis is shown as a comment. But if you come at the annotation from the other direction - start with the catalystActivity instance 1-phosphatidylinositol 4-kinase activity of ARF1/3:GTP:PI4KB [Golgi membrane] then its role is shown as an attribute. Or am I misunderstanding the problem?

Also, it makes sense to me to work starting from reactions that have associated catalystActivities, systematically looking at what the physicalEntity of each catalystActivity is, and if that physicalEntity is not an EWAS or set of EWASs, then proceed further to see if it fits case 2 above.

Also also, if a by-product of this survey were a list of catalystActivities where the physicalEntity is a complex or set of complexes but the activeUnit slot is null, that list would be the starting point for re-curation to fill the empty slots. And if in each case the components of the complex could be checked in the central GO annotation file to see if any have been assigned the same GO molecular function as Reactome has assigned to the whole complex, that would make the re-curation process at Reactome much faster and more reliable. @ukemi I know we talked about something like this with Ben Good; I don't know how close he got to implementing it.

Or does this last part duplicate work you've already done to generate the tables described in #296 (which I haven't looked at yet)?
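The component check described above could look roughly like this. It is a hypothetical sketch: the annotation lookup is a plain dictionary standing in for the central GO annotation file, the identifiers in the usage note are placeholders, and the function names are invented for illustration.

```python
from typing import Dict, List, Set


def candidate_active_units(complex_function: str,
                           components: List[str],
                           annotations: Dict[str, Set[str]]) -> List[str]:
    """Components annotated with the same GO molecular function as the complex."""
    return [c for c in components
            if complex_function in annotations.get(c, set())]


def classify(candidates: List[str], components: List[str]) -> str:
    """Summarize how useful the match is for re-curation."""
    if not candidates:
        return "none"     # no subunit carries the function
    if len(candidates) == 1:
        return "unique"   # a single plausible activeUnit
    if len(candidates) == len(components):
        return "all"      # every subunit carries it, so the match is uninformative
    return "several"      # ambiguous; needs a curator's decision
```

For example, with placeholder data `annotations = {"GP:A": {"MF:kinase"}, "GP:B": set()}`, calling `candidate_active_units("MF:kinase", ["GP:A", "GP:B"], annotations)` yields `["GP:A"]`, which `classify` labels `"unique"`.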

deustp01 commented 1 year ago

And if in each case the components of the complex could be checked in the central GO annotation file to see if any have been assigned the same GO molecular function as Reactome has assigned to the whole complex, that would make the re-curation process at Reactome much faster and more reliable. @ukemi I know we talked about something like this with Ben Good; I don't know how close he got to implementing it.

That failed - in many cases the catalystActivity of the whole complex has been assigned to all of its protein components - but perhaps re-doing it with the cleaned-up fly set of complex component functions would yield good results.

deustp01 commented 1 year ago

Summarizing the discussion so far as a to-do list.

deustp01 commented 1 year ago

@deustp01 Yes, we can log out the number 2 cases where the resulting "complex" only has a single has_part GP.

@dustine32 @ukemi just to document the current state / need, here is an active unit in the first reaction of the "carnitine biosynthesis" pathway and in the derived GO-CAM. The physical entity is a complex involving one copy of one gene product and one copy each of a couple of chemical entities. Can the GO-CAM generation script be re-done to identify the gene product and make it the enabler?

[Screenshot 2023-11-14 at 5 47 50 PM]

Or if that is hard or dangerous, can we plan to generate a list (partial is OK to start) of the number 2 cases, that we can use to figure out how to bulk-edit the Reactome annotations to add the missing activeUnit annotations, so that the existing GO-CAM generation script can use them? (A practical issue here is whether David and I, as we (re)curate pathways, should add this information manually as part of our work, or leave it out because a script will soon be available to do it automatically.)

ukemi commented 2 months ago

@dustine32 I think this proposal is directly in line with what we had talked about on the call today. I think you have already done it for when there is a single GP as the enabler, but we should do it with the complexes too. For these cases, would it be possible to use the UniProt GCRP identifiers instead of a REACTO ID? We need to start weeding out the REACTO identifiers.

ukemi commented 38 minutes ago

Hi @dustine32. Note that part of the requirement in the second point of this post is to ignore small molecules.

If Reactome does not specify the catalytic subunit, the enabler of the reaction should be a GO protein-containing complex instance (GO:0032991) with has_part links to each protein subunit. Small molecule components of complexes should be ignored for now. So if a Reactome complex is composed of just one gene product and one or more small molecules, then it should be treated the same as case 1 above and connect the activity directly to the gene product without a protein-containing complex individual.

This is consistent with the discussion in #327.