Ensure reactome go-cams validate against main GO shex schema

goodb commented 5 years ago

Revisit and rework conversion rules or shex schema to bring them into alignment. Close when everything passes. Re-open individual tickets for future failures.

[x] Implement re-usable software framework for testing models that is integrated with this repository's travis system
[ ] Complete a clean validation run using this framework

deustp01 commented 4 years ago

In Ben's list of key problems, I don't understand the first one -

In the conversion, complexes are used as inputs, outputs, and objects of transport for Biological Process nodes.

@goodb could you point to an example to help me see what the problem is? The second one -

In the conversion, Molecular Function nodes are assigned causal (regulates, provides input for) relations to Biological Process nodes.

... could be arising because it is legal to manually assert that any event (reaction / function or pathway / process) precedes any other. We really shouldn't do this so, to the extent that that is the problem here, a clean-up is the way to go. But I don't see how to create an assertion that a reaction / function regulates a process so an example to help see where the problem is coming from would be useful.

But this comment on issue 1 and Kimberly's go in completely different directions - maybe I've misunderstood; sort out on call tomorrow?

goodb commented 4 years ago

For 1. @deustp01 here is an example from the Scavenging by Class H Receptors pathway. (Note that what you see here may be different and is a more recent version than what you see on noctua-dev.)

Screen Shot 2020-01-14 at 3 13 11 PM

Here, the reaction "STAB2:ligand is endocytosed" is typed by the converter as a Transport biological process. This does not cause any problems for the OWL reasoner, but introduces a number of violations of the schema that is being developed to define the overall rules for the structure of GO-CAM models. (Hence the red dot). The schema currently prevents complexes from being used as the objects of relations like has input, has output, and transports or maintains localization of when the subject is a biological process such as a transport.

For 2. The same pathway provides another example where a binding reaction has a causal relation to transport process. As things stand, we have a number of rules in the converter that infer regulatory relationships (beyond 'causally upstream of' which we get from the nextStep relations in the BioPAX). These generate the other kinds of relationships between functions and these process nodes.
Screen Shot 2020-01-14 at 3 19 16 PM

ukemi commented 4 years ago

I think an accurate representation of biology requires that complexes be allowed as transport objects. PMID:9852151, PMID:10545491, PMID:11102514.

deustp01 commented 4 years ago

Here, the reaction "STAB2:ligand is endocytosed" is typed by the converter as a Transport biological process.

Transport is correct but why is this Reactome event getting typed as a transport process rather than as a reaction mediated by a transporter function? Probably this is happening because the Reactome reaction "STAB2:ligand is endocytosed" is missing information. We could have assigned a molecular function like GO:0005044 scavenger receptor activity by overloading our catalystActivity class to allow GO cargo receptor activity terms (just as we could do to explicitly annotate binding reactions). Short-term fix - maybe none is needed (we can tolerate schema violations temporarily?) Long-term fix - this is part of a larger project that both Reactome and GO need to develop a reaction typology (binding vs catalysis vs transport vs etc.) and implement it for Reactome curation.

For second item on Ben's list, same cause and possible fix?

ukemi commented 4 years ago

I @goodb. I have added a third model to #75 for discussion.

goodb commented 4 years ago

Update. Following recent changes for #75 and #84 , I tried to run all the models through the shex/owl validation over the weekend.

On the down side, things have really slowed down. Only 1122 models completed the validation over the span of about 40 hours. It seems the expansion of the shex schema is hitting performance pretty harder (to do list item to try to improve that).

On the up side, we are getting close. Of those 1122 models, all passed the OWL test, and only 35 failed the shex validation.

Model R-HSA-189483 failed to validate because it uses an unknown GEE https://reactome.org/content/detail/R-CPE-9661737 - apparently from Clostridium.

Everything else that I looked at failed because there is a reaction with more than one enabling molecule.

Here is a list of all of the reactions with the multiple enabler problem (based on curator-version BioPAX from Jan 8, 2020): Multiple Enablers.txt

ukemi commented 4 years ago

Model R-HSA-189483 failed to validate because it uses an unknown GEE https://reactome.org/content/detail/R-CPE-9661737 - apparently from Clostridium.

Fascinating! I need to think about this one.

goodb commented 4 years ago

For shex, it would validate with the fix we've used elsewhere of xrefing it to chebi:protein. How the multi-species part of this goes, not sure.

ukemi commented 4 years ago

Is it the first step towards modeling microbiome interactions? Might be interesting for @cmungall and @vanaukenk. I think your shex solution is reasonable, it's the scope that I am pondering.

deustp01 commented 4 years ago

Hmm. What's going on here is that bilirubin glucuronides transported from hepatocytes into the bile, and into the gut lumen, are acted on sequentially by three bacterial enzymes, beta-glucuronidase (GUSB), biliverdin reductase (BILR), and UBGNR to form stercobilin.

We annotated C. perfringens GUSB (UniProt Q8VNV4 - a TrEMBL instance) as the enzymes mediating the first reaction. We couldn't identify specific gene products to assign as enzymes to mediate the second and third reactions so instead created instances of the genomeEncodedEntity class to annotate them, all on the strength of evidence from PMID:16504607.

Looking back, the evidence that these reactions happen in the gut is solid but I don't know how solid the evidence is that C. perfringens, better known as the causative agent of gas gangrene and a major causative agent in human food poisoning) is the main source of the enzymes that mediate them in healthy humans.

So there may be an annotation issue here. It's possible (to be discussed with curators) that we should retreat to asserting that unknown bacterial proteins catalyze these reactions, with no attempt to identify a bacterial species. Ben's suggestion of a chebi:protein xref looks like a good way out of the immediate shex issue (with an acceptably low amount of information loss, I think).

On the future of microbiome annotation, the odd annotation here points to a bigger problem, that genomic diversity within a bacterial species can be very large, so associating specific molecular functions with specific proteins will often be hard and even when it can be done, the result will often be a protein set (in the Reactome sense of set). Does this also sort of say something about scope?

ukemi commented 4 years ago

It says something very big about scope and how scope may differ. Is the GO scope of human annotation to annotate the function of human gene products, or is it to annotate any gene product used in a normal human physiological process? Same for Reactome....

ukemi commented 4 years ago

Note that annotating C. perfringes gene products is not out of scope for GO. Example: http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q8XLE8

deustp01 commented 4 years ago

Same for Reactome...

Reactome is inclusive: if it affects human biology, we're prepared to annotate it with the proviso (shout-out to Judy here) that we don't get credit against our annotation metrics for non-human proteins

deustp01 commented 4 years ago

Note that annotating C. perfringens gene products is not out of scope

Sure - handling symbiosis is the complicated bit, especially near the normal physiology : disease process boundary.

goodb commented 4 years ago

One thing @deustp01 's comment suggests to me is that it its going to be increasingly important to think about annotating at the gene family level rather than the individual gene level when that is the level of evidence. FWIW..

Here are the only entities that are currently completely unclassified and thus schema breakers: R-HBV-8982481 viral double-stranded RNA R-HCV-8982462 viral double-stranded RNA R-HSA-5357857 p-T357,S358-MLKL oligomer R-MMU-9670591 Igkc R-HSA-446069 Keratin 5/14 R-HSA-9660029 Poly-vimentin R-HSA-9660015 Poly-vimentin R-HSA-9657899 Poly-vimentin R-HIV-173644 Multimeric capsid coat R-HIV-175338 Multimeric matrix layer R-HSA-2514779 Cleaved fibrillin-2 R-CPE-9661737 UBGNR R-CPE-9661715 BILR R-HSA-9661701 UBGNO R-HSA-9663470 BGET R-ALL-9638756 RNA(n+1)

deustp01 commented 4 years ago

One thing @deustp01 's comment suggests to me is that it its going to be increasingly important to think about annotating at the gene family level rather than the individual gene level when that is the level of evidence. FWIW..

Immediate practical patch question - I haven't looked up any of these instances yet to see what's going on - some look like things I should have fixed in the xref cleanup late in the fall. Would adding a ChEBI:protein (or ChEBI:RNA) xref to each of these fix the schema-breaking problem?

Longer-term (version 2) technical issue: Reactome has a class "polymer" that up to now we have been able to evade in exports to GO-CAM. Poly-vimentin is an example. For version 1, I can xref it to ChEBI:protein, and we can discuss later how to handle a polymer instance whose monomers are a UniProt protein.

Ben's point about classes. As an editorial issue, we don't handle classes of macromolecules (e.g., "collagens", or "kinases") well because our goal is to annotate individual gene products. Our work-around to date has been the creation of sets, with defined and candidate set types as a way of handling groups with more or less evidence to support the grouping. The PRO ontology structure provides a way to identify protein families and to associate specific proteins with those families that may be a better approach to this problem. A version 2 discussion item, I think.

goodb commented 4 years ago

Yes, adding the CHEBI xrefs would solve the schema problem right now.

deustp01 commented 4 years ago

Here are the only entities that are currently completely unclassified and thus schema breakers:

All now fixed with ChEBI cross-references: R-HBV-8982481 viral double-stranded RNA ChEBI 67208 dsRNA R-HCV-8982462 viral double-stranded RNA ChEBI 67208 dsRNA R-HSA-5357857 p-T357,S358-MLKL oligomer ChEBI 36080 protein R-MMU-9670591 Igkc curation mistake R-HSA-446069 Keratin 5/14 ChEBI 36080 protein R-HSA-9660029 Poly-vimentin ChEBI 36080 protein R-HSA-9660015 Poly-vimentin ChEBI 36080 protein R-HSA-9657899 Poly-vimentin ChEBI 36080 protein R-HIV-9657899 Multimeric capsid coat wrong ID in GO-CAM? R-HIV-173644 Multimeric capsid coat ChEBI 36080 protein R-HIV-175338 Multimeric matrix layer ChEBI 36080 protein R-HSA-2514779 Cleaved fibrillin-2 ChEBI 36080 protein R-CPE-9661737 UBGNR ChEBI 36080 protein R-CPE-9661715 BILR ChEBI 36080 protein R-HSA-9661701 UBGNO ChEBI 36080 protein R-HSA-9663470 BGET ChEBI 36080 protein R-ALL-9638756 RNA(n+1) ChEBI 83400 RNA polyanion

goodb commented 4 years ago

Wish I could rerun now, but will either have to wait for this to percolate through the release or ask for another custom export from the curator site (which I hesitate to do).

deustp01 commented 4 years ago

Let me see whether I can make any progress on the list of multiple enablers before we decide what to do here.

deustp01 commented 4 years ago

@goodb Here's an interesting multiple-enablers case: Activated JNKs phosphorylate c-JUN R-HSA-168136

We're asserting here that each of the three members of a set of proteins is capable by itself of enabling the molecular transformation annotated in the reaction. To do that, we've made the set the physical entity attribute of the reaction's catalystActivity, and made each of the set members an activeUnit attribute of that catalystActivity. If we can do this annotation consistently, could it be parsed correctly at the GO-CAM end, i.e., if the physical entity associated with catalysis is a set and the active units associated with that catalysis are all the members of the set, can this be interpreted to mean that set member 1 enables the annotated GO molecular function, set member 1 enables the annotated GO molecular function, ..., set member n enables the annotated GO molecular function?

R-HSA-168162 is another example of this.

goodb commented 4 years ago

@deustp01 erm.. This seems like an inappropriate use of the active-unit slot in your schema. That should really be used to specify which member of a complex is doing the business. In this case, there is no complex, and every member of the Set performs the action directly. If there is some difference between the members in terms of what they are doing there, then they shouldn't be in a Set object.

I would suggest that what you should do is just remove the active unit declaration here. Given the interpretation that a set is basically a shorthand way of saying the same things about each of its components, we would end up saying that each protein in that Set performs the catalysis. Without the active unit declaration, the Set would be processed as you say: MAPK10 enables JUN kinase activity, MAPK9 enables JUN kinase activity etc.

deustp01 commented 4 years ago

@goodb That's better - if the set notation itself forces the correct parsing, I'll remove the redundant / inappropriate activeUnit annotations.

goodb commented 4 years ago

If it doesn't, then its my problem! Either way you are done ;).

deustp01 commented 4 years ago

@goodb - cleanup of events with multiple active units. If I want to assert that either of two members of a protein family M1 and M2 is capable of joining a given complex with some other proteins A, B, and C and functioning as the active unit of a molecular function enabled by the complex, will it work to make a set [M1 or M2], annotate a complex whose components are A and B and C and [M1 or M2], and then annotate the set [M1 or M2] as the active unit?

If that will work, I will use that strategy to patch at least one reaction so far. If that strategy will not work or may come with other risks or difficulties (it looks close to the request you disallowed on Friday), I will simply annotate the complex as mediating the molecular function without specifying an active unit, and leave the issue of capturing the lost bit of information for version 2.

goodb commented 4 years ago

@deustp01 I think that will work, please go for it.

deustp01 commented 4 years ago

Here is a list of all of the reactions with the multiple enabler problem (based on curator-version BioPAX from Jan 8, 2020): Multiple Enablers.txt

I have gone through the list and resolved most of them, mostly by findng a reason to delete all or all but one of the entities listed as an active unit. In a few cases I've had to refer an instance to a curator for action because I was uncomfortable unilaterally resolving it. The Excel file is Ben's list with green color indicating fully resolved (I hope) instances and yellow indicating pending ones. The number in the last column refers to entries in the Word document attached here that give the rationales for what I did in each case. I will now try to turn those notes into a better organized proposal for how to use the active unit slot in Reactome catalyst activity instances for discussion both among ourselves and with Reactome curators. events_with_mutiple_enablers_20200127.xlsx events_with_mutiple_enablers_20200131.docx

goodb commented 4 years ago

@deustp01 it looks like we are pretty close here. When do you think I will have access to a BioPAX export that contains all of these updates?

ukemi commented 4 years ago

@deustp01, when I look at the annotations coming from the reactome gaf in AmiGO2, I only see NSF1 annotated to GO:0031071. NSF1, FXN, ISD1 (LYRM4) and ISCU are annotated to mitochondrial matrix from this pathway. Do you know how NSF1 is chosen from the R-HSA-1362401 complex? Am I looking at the pathway correctly? It seems to me that the complex has the catalyst activity and somehow in the gaf you have identified NSF1 catalytic subunit.

deustp01 commented 4 years ago

@ukemi The released version of this reaction has two catalystActivities, which I think causes the multiple-enablers issue. It has now been re-annotated to give it one catalyst activity (2Iron:FXN:NFS1:ISD11:ISCU [mitochondrial matrix] complex, with activeUnit PXLP-K258-NFS1-1 [mitochondrial matrix] (NFS1 protein covalently modified with pyridoxal phosphate), mediates GO:0031071 "cysteine desulfurase activity"), positively regulated by the physical entity that was previously annotated as the second catalyst. Bruce May, who curated the original event and the patch says it is a pretty good fit to the biology.

That activeUnit annotation for NSF1 is already present in the public two-catalyst version of the event and is correct. The fatal problem for GO-CAM is the second catalystActivity, now removed.

You can see the revised event on our internal site here. It will propagate into our public database, GAF, and BioPax export naturally with our next release in mid-March and should be available as BioPax sooner (@goodb) depending on Guanming's workload. I will ask about that.

ukemi commented 4 years ago

Got it! Thanks @deustp01. I'm still curious as the the origin of the conventional annotation of NFS1 to cysteine desulfurase activity (GO:0031071) and none of the other members of the complex though. This is off topic for this ticket, but I am curious.

goodb commented 4 years ago

Since the March release, recent updates to the conversion code and to the shex schema, we are now down to 8 models that don't validate. All models pass OWL consistency check.

5 of them fail because they have molecular function nodes with more than one enabling entity: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-6782135 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-6807505 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4615885 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-6785470 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8868766

3 of them fail because they have a Photon as the enabler of a molecular function: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2187335 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2485179 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2453902

For the latter, the march release seems to have Photon classified as a chemical entity (resulting in the shex violation as chemical entities can't be enablers). However, this seems to have been changed to the real entry for Photon in CHEBI https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:30212 which will result in it being typed as a subatomic particle - and likely continue to fail the validation for the same reason.
(note that current (April 18, 2020) noctua-dev has an outdated reacto driving it that is making photons into proteins and thus not showing the shex fail).

ukemi commented 4 years ago

I've been in discussion with @deustp01 about the photons. Let's bring it up on the next call, April 29. As for the multiple enablers, we will have to have a closer look a these. Is it because a complex has more than one catalytic subunit? Considering the number of models we have, this really isn't that bad!

deustp01 commented 4 years ago

I'll look today - multiple-enablers should be fix-able on the Reactome side. Photons - as far as the Reactome data model is concerned, photons can be inputs and in fact should generally (maybe always) not be enablers. (Even input has a problem with materiality, but there's no need to compound it by treating them as catalysts where GO-CAM expects genome-encoded informational macromolecules.

deustp01 commented 4 years ago

OK. The three enabling photons have been converted from catalysts to input physical entities. Events with multiple enablers - for local data model reasons, four of the five here may be hard to fix. One looks like a curation mistake that may be quite hard to track down and fix. So, @goodb , would it be possible for the GO-CAM building process to check for flawed enablers (the only surviving flaw should be multiple ones, but new curation could create new kinds), and discard the flawed enabler attribute, note its identifier so we can check it on our side, and build a GO-CAM with what's left? If that check + complain approach is easy to implement, that looks like it would be a fairly sturdy fix.

goodb commented 4 years ago

@deustp01 I think its possible, but not sure on the overall strategy. It seems like a deeper modification of the incoming model than we have done so far. An alternative is simply to report any disagreement with the schema (including this enabler problem and many other potential conflicts that could arise) and the reason, and stop with that. I kind of like that approach as its generally applicable.

Following down that road, we might imagine an option on the converter to delete any relations violating the schema. That would only be important if we had software that cared deeply about the schema and an intention to use these models in that software. I'm not sure that either of those conditions apply right now.

deustp01 commented 4 years ago

If I understand right, the number of remaining problem cases is small and most of the ones we know about require fixes on the Reactome side that are hard to do. We can also expect that new content added to Reactome will come with occasional items that violate GO-CAM rules. I was imagining the remaining problems were a major block to using the models in current applications, and if that's not true then the reports you describe are all we need now.

ukemi commented 4 years ago

"Hard to do" says to me "Maybe not wrong" and we might want to have a look at our rules. But you are correct. it's a small number of cases.

deustp01 commented 4 years ago

"Hard to do" is one of those context-dependent terms. Of the five non-photon cases that Ben flagged on April 18 above, one was a curation mistake (the active unit of a catalytic complex was not a component of that complex - now fixed; visible in June). The other four were black box events each of which describe multiple catalytic steps with known catalysts that occur in unknown order, so multiple enablers are crammed into a single event. This tactic preserves all the gene product - molecular function associations but breaks our logic (harmlessly) and GO-CAM's (not so harmlessly). Even if it were possible, it seems unlikely that a rule change on the GO / GO-CAM side is good here. On our side, we can go back to the biology and look for ways to salvage as much information as possible in a rule-compliant way.

The photon cases look like a nice example of "maybe not wrong" - as long as we don't call them (or anything else that's not a gene product) enablers, but use them as inputs, outputs, and the non-gene-product physical entities that mediate regulation instances, the remaining bit of Reactome - GO disconnect can be ignored safely. (If a small molecule like fructose-2,6-bisphosphate can regulate without breaking anything, then perhaps a photon could do it as well.)

ukemi commented 4 years ago

As we move forward though, we might want to devise a way to get annotations from those black boxes.

deustp01 commented 4 years ago

Definitely - part of the larger discussion of aligning data models better for version 2.

goodb commented 4 years ago

From the conversion code side, I believe we are not planning to make any more changes. There will be future changes on the reactome side, but no need for this issue as these will be ongoing with each release I imagine.

geneontology / pathways2GO

Ensure reactome go-cams validate against main GO shex schema #69