Closed goodb closed 4 years ago
Fingers crossed!
Hit two issues so far.
Here is an example of the latter problem, which is very common.
@vanaukenk in relation to https://github.com/geneontology/go-shapes/pull/160 I am running all of the current conversions through the validator. It looks like about 25% of them do not pass as things stand today. I will catalogue failures here as I diagnose them.
Example 1 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8852276 There are cases in that one where a function is asserted to 'directly_provide_input_for' a process. Note that the process in this case is not specified (this is the current resolution for transport) e.g. look for "GTSE1 promotes translocation of TP53 to the cytosol"
Same problem in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-168638 'K63 polyubiquitinated RIP2 associates with the TAK1 complex' 'directly_provide_input_for' 'TAK1 is activated'
Another similar example: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8941333 contains positive_regulates and causally_upstream of links from functions to untyped processes.
more examples:
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-983231 same again.
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-877300
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-5660668
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9026290
Another problem case. http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1482788
It looks like at least one problem here is the constraint that functions be enabled by InformationBiomacromolecules (or complexes). We have Set entities filling in the enabled_by slot that are only typed as 'chemical entities' (root). Schema is fine here, need to make the typing for Sets better.
In this case the Set is a mix of proteins and one complex. https://reactome.org/content/detail/R-HSA-1524112 maybe goes as a curation problem for Reactome - @deustp01 ?
The conversion sometimes results in functions with more than one enabler, e.g. "PH receptors autophosphorylate" in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2682334
The schema currently allows only 0 or 1 enablers.
Same for 'Phosphorylation of IRF-3/IRF7 and their release from the activated TLR3 complex' in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9013973
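A minimal sketch of the "0 or 1 enablers" rule, for anyone who wants to screen models outside the validator. It groups `enabled_by` edges by function and reports any function with more than one enabler. The triple layout, the function name `functions_with_multiple_enablers`, and the node labels are assumptions made for illustration; this is not the ShEx schema or the converter code.

```python
from collections import defaultdict

ENABLED_BY = "enabled_by"  # relation label used in this sketch


def functions_with_multiple_enablers(edges):
    """Return {function: sorted enablers} for functions with more than one enabler.

    `edges` is an iterable of (subject, predicate, object) triples.
    """
    enablers = defaultdict(set)
    for subject, predicate, obj in edges:
        if predicate == ENABLED_BY:
            enablers[subject].add(obj)
    return {fn: sorted(objs) for fn, objs in enablers.items() if len(objs) > 1}


# Hypothetical edges: function_A has two enablers, function_B has one.
edges = [
    ("function_A", "enabled_by", "entity_1"),
    ("function_A", "enabled_by", "entity_2"),
    ("function_B", "enabled_by", "entity_3"),
]
violations = functions_with_multiple_enablers(edges)
print(violations)
```

Collecting the enablers in a set means a repeated edge to the same entity is not counted as a second enabler.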
Interesting again! We are faithfully representing the info.
In the first case, the function is enabled by EPHRINAs or EPHRINBs. Biologically this is correct. Any set will do this, but the restriction is hidden because the set is a single entity. Spot checking the annotations for some of the EPHRIN genes in AmiGO shows that current annotations support these individual gene products enabling the molecular functions.
In the second example, it is also correct. Either entity can enable the kinase activity. (I'm not going to go down the path as to whether this actually represents a process of phosphorylation that is made up of two molecular functions. :) )
However, I thought it was against Reactome curation policy to allow more than one active unit in a reaction. @deustp01? Would it have been more consistent to create a set in these cases: p-S172-TBK1 [plasma membrane] and p-S172-IKBKE [plasma membrane]?
I was actually hoping we could go the route of more than one active subunit to solve the BMP receptor complex issue. If we could enumerate all of the catalytic subunits of the complex when we ditch the contributes_to rule, we could get more info. This skirts the whole issue of 'emergent functions', but do our users really care about that? When they ask for a list of genes that can carry out a function, don't they want all the members of a catalytic complex? @vanaukenk Complexes Complexes Complexes. Another alternative would be to propose that Reactome (and GO curators) somehow capture this info. Then how do we serve it up?
@ukemi @deustp01 it looks like recently introduced changes to the conversion have cured most of the problems above. Out of a sample of 650 converted pathways, only 51 now fail the shex validation test. (So about 92% success rate). I checked 10 of the failures. Of these 10, 3 were caused by multiple enablers on a reaction node and 7 by untyped gene product nodes. Examples of reactions with untyped gene product nodes enabling reactions include:
Asymmetric localization of PCP proteins - unknown kinase
CS DS degradation.ttl - "beta-xylosidase" R-HSA-2247521
Metabolism of ingested H2SeO4 and H2SeO3 into H2Se - R-HSA-5359054
Glyoxylate metabolism and glycine degradation - R-HSA-6784434
Signal transduction by L1 - Phosphorylation of VAV2
Aflatoxin activation and detoxification - Unknown NAT transfers COCH3 to AFXBO-C, AFNBO-C
Vitamin B1 (thiamin) metabolism - TDPK (looks like reaction has been changed in release 70 but still untyped)
When the converter encounters a node without any information beyond "physical entity", as in the cases above, it types it as a chebi 'chemical entity' (which is the root of all physical entities in our system). The shex schema does not like functions to be enabled by anything aside from gene products or complexes.
Thoughts on this problem? I could infer that if an untyped node is enabling a reaction, it must be an information macromolecule and make that assignment when the models are created. Would that be okay?
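To make the proposal concrete, here is a rough sketch of the inference: any node that is still typed only as the root 'chemical entity' and that fills an `enabled_by` slot gets retyped as an information biomacromolecule (CHEBI:33695). The data layout and the name `infer_enabler_types` are hypothetical, and the real converter may represent nodes quite differently.

```python
CHEMICAL_ENTITY = "chemical entity"    # root type assigned to untyped nodes
INFO_BIOMACROMOLECULE = "CHEBI:33695"  # information biomacromolecule


def infer_enabler_types(node_types, edges):
    """Return a copy of node_types with untyped enablers retyped.

    node_types maps node id -> type label.
    edges is an iterable of (subject, predicate, object) triples.
    """
    inferred = dict(node_types)
    for _subject, predicate, obj in edges:
        if predicate == "enabled_by" and inferred.get(obj) == CHEMICAL_ENTITY:
            inferred[obj] = INFO_BIOMACROMOLECULE
    return inferred


# Hypothetical model fragment: one untyped enabler, one untyped input.
node_types = {"unknown_kinase": CHEMICAL_ENTITY, "some_input": CHEMICAL_ENTITY}
edges = [
    ("reaction_1", "enabled_by", "unknown_kinase"),
    ("reaction_1", "has_input", "some_input"),
]
inferred = infer_enabler_types(node_types, edges)
print(inferred)
```

Only nodes in the enabler role are retyped; untyped inputs stay as they are, since nothing about being an input implies a macromolecule.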
Interesting, Peter? It might be a better strategy to type these in Reactome. Some look like black-box reactions.
PS. would like Peter to comment too.
If the ratio stays the same, it only looks like there are ~15 pathways with multiple enablers that will require new sets. I predict all the others are black-box reactions.
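The numbers in this estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope extrapolation, and it assumes the 10 inspected failures are representative of all 51.

```python
total_pathways = 650
failed = 51

# Share of the sample that passes validation.
success_rate = (total_pathways - failed) / total_pathways

# Of the 10 failures inspected, 3 had multiple enablers and 7 had untyped enablers.
inspected = 10
multi_enabler_share = 3 / inspected
untyped_share = 7 / inspected

# Extrapolate those shares to all 51 failures.
est_multi_enabler = failed * multi_enabler_share
est_untyped = failed * untyped_share

print(f"success rate: {success_rate:.1%}")
print(f"estimated multi-enabler pathways: {est_multi_enabler:.1f}")
print(f"estimated untyped-enabler pathways: {est_untyped:.1f}")
```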
Yes, David's predictions seem plausible. I will look carefully.
Out of a sample of 650 converted pathways, only 51 now fail the shex validation test. (So about 92% success rate) @goodb could you send me the list of 51 so I can check?
@goodb. Welcome to the world of the in-the-weeds curators! :)
Examples in your emails.
Thanks @goodb. @deustp01 do you want to look yourself, split them up or look together?
I've done the first seven from the list that Ben added to this ticket earlier today.
Two conclusions. First, in all cases the problem is that we have asserted that a protein with no UniProt ID mediates a function defined by a proper GO molecular function term. Our goal is to associate functions with gene products, so this kind of annotation is rare - we should do it only when a reaction is known to happen even though the protein that mediates it has not been characterized enough to allow it to be identified, and we need the reaction to let us connect all the parts of a process together. We in fact have a separate class of physical entity, genomeEncodedEntity, used to annotate such unidentified proteins. So could this class distinction be made into a test to tell the ShEX QA that it's looking at an instance with missing information rather than an incorrect one?
EntityWithAccessionedSequence - the class that holds proteins that do have UniProt IDs - is a child of genomeEncodedEntity. We strictly enforce the rule that an EWAS must have a UniProt attribute, so any member of the GEE class that lacks a UniProt identifier must be a plain GEE rather than an EWAS.
Second conclusion / question. As you'll see from the details below, sometimes it looks like a catalystActivity instance is the target of the ShEX complaint. This is what I'd expect because this is the class that combines a specific entity with a specific molecular function, so this is where unexpected kinds of physical entities should cause problems. But in some cases, reactions or whole pathways seem to have gotten flagged. This is not a big practical problem - as you've already noticed, drilling down to the problem entity is easy. I'm just surprised by the irregular granularity.
Now the weeds.
Asymmetric localization of PCP proteins----Clearly the unknown kinase should be an enzyme/gene product or complex Yes. In the same way that there is a reactionlike event class that has specific types of reactions as children (so GO-CAM could point to the parent when the exact reaction type is unknown or has no associated GO molecular function term), there is a genome-encoded entity class which has entities with accessioned sequences (EWAS) (i.e., canonical forms of proteins with UniProt IDs) as its major child class. However, it is also legal to create instances of the parent genome-encoded entity class when there is evidence that a protein (alone or as part of a complex) does something but not enough evidence to identify the protein uniquely, as here. How to detect this circumstance? All members of the EWAS class must have associated UniProt IDs – this is rigidly enforced – so any annotation of a macromolecule with no identifier must be an instance of the genome-encoded entity class. Something I don’t understand about this instance is why the pathway is flagged while in problem cases 2 and 3 it’s the individual reaction. It looks like the ShEX issue is at the catalystActivity level in all cases.
CS DS degradation.ttl - "beta-xylosidase" R-HSA-2247521----is a Genes and Transcripts [GenomeEncodedEntity] this should certainly be an information biomacromolecule. Yes, exactly like the kinase in example 1.
Metabolism of ingested H2SeO4 and H2SeO3 into H2Se - R-HSA-5359054----is a Genes and Transcripts [GenomeEncodedEntity] (PAPSe reductase) this should certainly be an information biomacromolecule. Yes, exactly like the kinase in example 1.
Glyoxylate metabolism and glycine degradation - R-HSA-6784434----- This is a black-box reaction. Is it safe to assume it is an information biomacromolecule? peroxisomal glyoxylate carrier [peroxisomal membrane] Yes, exactly like the kinase in example 1. Another granularity oddity. In examples 2 and 3 it looks like the catalystActivity instance (the class that juxtaposes an entity and a molecular function to assert the entity enables the function) appears to be what was flagged by the ShEX script, and that’s right. In example 1, a whole pathway involving multiple binding reactions and one catalyzed reaction got flagged. Here (example 4) the whole reaction got flagged.
Signal transduction by L1 - Phosphorylation of VAV2-----Another black box unknown kinase. Safe to assume it is a biomacromolecule. Yes, exactly like the kinase in example 1.
Aflatoxin activation and detoxification - Unknown NAT transfers COCH3 to AFXBO-C, AFNBO-C--------Another unknown NAT [cytosol] black box. Safe to assume it is a biomacromolecule? Yes, exactly like the kinase in example 1.
Vitamin B1 (thiamin) metabolism - TDPK--------Looks like another enzyme where we know it exists, but don't know what it is. Should be a biomacromolecule. Yes, exactly like the kinase in example 1.
Thanks @goodb. @deustp01 do you want to look yourself, split them up or look together?
I'll start by myself. If the pattern of the first seven holds up, we may already have all the information there is to get.
Apologies @deustp01 I must always remember to be more precise when sending spreadsheets to you. The notes column is not a specific indication of what aspect of the model failed the shex; it was just where I was jotting things down as I looked through them so that I had a trail to find my way back.
All of the untyped gene product node cases are molecular function nodes failing against the rule that they can only be enabled by proteins or complexes. Those failures have cascading effects on other nodes connected to the problem case by causal relations.
When I look at the BioPAX for these pathways, I don't see any signs of the genomeEncodedEntity class. Unless there is an accession number, all I see in there is that it is a Physical Entity.
I'll ask Guanming about how / whether kinds of macromolecule physical entities are distinguished in our BioPax export. (No problem about the granularity.)
If you are looking at the models on noctua-dev we have a new toy for you. When examining a model that you think may be invalid, turn on the reasoner and look for the red boxes. Red boxes indicate nodes that are failing validation. Clicking on the nodes will give you a geeky json representation of what went wrong. (It will be blank if it's the 2 enablers problem - known bug.) Try it on this one: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4608870 The reaction in the middle is the culprit. Notice how it also invalidates the reactions with causal relations leading into it.
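For anyone trying to trace a red box back to its root cause, the sketch below shows one way the cascade could be modelled: starting from the failing node, walk causal edges backwards and mark every node that leads into it. It assumes that invalidation propagates transitively through incoming causal relations, which is an assumption on my part and not a statement of how the validator is implemented. The node names and the function `nodes_invalidated_by` are hypothetical.

```python
from collections import deque


def nodes_invalidated_by(failing, causal_edges):
    """Return the failing nodes plus every node with a causal path into them.

    causal_edges is an iterable of (source, target) pairs, meaning
    "source has a causal relation leading into target".
    """
    incoming = {}
    for source, target in causal_edges:
        incoming.setdefault(target, set()).add(source)

    invalid = set(failing)
    queue = deque(failing)
    while queue:
        node = queue.popleft()
        for source in incoming.get(node, ()):
            if source not in invalid:
                invalid.add(source)
                queue.append(source)
    return invalid


# Hypothetical chain: reaction_A -> reaction_B -> reaction_C -> reaction_D
causal_edges = [
    ("reaction_A", "reaction_B"),
    ("reaction_B", "reaction_C"),
    ("reaction_C", "reaction_D"),
]
invalid = nodes_invalidated_by({"reaction_C"}, causal_edges)
print(sorted(invalid))
```

In this toy chain the downstream reaction is unaffected, which matches the observation that the reactions leading into the culprit are the ones that turn red.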
@deustp01 here is what I believe is a complete list of pathways in Reactome release 69 that contain reactions that are enabled by more than one entity.
I'll ask Guanming about how / whether kinds of macromolecule physical entities are distinguished in our BioPax export.
And the answer is, "For GEE, we hacked BioPAX a little bit by converting a GEE into a PhysicalEntity directly. This is the best approach I could find. The same is used for any Reactome PhysicalEntity instances that don’t have ReferenceEntities assigned."
So I guess for Reactome to GO-CAM conversion via BioPAX, these instances can be found by testing each incoming monomer physicalEntity to determine whether it has a UniProt or ChEBI reference. If not, then it is an empty placeholder, to be handled (? or not?) by the physical entity equivalent of the reactionLikeEvent placeholder for transformations that are out of scope for GO like dissociations.
If they have references, they are typed correctly in the BioPAX and the conversion, so that is handled.
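A small sketch of the reference test described above: an entity counts as a placeholder when it carries neither a UniProt nor a ChEBI reference. The dictionary layout with an `xrefs` list and the name `is_untyped_placeholder` are hypothetical simplifications, not the BioPAX object model or any real parser API.

```python
# Hypothetical, simplified record for a BioPAX physical entity:
# {"name": str, "xrefs": [(database, identifier), ...]}
REFERENCE_DATABASES = {"UniProt", "ChEBI"}


def is_untyped_placeholder(entity):
    """True when the entity has no UniProt or ChEBI reference."""
    return not any(db in REFERENCE_DATABASES for db, _ in entity.get("xrefs", []))


entities = [
    {"name": "TP53", "xrefs": [("UniProt", "P04637")]},
    {"name": "unknown kinase", "xrefs": []},
    {"name": "entity with no xrefs key"},
]
placeholders = [e["name"] for e in entities if is_untyped_placeholder(e)]
print(placeholders)
```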
It's a shame BioPax lumps SmallMolecule in with the biological macromolecules as direct children of physical entity. Otherwise we could equate physical entity with information biomacromolecule CHEBI:33695, which I think is the placeholder we are looking for.
I suppose we could hack the biopax again with a comment for the genome encoded entities, but in this case it seems that it is probably not worthwhile to do so as the overall information gained is low.
Could we use the information that these entities are catalyzing reactions to infer that they are biomacromolecules? Are we safe to assume that Reactome would not record an untyped small molecule catalyzing a reaction?
It's a shame that we can't just equate or create a relationship between a GEE and a biomacromolecule. Would it be safe to do this in the BioPax itself?
Could we use the information that these entities are catalyzing reactions to infer that they are biomacromolecules? Are we safe to assume that Reactome would not record an untyped small molecule catalyzing a reaction?
The answer is almost always, yes. Here is the one exception I know of: we have overloaded the concept of catalysis to assert that a photon mediates the activation of rhodopsin.
If abuses like this are really rare (I think they are), then they could be eliminated on the Reactome side by re-annotating this reaction and any other where we have assigned an enzyme role to a small molecule. Even if small molecules do catalyze reactions in the body (I think that's really rare), Reactome and GO both want to capture the functions of gene products so such a catalysis is out of scope for both. If either Reactome or GO needs the reaction for connectivity in a process, we can find another way to annotate it.
I don't see a way to do it without going beyond what is in the standard. When we extended it to handle active units, we did so in comments. Not really ideal.
@deustp01 this is really feeling more like a Reactome curation issue than a conversion issue. In many cases there is clearly more knowledge about the mystery molecules than we would capture by 'information biomacromolecule' - e.g. we know it's a kinase etc. I wonder if it would be best to leave the conversion as it is, and push the task of adding references to higher level classes (protein, RNA, complex, small molecule) when exact references are unknown to the Reactome curators? e.g. if Photon was typed as SmallMolecule, even without a more specific chebi id, we would be all set.
Face-to-face next week! (Guanming also points to our API as a way to retrieve the disambiguating information from us.)
In many cases there is clearly more knowledge about the mystery molecules ...
I need to look at some cases, working from your spreadsheet. So far, what I see is teasers: we can infer that enzyme-mediated phosphorylation is going on but we can't identify the gene product with the enzyme activity, and I think that leaves us just short of being able to make an assertion that is in scope for either Reactome or GO.
We can make an assertion that there is an enzyme at work and that it is 'probably' catalyzed by an information biomacromolecule (protein or complex). Seems this is a strength in the sense that it is a definite query for something we know exists, but don't know what it is. It is highly unlikely that these reactions proceed spontaneously. Wouldn't you agree @deustp01?
@ukemi Yes, agreed. I was more going off on the tangent of information loss, to conclude that absent a named gene product to associate with the molecular function, there is not much to gain, so a ShEX patch is good if it's safe and easy, but maybe not otherwise.
Examples in your emails.
I've gone through all the instances marked "fail" in the first column of Ben's table from 2 days ago. Here are the results as an Excel workbook. The first sheet is Ben's table with what I found in the comments column - mostly failures because we annotated a protein with no UniProt value. (These are cases where the available data make a good case that a protein enables a GO molecular function but don't allow the protein to be identified.) A few involved a similarly vague or generic small molecule, and in a few cases I couldn't find the problem even with the help of the handy new analyzer feature that puts a red dot on problem things in the GO-CAM display.
Also one Reactome error: we asserted that SRC was the catalyst of a reaction but made a phospho-SRC entity the active unit responsible for catalysis. And a few more cases where a reaction has two enablers, all fixable by creating set instances to hold all the enablers or something else equally straightforward.
In the second sheet, I've taken all the failed rows and resorted them on the reason and notes columns to reinforce the point (already suspected by Ben and David) that there are recurring problems - unidentified kinase and ubiquitin ligase are popular, as is "photon" as a physical entity.
it looks like the last three errors are more of the same:
This is great. The only tricky thing left is the photon.
ChEBI now has a term for photon - 30312 - so it will become a simpleEntity (like glucose, ATP, etc.) in our next release.
Is there a resolution here @ukemi ? We know what the errors are (and pretty cool that less than 8% of all the models had any error and that there appear to be only two kinds of errors). But I'm not sure what the plan is for dealing with them. For me, I'd like to see Reactome make use of an ontology such as CHEBI to annotate the untyped physical entities with upper-level ontology classes that exist outside of their internal schema. If that happened and was reflected in the BioPAX we could handle that one easily in the converter.
Likewise for the multi-entity enabler cases, I think that falls to them to work out whether or not they want to change their models. (There are a total of 47 models like this).
Positive news here is that for the use case of querying the system for the functions of specific genes, neither of these shex violations matters. In the first, there are no genes and in the second, the queries ought to work fine.
For me, I'd like to see Reactome make use of an ontology such as CHEBI to annotate the untyped physical entities with upper-level ontology classes that exist outside of their internal schema.
Actually, nothing in our schema prevents using high-level ChEBI terms. In fact, in cases where we are intending to make a statement about something generic like trypsin enzyme cleaving a peptide bond involving a lysine residue regardless of the rest of the amino acid sequence, we already do that. (Trypsin enables the cleavage of a polypeptide into two peptides.)
As for photon, working through the list of small molecule failures will be a good way to go back and fix the cases where a generic is available (and if it makes chemical sense, to ask ChEBI for any missing generics).
Hi @ukemi and @deustp01. Wondering what the resolution here ought to be for the untyped physical entity problem. In summary, we see some reactions that are catalyzed by molecules with no entity reference (e.g. no uniprot id). These are typed simply as physical entities in the biopax, which could mean any of {Complex, DNA, DNARegion, Protein, RNA, RNARegion, SmallMolecule}. The conversion process currently types these entities with CHEBI:chemical entity - the root of all physical entities in GO ontology space. The shex schema prescribes that any function/reaction only be enabled by / catalyzed by an InformationBiomacromolecule OR ProteinContainingComplex.
So, to bring these conversions into alignment, we could:
For me, I like 3, if it's possible.
?
Option 3 should work. On the spreadsheet reactome_shex_fails.xlsx from two weeks ago (above), all the instances that I labeled GEE can be re-classified as 'protein' CHEBI:36080, and there is a ChEBI instance for photon (now imported into Reactome and used to re-class our photon instances as simple entities (chemicals)). One 'other entity - DNA' can be an 'information biomacromolecule' CHEBI:33695 and 12 other problems are artifacts of irregular annotation on our side, not problems with the physical entities themselves. That leaves fewer than five cases that can't be handled.
We could probably find a way on the Reactome side to get that GEE = CHEBI:36080 into the annotation so that Ben does not need another ad hoc patch for the GO-CAM import.
Continuing in that direction, if we can get CHEBI:36080 protein associated with GEEs that are known to be proteins (but we don't know or can't specify which one), the same database hack and logic should let us associate CHEBI:16991 deoxyribonucleic acid and CHEBI:33697 ribonucleic acid with GEEs that are known to be DNAs and RNAs respectively (but we don't know or can't specify which one). I expect we won't need to go as far as CHEBI:33695 information biomacromolecule - at that level of generality / ignorance we shouldn't annotate.
I expect we'd need to add these tags using the cross-reference attribute of the GEE class, so there's the weedy issue of whether this attribute gets into the BioPAX and can reliably be retrieved on the GO-CAM end.
The hack is that CHEBI instances are stored in the referenceMolecule class of Reactome, which is intended to hold the reference versions of small molecules (e.g., leucine, glycine) or precisely specified classes like amino acid, so maybe putting RNA, DNA, and protein terms in that class is overloading. Seems harmless but if this looks like a good way to go I'll check with Guanming.
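The tagging proposed in this comment amounts to a small lookup from the known kind of a GEE to a generic ChEBI term. Here is a minimal sketch using the three identifiers quoted above. The kind labels (`"protein"`, `"DNA"`, `"RNA"`) and the function name `generic_cross_reference` are hypothetical, and how the cross-reference is actually stored in Reactome or exported to BioPAX is not shown.

```python
# Generic ChEBI terms quoted in the discussion.
GENERIC_CHEBI = {
    "protein": "CHEBI:36080",  # protein
    "DNA": "CHEBI:16991",      # deoxyribonucleic acid
    "RNA": "CHEBI:33697",      # ribonucleic acid
}


def generic_cross_reference(kind):
    """Return the generic ChEBI ID for a GEE of known kind.

    Returns None when the kind is unknown, in which case no tag is added.
    """
    return GENERIC_CHEBI.get(kind)


print(generic_cross_reference("protein"))
print(generic_cross_reference("unknown"))
```

Returning `None` for an unknown kind reflects the point that at that level of ignorance no annotation should be made, so no fallback to a more general term is attempted.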
From the external data consumer view, this would be a helpful increase in the information provided. I can't speak to how it fits into the internal data structure, but hope that you can find a way.
OK. It's on the to-do list. Result will either be a better-defined concern from Guanming or a set of cross-reference instances pointing Reactome GEE instances to CHEBI:36080 protein, in pathways for you to work with.
Hi @deustp01 just checking in here. As I understand it, the ball is in your court for the shex validation issues that we know about right now.
Yes. There was an attack of grant-writing. I'll have time the remainder of this week.
OK. I’ve re-read the ticket, got Guanming’s approval for adding high-level ChEBI terms (protein, chemical entity, etc.) as cross-refs to the otherEntity and genomeEncodedEntity instances, and am getting to work on the actual tagging. otherEntities look tedious but straightforward, genomeEncodedEntities look easier once we figure out how to distinguish them from the referenceGeneProduct instances they are classed with. Indeed, if Ben has or can easily make a list of all problem genomeEncodedEntity instances that will be a useful shortcut (but not crucial – the search can be done here; it just needs developer-level search skills).
I don't think there is a way to distinguish the genomeEncodedEntities from the OtherEntitys in the BioPAX, but here is a list of all the untyped physical entities in the last release. Hope that helps a little.
It does - thanks.
All otherEntity (miscellaneous incompletely described molecular entities) instances now have crossReference attributes that point to something, generally something quite generic, in ChEBI. GenomeEncodedEntity (informational macromolecular monomers that are not well-enough characterized to map to a unique UniProt instance or equivalent) instances remain to be done but with Ben's table from earlier today that should be do-able fairly quickly.
Our plant / Gramene (rice pathway) colleagues, in annotating pathways, have created physicalEntity (otherEntity) instances like "cold temperature" and "long day photoperiod" (which turn up in untyped_physical.txt) together with conventional incompletely specified chemical entities, and used them as inputs or regulators of reactions.
A proper cleanup is not likely to happen any time soon.
@goodb Can I use ChEBI:50906 "role" (A role is particular behaviour which a material entity may exhibit) to create a cross-reference value for such otherEntity instances and will that safely trick the GO-CAM import process into thinking these are valid physical entities for shex purposes? I'm guessing / hoping that this will be OK - it's up to the Gramene people to decide whether this misleads their users and, if so, what to do.
This is a follow-up to the discussion Wednesday (Path2GO call).
Clean-up in Reactome gk_central database should now be done. The problem instances were ones that could not be mapped 1:1 to a UniProt protein or Ensembl gene or RNA transcript or to a ChEBI chemical entity. OtherEntities is a catchall class of entities ("polypeptide", "amino acid") that can be classified to some extent using high-level ChEBI terms (e.g., ChEBI:36080 "protein"). All otherEntity instances in gk_central now have ChEBI identifiers as cross-references.
GenomeEncodedEntities are informational macromolecules (or defined pieces of them created by processing of a transcription or translation product) that can't be mapped unambiguously to an instance in a reference database (e.g., a purified, well-studied enzyme whose amino acid sequence is unknown). Some of these instances are in Ben's list untyped_physical.txt; additional ones were retrieved with a Cypher query of our public database. All have now been supplied with cross-references to ChEBI terms for "protein", "DNA", "RNA", or somewhat more fine-grained ones ("tRNA") when appropriate.
Known limitations. First, as in the comment from earlier today, a small number of the otherEntity instances aren't really physical entities - I hope the added ChEBI terms in these cases will let them pass the shex requirements without corrupting anything else. Second, I tried to be consistent in my use of high-level ChEBI terms but I'm sure there is some variability here. Nothing, I hope, is badly mislabelled and everything should now pass the shex requirements.
Revisit and rework conversion rules or shex schema to bring them into alignment. Close when everything passes. Re-open individual tickets for future failures.