geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Convert chemicals to ChEBI rather than Reacto #221

Closed ukemi closed 1 year ago

ukemi commented 1 year ago

This came up during the QC checks of the last release. The initial discussion is pasted below. Also see the discussion from: https://github.com/geneontology/pathways2GO/issues/176#issuecomment-1280744188 onward.

http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1474151

Yes, I see in the BioPAX that input [PTHP](https://reactome.org/content/detail/R-ALL-1474179#Homo%20sapiens) has CHEBI:17804 xref'd to its entityReference (in the External Reference Information section). However, the conversion code currently doesn't look at xrefs from entityReference elements on a SmallMolecule object and instead just uses its Reactome ID. Same with output [sepiapterin](https://reactome.org/content/detail/R-ALL-1497811#Homo%20sapiens) and likely every other small molecule in Reactome GO-CAMs. We can open a ticket to change this behavior to always fetch the CHEBI if that is desired.
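
To make the proposed behavior concrete, here is a minimal sketch of "use the ChEBI xref on the entityReference when there is one, otherwise fall back to the Reactome ID". It is not the converter's actual code. The class names, the `chemical_curie` function, and the `REACTO:` prefix used for the fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Xref:
    db: str   # database name as written in the BioPAX xref, e.g. "ChEBI"
    id: str   # identifier, with or without the "CHEBI:" prefix


@dataclass
class EntityReference:
    xrefs: List[Xref] = field(default_factory=list)


@dataclass
class SmallMolecule:
    reactome_id: str                                  # e.g. "R-ALL-1474179"
    entity_reference: Optional[EntityReference] = None


def chemical_curie(molecule: SmallMolecule) -> str:
    """Prefer a ChEBI xref on the entity reference; else use the Reactome ID."""
    ref = molecule.entity_reference
    if ref is not None:
        for xref in ref.xrefs:
            if xref.db.strip().lower() != "chebi":
                continue
            local = xref.id.strip()
            # Accept both "17804" and "CHEBI:17804".
            if local.upper().startswith("CHEBI:"):
                local = local[len("CHEBI:"):]
            if local.isdigit():
                return f"CHEBI:{local}"
    # Hypothetical fallback prefix; this mirrors the current Reactome-ID behavior.
    return f"REACTO:{molecule.reactome_id}"
```

With the PTHP example above, a `SmallMolecule("R-ALL-1474179", EntityReference([Xref("ChEBI", "CHEBI:17804")]))` would resolve to `CHEBI:17804`, and a molecule with no entity reference would keep its Reactome-based identifier.
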
For the enabled_by (I think this is the real ShEx violation), sepiapterin synthase (R-HSA-9693721) is in the BioPAX as a PhysicalEntity. See its [entry](https://reactome.org/content/detail/R-HSA-9693721) at Reactome and notice it does not have a CHEBI cross reference. As a result, in reacto.owl, R-HSA-9693721 only has subClassOf continuant, which is not specific enough to be inferred as either InformationBiomacromolecule or ProteinContainingComplex.
PD comment: Agreed - the sepiapterin synthase (R-HSA-9693721) genome encoded entity has neither a UniProt reference link nor a crossReference to [ChEBI:36080](https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:36080) "protein". In contrast, for example, the [MHDB decarboxylase (R-HSA-2167848)](https://reactome.org/content/detail/R-HSA-2167848) does have a ChEBI:36080 crossReference, and its pathway [Ubiquinol biosynthesis](https://reactome.org/content/detail/R-HSA-2142789) (R-HSA-2142789) yields a GO-CAM with no ShEx error. Now patched in Reactome for the ver 83 release. Bottom line: a Reactome curation mistake that occurred after the previous clean-up of physical entities with no acceptable link to UniProt or ChEBI. Again, something easy to flag and fix during the hypothetical future one-week clean-up period.

deustp01 commented 1 year ago

N.b. Once we are sure that all chemicals do have ChEBI IDs, we need to clean up REACTO to remove chemical IDs. We also need a QA check and a way to automatically convert items like David Hill's manual mouse GO-CAMs that have REACTO chemicals in them, plus changes to ShEx to fix this. Long-term goal: retire REACTO. Short-term process: retire parts of it as possible. Chemicals here, proteoforms soon. Link to Ben's global ticket: https://github.com/geneontology/noctua-models-migrations/issues

ukemi commented 1 year ago

Separate from the missing (unresolving) ChEBI identifiers that were already spotted, I need a way to check the integrity of the identifiers that are resolving. The best method that I can think of is to open the model in the graph editor, output the GPAD, cross-check the label in the graph editor against the ChEBI identifier in the GPAD, and then cross-check those against ChEBI. I will also check them with respect to the cross references in Reactome. This is a labor-intensive manual process, but I think it is necessary to ensure that things happened correctly. I will start a spreadsheet and link it to this ticket. I'm not sure how many I will check, but I will look at several different pathways and several different kinds of reactions.

nataled commented 1 year ago

@ukemi that shouldn't have to be done manually. I can probably whip up a way to check automatically once given the GPAD. This is not to stop any manual work that could proceed while the automated check is in progress (that's my usual procedure anyway).
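
As a rough idea of what such an automated check could look like, the sketch below pulls every `CHEBI:` identifier out of the non-comment lines of a GPAD file and compares the label recorded from the model with the name from a ChEBI lookup table. It makes no assumption about which GPAD column holds the identifier. The function names and the two lookup dictionaries (`model_labels`, `chebi_names`) are hypothetical, and how those tables get populated is left open.

```python
import re

# Matches CURIEs such as CHEBI:17804 anywhere in a line.
CHEBI_ID = re.compile(r"\bCHEBI:\d+\b")


def chebi_ids_in_gpad(lines):
    """Collect the distinct ChEBI CURIEs found in non-comment GPAD lines."""
    found = set()
    for line in lines:
        if line.startswith("!"):  # GPAD header/comment lines start with "!"
            continue
        found.update(CHEBI_ID.findall(line))
    return found


def check_labels(ids, model_labels, chebi_names):
    """Return (id, problem) pairs for IDs whose labels cannot be confirmed.

    model_labels: ChEBI ID -> label shown in the model (assumed input)
    chebi_names:  ChEBI ID -> name taken from ChEBI itself (assumed input)
    """
    problems = []
    for chebi_id in sorted(ids):
        official = chebi_names.get(chebi_id)
        shown = model_labels.get(chebi_id)
        if official is None:
            problems.append((chebi_id, "not found in ChEBI table"))
        elif shown is None:
            problems.append((chebi_id, "no label recorded from the model"))
        elif shown.strip().lower() != official.strip().lower():
            problems.append((chebi_id, f"label {shown!r} != ChEBI name {official!r}"))
    return problems
```

An empty list from `check_labels` would mean every identifier in the GPAD resolved and carried the expected label; anything else is a short worklist for manual follow-up.
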

ukemi commented 1 year ago

That would be awesome @nataled! @dustine32 do you know if there are products generated from the development server? If so, is there a GPAD that @nataled could use? Even if it is a mega-file, it would be straightforward to filter on annotations from Reactome models. We might actually want to put something like this in place beyond just for this project.
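
For the mega-file case, filtering could be as small as the sketch below. It assumes that each annotation line carries its model identifier somewhere in the line and that Reactome-derived models keep the `gomodel:R-HSA-` prefix seen in the model linked above. Both points are assumptions that would need checking against a real export, and the function name is made up.

```python
def reactome_model_lines(lines, marker="gomodel:R-HSA-"):
    """Yield non-comment GPAD lines that mention a Reactome-derived model ID.

    `marker` is an assumed substring; adjust it if the export writes
    model identifiers differently.
    """
    for line in lines:
        if line.startswith("!"):  # skip GPAD header/comment lines
            continue
        if marker in line:
            yield line
```

Because it is a generator, it can be fed an open file handle directly, so even a very large file is streamed line by line rather than loaded into memory.
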

deustp01 commented 1 year ago

> I need a way to check the integrity of the identifiers that are resolving.

Item for Monday "weeds" - what exactly are the integrity problems (wrong charge states of ionizable compounds? other?)? In principle, this is really a Reactome curation integrity issue: we should only be using correct ChEBI instances in the first place, so the follow-on question is how to change Reactome curation and QA practice to fix them at the source. And, as suggested on Wednesday, get rid of ChEBI terms used to identify polynucleotides where SO terms would work. And, probably, also identify classes of ChEBI instances that Reactome needs to annotate weird cases - perhaps we really need those electrons and photons - to add to Jim's list of ChEBI terms legal for GO-CAM.

Your spreadsheet will be a good resource for starting to sort this out.

ukemi commented 1 year ago

It's way simpler than that. In my own worrisome way, I just want to make sure that the process worked. That is, when I see a chemical in a model, it is the chemical that was in the original Reactome pathway, and the label on the chemical is correct through the chain of Reactome ID -> ChEBI ID in Reactome -> ChEBI ID in GO-CAM -> GO-CAM graph label.
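
That chain can be written down as a hop-by-hop check that reports the first link that fails for a given chemical. This is only a sketch: the function name and the four lookup tables are hypothetical, and filling them from Reactome, the GO-CAM export, and ChEBI is not shown.

```python
def first_broken_hop(reactome_id, reactome_xref, gocam_id, gocam_label, chebi_name):
    """Return the name of the first failing hop, or None if the chain holds.

    reactome_xref: Reactome ID -> ChEBI ID cross-referenced in Reactome
    gocam_id:      Reactome ID -> ChEBI ID actually used in the GO-CAM
    gocam_label:   ChEBI ID    -> label shown in the GO-CAM graph
    chebi_name:    ChEBI ID    -> name according to ChEBI
    (all four tables are assumed inputs)
    """
    expected = reactome_xref.get(reactome_id)
    if expected is None:
        return "Reactome ID -> ChEBI ID in Reactome"

    used = gocam_id.get(reactome_id)
    if used != expected:
        return "ChEBI ID in Reactome -> ChEBI ID in GO-CAM"

    label = gocam_label.get(used)
    name = chebi_name.get(used)
    if label is None or name is None or label.strip().lower() != name.strip().lower():
        return "ChEBI ID in GO-CAM -> GO-CAM graph label"

    return None
```

Running it over every chemical in a pathway and collecting the non-`None` results gives roughly the same information as the spreadsheet columns, one row per broken link.
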

ukemi commented 1 year ago

I did a bit of this today and I am convinced that the integrity of the information being transferred is intact: https://docs.google.com/spreadsheets/d/1-NxsN6eVxxWuAGuH9tGX0W2FMi3mc90AodOOxaOP7sY/edit#gid=0 I looked at the ones in the spreadsheet in detail, but also checked other reactions in the pathways.