geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)
8 stars 0 forks source link

Not all drug reaction elements are deleted from model #264

Closed dustine32 closed 9 months ago

dustine32 commented 1 year ago

This originates from https://github.com/geneontology/pathways2GO/issues/204#issuecomment-1494814740.

In Reactome pathway R-HSA-204174, small molecule glutathione (GSH) is just floating, orphaned from its original reaction R-HSA-9011595 "GSTZ1 dimer dehalogenates DCA to glyoxylate", which was deleted because it is a drug reaction. image

Some background: The existing conversion code seems to eagerly emit all reactions and their connected elements (GPs, small mols, GO terms) regardless of whether or not they are drug reactions (intended to be filtered out). The code then iterates back through the reaction list to attempt to delete drug reactions and all of their connected elements from the OWLAPI ontology obj representing the working pathway model. This works well enough for most connected elements (e.g., controller, inputs/outputs, location CC terms) because it recursively traverses the built graph through typical relations enabled_by, has_input, has_output, and occurs_in while elements connected via more complex logic and different relations are ignored by deletion. The example reaction above results in the edge glutathione -involved_in_positive_regulation_of-> [R-HSA-9011595 reaction], which of course is then ignored by the deletion code.

One proposed solution to this is to simply add this involved_in_positive_regulation_of relation (maybe all regulatory relations?) to the list of relations that the drug deletion code can traverse but I'm suspicious this will result in the deletion code leaking to the rest of the model and removing elements we intended to keep.

The other solution (which I prefer) is to clean up the flow logic a bit and simply not create/emit any drug reaction elements to the working pathway model in the first place. This removes the need to fine-tune the downstream deletion logic as well as prevents the drug reaction elements from influencing the inference rules that often look at two or more reactions together (this might bite us but we will see).

ukemi commented 1 year ago

I am fine with filtering at an earlier step. But just in case we ever wanted to include the drug reactions in a later iteration of this work, is it possible to write the filter in such a way that it could be ignored? @deustp01 are you ok with this?

dustine32 commented 1 year ago

@ukemi Oh yes, good point! We can add an option to -include_drug_rxns so that the default behavior is to filter these out but they can still be included on demand.

deustp01 commented 1 year ago

Sounds good - nothing to add form here.

dustine32 commented 1 year ago

@ukemi @deustp01 Sort of related to the bug here, I've tried testing the -include_drug_rxns option and it exposes an issue with some ID prefixes including spaces when converted to an IRI. Specifically, "Guide to Pharmacology_4518" for DCA in our example model R-HSA-204174. I believe we should really be extracting the ChEBI ID (CHEBI:28240) for this but it's in a different "BioPAX place" than expected by the code (it's in a RelationshipXref of the SmallMolecule entity instead of a UnificationXref of the SmallMolecule entity's EntityReference).

For DCA and perhaps other drugs, should we be using the ChEBI ID if available?

deustp01 commented 1 year ago

For DCA and perhaps other drugs, should we be using the ChEBI ID if available?

This looks do-able in principle (ChEBI should either already have or be able to create instances for all or drugs) but a horrible legacy clean-up because of our 1186 referenceTherapeutic instances, only 13 now have ChEBI ID's as crossReference slot values

Meanwhile 1085 have Guide to Pharmacology identifier values and 85 have PubChem identifier values. Our curation guideline (and also our collaboration with the GtoP team) mandates use of GtoP identifiers whenever possible, and GtoP in turn is willing to coin new ones for us as needed. Would it be helpful to get GtoP identifiers for all of our referenceTherapeutics? That's probably easier than getting and applying ChEBI ID's. Or does this miss the point - am I hacking around a problem or evading it rather than solving it properly?

ukemi commented 1 year ago

Ultimately, this is a decision for @deustp01, but I do notice that Reactome sometimes has both the "Guide to Pharmacology" and a ChEBI identifier for the drugs I looked at (for example in the aspirin pathway). Sometimes they don't (Azathioprine pathway. As long as we can still identify them as drugs, I think the ChEBI identifier being primary would keep things parallel with our other data representation. @deustp01, do you systematically try to assign ChEBI identifiers to drugs? I notice that in R-HSA-9751051, the 6MP has a ref to GtP 7226 and the ChEBI id 50667 is not xref'd.

ukemi commented 1 year ago

Sorry @deustp01! I think we were typing at the same time. You answered my question.

deustp01 commented 1 year ago

do you systematically try to assign ChEBI identifiers to drugs?

Posts crossed - no, it's the reverse: GtoP is mandated and ChEBI is optional.

deustp01 commented 1 year ago

a horrible legacy clean-up

There may be a workaround. I just checked a very small sample of GtoP pages, the one for aspirin and see that in the cross-reference section of that page they give the mapping to the ChEBI ID, so perhaps we can mine the information from there to populate our referenceTherapeutic instances with ChEBI IDs or to create a look-up table to be used at the stage of GO-CAM generation? (My hunch is that populating Reactome is the correct way to do it but populating GO-CAMs is easier for Reactome.)

ukemi commented 1 year ago

Also keep in mind that at the moment, we still say 'NO' to drugs. So this should get lower priority than the other bugs.

dustine32 commented 1 year ago

@deustp01 The (very) few drugs I've checked on Reactome all seem to have the ChEBI ID in their cross-references section. image image I think if this is consistent for drugs I can add some logic to dig this out, no lookup table required.

@ukemi I'll just commit the orphan fix I have right now along with my ChEBI hack even if the hack is never used.

deustp01 commented 1 year ago

As far as I can tell, referenceTherapeutic instances with ChEBI crossReference slot values are very rare (10 of 1037 instances):

Screenshot 2023-06-08 at 8 56 26 AM

Aspirin is indeed one of the rare ones:

Screenshot 2023-06-08 at 8 51 51 AM
dustine32 commented 1 year ago

@deustp01 @ukemi I just came across the drug carbovir monophosphate in pathway R-HSA-2161541. This is recognized as a drug in Reactome but does not have either a "Guide to Pharmacology" or an "IUPHAR" prefixed xref - instead, it has PubChem Compound:135565291. (Again, the space in the prefix here is killing the OWL writer)

Should I add "PubChem Compound" to the list of "is a drug" prefixes?

ukemi commented 1 year ago

@dustine32 Do you mean that all PubChem Compounds will be classified as drugs or things classified as drugs are allowed to have PubChem xrefs? I think the latter is certainly fine. @deustp01 do you know if there are 'physiological' chemicals that have a PubChem Compound xref in reactome? I found some pubchems, but am probably searching very inefficiently.

deustp01 commented 1 year ago

do you know if there are 'physiological' chemicals that have a PubChem Compound xref in reactome?

I don't know what we have actually done, but historically we allow use of PubChem Compound instances as xrefs in cased where no ChEBI instance is available, and while we have tried to clean these up by getting needed ChEBI instances, I don't know if we have completed the job. Equally important, there is nothing to prevent someone from annotating a fascinating chemical tomorrow that is unknown to ChEBI.

ukemi commented 1 year ago

So I don't think we can assume that all PubChem identifiers refer to drugs. Actually, I found some PubChem, but no other PubChem Compound cases.

deustp01 commented 1 year ago

it has PubChem Compound:135565291. (Again, the space in the prefix here is killing the OWL writer)

Are internal spaces in names always fatal to OWL? (I guess they are.) If so, are underscore characters OK, or would it be better to run the words together with capitalization? I.e., which is better for you, "PubChem_Compound", "PubChem Compound", or "PubChemCompound". At various places in Reactome, we use all three of these, as well as other characters like "(", as shown in our list of Reference database instances, but I suspect that it would not be too hard or too dangerous for us to standardize on one scheme, especially if this would bring us into compliance with an existing stable widely used community standard.

ukemi commented 1 year ago

Naive question, but what's the difference between just PubChem and 'PubChem Compound'? Is it just due to inconsistency?

ukemi commented 1 year ago

Never mind. I followed your link and answered my question.

deustp01 commented 1 year ago

PubChem and 'PubChem Compound

PubChem comes in two sections, PubChem Compound and PubChem Substance. In my mind they are sort of like SwissProt (manually curated are more or less uniquified) and TrEMBL (automatically compiled so more extensive but with no guarantee of uniqeness or consistency), respectively, so as a matter of editorial policy we should use PubChem Compond but the same resource gaps that push us to any kind of PubChem instead of ChEBI can also push us to PubChem Substance. And in fact I see 6 uses of PubChem Substance as a reference database for a reference therapeutic.

We should never use just plain "PubChem".

dustine32 commented 1 year ago

@ukemi I guess I simple-mindedly suggested that all "PubChem Compound" entities might be drugs but that's quite an assumption. Basically, I need something from this carbovir monophosphate class in the BioPAX to tell the GO-CAM conversion that it's a drug.

deustp01 commented 1 year ago

tell the GO-CAM conversion that it's a drug.

Apologies for the senior moment - we must have discussed this earlier in the "say no to drugs" process, but why doesn't the "type | chemical drug" attribute do the job. We probably need to say "no" to a few things that lack this type attribute but I expect we will want to say "no" to all that do have it?

nataled commented 1 year ago

@deustp01 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702940/

"PubChem consists of three inter-linked databases, Substance, Compound and BioAssay. The Substance database contains chemical information deposited by individual data contributors to PubChem, and the Compound database stores unique chemical structures extracted from the Substance database. Biological activity data of chemical substances tested in assay experiments are contained in the BioAssay database."

I agree that Compound is the right one to use, if possible.

dustine32 commented 1 year ago

I.e., which is better for you, "PubChem_Compound", "PubChem Compound", or "PubChemCompound".

@deustp01 Actually, I see "PubChem_Compound" is already used in the go_context.jsonld prefix file: https://github.com/prefixcommons/biocontext/blob/8f2b3226d4a5e5151fd7940ca1f775b5b767fe74/registry/go_context.jsonld#LL149C10-L149C26

So "PubChem_Compound" would maybe be preferred as it's less likely to cause ID resolving issues.

"type | chemical drug" attribute

@deustp01 Oh let me see if this is in the BioPAX. I assumed it was not.

dustine32 commented 1 year ago

@deustp01 Oh let me see if this is in the BioPAX. I assumed it was not.

Nope. I can't find anything like a type = ChemicalDrug or any text for "Drug", "drug", or "DRUG" in the BioPAX for this pathway. @deustp01 Is there some field that can be used in the BioPAX to carry this type info?

dustine32 commented 1 year ago

Tagging @adamjohnwright on this thread.