geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Reactome Release #85 #215

Closed ukemi closed 11 months ago

ukemi commented 1 year ago

@dustine32 @deustp01 I spent the day, when not in meetings, QCing the models against the tickets we decided to cover. I'm willing to give this a thumbs up, keeping in mind the caveat about the reactions that are outside the main pathway. I'm going to toss this back into your court, @dustine32, for the ShEx checks.

dustine32 commented 1 year ago

@ukemi Thank you for the quick testing!

Here is the ShEx report for this run: main_report.txt

Looks like 161 models fail so hopefully it's an obvious systemic cause that's easy to fix.

ukemi commented 1 year ago

Checking the file:

  1. No logical errors
  2. 161 ShEx violations, investigated here: https://docs.google.com/spreadsheets/d/1nNl0DtKTKB9zcg1359CKN3wYoS6Tmxxll3XaDatDkW0/edit#gid=0 NB: the March 8 report had 175, so this is a slight improvement, but an improvement nonetheless. In fact, these numbers may be off: several models pass the ShEx check in Noctua (see the spreadsheet). Is it possible that your check has something slightly out of date, @dustine32?
  3. Common error: unresolved ChEBI identifiers. @deustp01, we should take a closer look at these. The failure to resolve has two possible explanations: either they haven't been loaded yet, or they are not the correct pH 7.3 species.

deustp01 commented 1 year ago

Say NO to drugs: Row 19 in the Main report spreadsheet

null | APAP ADME - imported from: Reactome | http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9753281 | development | [GOC:reactome_curators] | [https://reactome.org]

... is a drug metabolism pathway, as are all children of the "[Drug ADME](https://reactome.org/PathwayBrowser/#/R-HSA-9748784)" superpathway. Someday, even if drugs remain out of scope for GO-CAM, we may want to see if useful in-scope xenobiotic metabolism can be mined from these pathways but for now, I think it would be OK to exclude the entire "Drug ADME" superpathway from the collection used as inputs for GO-CAM, in the same way that we exclude all children of the "Disease" superpathway.

ukemi commented 1 year ago

Interesting dilemma that we should discuss. GO does consider response to drugs in scope at the moment, eg response to cisplatin. Would this fit into acceptable then? It's not the action of a drug. I'm not sure what the current plan is. I do think at some point someone is going to see that the drug pathways will have a use, even if it's outside of the scope of GO.

dustine32 commented 1 year ago

Is it possible that your check has something slightly out of date

@ukemi Yeah, it could be I need to update minerva or it may be using a cached ShEx spec. I'm taking a look now!

dustine32 commented 1 year ago

@ukemi Ok, (sigh) it was an older (by a few days) version of the ShEx spec causing issues. Specifically, the has_small_molecule_activator-type relations weren't in it.

I'm rerunning the checks after also updating minerva and will have new, better results soon!

deustp01 commented 1 year ago

GO does consider response to drugs in scope at the moment, eg response to cisplatin. Would this fit into acceptable then?

True - drug "ADME" (absorption, distribution, metabolism, excretion) may well be in scope for GO. But if we go in that direction, children of GO:0042221 response to chemical, "Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a chemical stimulus," may be less good than children of GO:0006805 xenobiotic metabolic process, "The chemical reactions and pathways involving a xenobiotic compound, a compound foreign to the organism exposed to it. It may be synthesized by another organism (like ampicillin) or it can be a synthetic chemical." The definition of GO:0042221, to me, implies phenomena in the realm of physiology / homeostasis - how does the organism restore a steady state after a chemical perturbation? - while GO:0006805 lets us focus narrowly on metabolic processes that occur in response to the introduction of a foreign substance, maybe with some help from children of GO:0042908 xenobiotic transport.

This sounds like a strategy question and maybe of broad enough interest to be discussed on an ontology call. Also scope creep for this ticket - maybe a new "say maybe to drugs" ticket?

deustp01 commented 1 year ago

I'm rerunning the checks after also updating minerva and will have new, better results soon!

Will this have any effect on the lists of unexpected ChEBI instances in the current version of the main table? Or (I guess) is this a separate issue, and all those ChEBI IDs still need to be sorted out?

ukemi commented 1 year ago

It should not sweep the violators under the rug. It should only filter out the models that I see passing the ShEx check when I run it 'live' from Noctua. The ones with chemical violators still come up as 'Invalid'. @deustp01, is the column I'm filling in on the spreadsheet useful to you? I'm doing some other QC today, but if it is useful I will continue to fill it out.

dustine32 commented 1 year ago

@ukemi @deustp01 Here is the new, fixed ShEx report: main_report.txt

Only 86 fails!

@deustp01 Right, the unrecognized ChEBI classes are a separate issue and shouldn't change with the fixed ShEx report.

deustp01 commented 1 year ago

Here is the new, fixed ShEx report: main_report.txt

I'm confused. The previous "Main Report June-22-2023" spreadsheet has a column (column J) "suspected reason for failure". In the latest version main_report.txt, this column is gone. But this is the column I was using to get the ChEBI IDs of the molecules that need to be investigated to find wrong charge states and stereochemistry, and stealth drugs. Should I continue to work from the June 22 spreadsheet?

Also, @ukemi, what version of the spreadsheet are you working on and what column are you putting comments in? Or are those comments things like the entry in row 5 of the June 22 spreadsheet "This one passes when I run the reasoner in Noctua?????", in which case I'm already looking in the right place.

For now, I will work from ChEBI numbers in column J of the June 22 spreadsheet and build my own document that lists each ChEBI ID, what Reactome pathway / GO-CAM model it is associated with, what problem I found, and how I resolved it in the Reactome central database. OK?

ukemi commented 1 year ago

That's the column I added and was editing by hand. I guess that answers my question about whether it was useful. :)

ukemi commented 1 year ago

@deustp01, I'll continue to work in the old spreadsheet, but will use the new one as a guide. If the column is blank, it passes and is excluded from the new run. Does that make sense?

deustp01 commented 1 year ago

Does that make sense?

Yes, definitely. Also, do I remember right that not all ChEBI instances are included in NEO, but only a subset of plausible ones? That would explain row 2 in the table, where we annotated a weird small molecule product, CHEBI:142614 - 5-guanidinohydantoin, generated when a modified base is removed from damaged DNA. Would the fix then be to add CHEBI:142614 to the list of permitted small molecules?

ukemi commented 1 year ago

You remember correctly. Chemicals are loaded only if we have needed to use them up to this point. So I suspect the failures will be of two flavors: ones that are correct but that we never needed before, and ones where the charge state, etc., is wrong and the correct form is available. Once you have vetted the violators, we will send a list of the needed chemicals to @balhoff so we can include them in the ChEBI load.

ukemi commented 1 year ago

@deustp01 I'm getting through the list. I stopped at the Lewis Blood group today. I should be able to finish the rest by tomorrow before I head out.

deustp01 commented 1 year ago

@ukemi I'm at row 69 - KEAP1-NFE2L2 pathway - many, many small molecules with wrong charges where we will need to ask for a new ChEBI instance to get it right, as well as many cases where I expect the needed ChEBI is not on Jim's list. Is there an easy way to check that list ourselves? I'm keeping my notes so far in an Excel spreadsheet on my laptop just because that makes moving between windows easy. If it's helpful, I could make my results so far into a third sheet of the "Main Report" Google doc, and add new rows as I progress.

ukemi commented 1 year ago

Kind of a hacky solution, but go to the Noctua interface and open a model; a Reactome one will do. Click on "add individual" over at the left. Enter the ChEBI identifier and see if it autocompletes to the chemical. If it does, then NEO knows about it. If it doesn't, then we will have to add it.

ukemi commented 1 year ago

@deustp01 It's more work, but are you also keeping Rhea in the loop with these? We might as well make sure that that coordination stays in alignment or it may come back to haunt us.

deustp01 commented 1 year ago

Are you also keeping Rhea in the loop with these?

My fantasy is that corrections propagate into the Reactome released database, yielding reactions that now match Rhea, and these matches propagate into Rhea. The reality is that the first step happens reliably, and matching to Rhea and propagation of the match into Rhea are do-able. We (you, me, Dustin) need to have a conversation with Adam Wright to figure out how to get the do-able stuff to happen, maybe bringing in Alan Bridge from Rhea to help with figuring out what we need to export for Rhea to pick up.

In a big majority of the charge state problems I'm finding, ChEBI doesn't even have a term for the pH 7.3 form - ChEBI made the existing term in response to a pH-ignorant request from us - so a first step will be to get ChEBI terms.

Onwards!

deustp01 commented 1 year ago

Once you have vetted the violators, we will send a list of the needed chemicals to @balhoff so we can include them in the ChEBI load.

This is much better than checking them ourselves in Noctua, guessing that processing such a list is easy for @balhoff . Any requests for what goes in the list besides the ChEBI ID, and for how it's formatted beyond one ID per line of plain text file?

ukemi commented 1 year ago

Makes perfect sense. I still would love to see the 3-resource alignment. I just finished the list. There were a couple of models that still seemed to pass when I ran the reasoner on my end. I'm going to try to figure out what's happening with the transporters and the ShEx violations. It looks like some things are failing because chemicals aren't being recognized as chemicals. They are resolving to ChEBI identifiers that are recognized, but still throwing a violation. Just so we have a record, here is how I check the things on this report:

  1. Go to the model in the graph editor.
  2. Run the reasoner and check for the Invalid classification.
  3. Scan across the model for individuals with the red box in the upper left corner.
  4. If I see a ChEBI identifier that doesn't resolve, I list it as unresolved in the spreadsheet. That means it's not recognized by GO.
  5. If the problem is something else, I click on the red box and look at the explanation of the violation. I'm getting pretty good at figuring them out. Ben would be proud of me.
  6. Just to be absolutely sure, I modify the model by deleting the offending entity, keeping the reasoner on. If the red box goes away, I know I'm right. I then revert the model and save it.

ukemi commented 1 year ago

This is much better than checking them ourselves in Noctua, guessing that processing such a list is easy for @balhoff . Any requests for what goes in the list besides the ChEBI ID, and for how it's formatted beyond one ID per line of plain text file?

They get added to an import text file in this format:

http://purl.obolibrary.org/obo/XXX_0000001 ## Optional Label

e.g.

http://purl.obolibrary.org/obo/CHEBI_64835 ## 1,6-kestotetraose
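For the record, here is a minimal sketch of how a vetted list of IDs could be turned into lines of that shape. It assumes the IDs are written as CURIEs such as CHEBI:64835; the function name `to_import_line` is made up for illustration and is not part of any existing GO or Noctua tooling.

```python
import re

# Base of the OBO PURLs used in the import-file format quoted above.
OBO_PREFIX = "http://purl.obolibrary.org/obo/"

# Assumption: IDs arrive as CURIEs of the form PREFIX:digits, e.g. CHEBI:64835.
CURIE_PATTERN = re.compile(r"^([A-Za-z]+):(\d+)$")


def to_import_line(curie, label=None):
    """Hypothetical helper: build one import-file line from a CURIE.

    The label is optional, matching the '## Optional Label' part of the format.
    """
    match = CURIE_PATTERN.match(curie.strip())
    if match is None:
        raise ValueError(f"not a recognizable CURIE: {curie!r}")
    prefix, local_id = match.groups()
    line = f"{OBO_PREFIX}{prefix}_{local_id}"
    if label:
        line += f" ## {label}"
    return line


if __name__ == "__main__":
    print(to_import_line("CHEBI:64835", "1,6-kestotetraose"))
    print(to_import_line("CHEBI:142614", "5-guanidinohydantoin"))
```

Rejecting anything that does not look like a CURIE is a deliberate choice here, so that a stray spreadsheet cell does not silently end up in the import file.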

We can probably just open an ontology ticket and ask an editor to do it. I haven't done anything like this in so long, I'm a bit unsure of myself. Plus opening the ticket and tagging it would create a record of our work.

deustp01 commented 1 year ago

I still would love to see the 3-resource alignment.

Complete agreement here - this is essential, so all the maneuvering is aimed at using us effectively to get there.

deustp01 commented 1 year ago

In a big majority of the charge state problems I'm finding, ChEBI doesn't even have a term for the pH 7.3 form - ChEBI made the existing term in response to a pH-ignorant request from us - so a first step will be to get ChEBI terms.

I've finished my review, summarized in two new worksheets added to the main report Google doc. The first sheet lists each GO-CAM, in the order they are listed on the first worksheet, for which there were ChEBI-related issues, with a separate row for each ChEBI instance that includes my guess as to what's going on. I expect that in most cases, the ChEBI IDs are not on Jim's list to build into NEO.

But many of these IDs, as noted, also now point to molecules that are not in their correct pH 7.3 charge state. My opinion is that rather than temporarily populating NEO with incorrect ChEBI molecules to support current Rhea-noncompliant Reactome models, we should fix Reactome. A complication is that in most cases ChEBI does not yet have an entry for the pH 7.3 form of the molecule, so a first step will be to get the needed ChEBI instances, then fix Reactome and map the fixed reactions to Rhea, then fix NEO.

In the second added worksheet, I sorted the first one on the ChEBI ID column and edited to get a list of all 153 ChEBI IDs we are concerned with - substantial work, but realistic to get done perhaps even in time for the next Reactome release if ChEBI IDs are easy to get. @ukemi @dustine32 sanity check please.
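The sort-and-deduplicate step that produced the second worksheet was done by hand, but it could also be sketched in a few lines. This assumes the worksheet cells are available as plain strings and that IDs are written as CHEBI:digits; the name `unique_chebi_ids` is hypothetical.

```python
import re

# Assumption: IDs appear in the notes as CHEBI:<digits>, in any letter case.
CHEBI_PATTERN = re.compile(r"CHEBI:(\d+)", re.IGNORECASE)


def unique_chebi_ids(cells):
    """Hypothetical helper: collect each ChEBI ID once, sorted numerically.

    `cells` is any iterable of strings, e.g. one worksheet column.
    """
    numbers = set()
    for cell in cells:
        # A single cell may mention several IDs, so scan for all of them.
        for match in CHEBI_PATTERN.finditer(cell):
            numbers.add(int(match.group(1)))
    return [f"CHEBI:{number}" for number in sorted(numbers)]


if __name__ == "__main__":
    example_cells = [
        "CHEBI:142614 - 5-guanidinohydantoin",
        "CHEBI:64835; CHEBI:142614",
        "no identifier in this cell",
    ]
    for chebi_id in unique_chebi_ids(example_cells):
        print(chebi_id)
```

Sorting on the numeric part rather than the string keeps CHEBI:64835 ahead of CHEBI:142614, which matches how a spreadsheet sorts a numeric column.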