geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Remove (don't load) models that are a single node #202

Open ukemi opened 1 year ago

ukemi commented 1 year ago

In some cases, a Reactome pathway doesn't have any reactions that are directly associated with it. Instead, it has a collection of subpathways under it. In those cases, the parent gets imported as a single node with nothing else associated with it. We should not load these, e.g. R-HSA-71291.

deustp01 commented 1 year ago

Here's a weedy suggestion for how to proceed. Agreed that a pathway with no content except subpathways yields a GO-CAM with no informative content, but before simply discarding these it would be prudent to get a list of all the single-node pathways for manual inspection to confirm that nothing is lost. I expect that everything on the list will be OK for removal. Even where a curator has made one of these as a placeholder and plans to fill in individual reaction children along with the pathway children, when and if that happens the pathway will then pass the rule proposed here and will get loaded OK.
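A minimal sketch of the rule plus the review list, assuming a simplified in-memory representation. The `Pathway` class, the `partition_for_loading` function, and the `EX-...` identifiers are hypothetical illustrations and are not part of pathways2GO or the Reactome data model:

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical stand-in for a Reactome pathway (not the real data model)."""
    stable_id: str
    name: str
    reactions: list = field(default_factory=list)    # reactions attached directly
    subpathways: list = field(default_factory=list)  # child Pathway objects


def partition_for_loading(pathways):
    """Split pathways into those to load and grouping-only ones to review."""
    to_load, to_review = [], []
    for pathway in pathways:
        # A pathway with no direct reactions would be imported as a single node.
        if pathway.reactions:
            to_load.append(pathway)
        else:
            to_review.append(pathway)
    return to_load, to_review


# Placeholder data: a child with one reaction and a parent that only groups it.
child = Pathway("EX-CHILD", "child with a reaction", reactions=["EX-REACTION"])
parent = Pathway("EX-PARENT", "grouping only", subpathways=[child])

to_load, to_review = partition_for_loading([parent, child])

# Tab-separated report of skipped pathways for manual inspection.
for pathway in to_review:
    print(f"{pathway.stable_id}\t{pathway.name}\t{len(pathway.subpathways)} subpathways")
```

Because the check is re-evaluated on every build, a placeholder pathway that later gains direct reactions would move from the review list to the load list without any further change.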

And a naive question. Is the Reactome event hierarchy somehow preserved in the exported GO-CAM structure? I guess that it is not and in that case, these empty grouping pathways do not have a useful linking role.

ukemi commented 1 year ago

> And a naive question. Is the Reactome event hierarchy somehow preserved in the exported GO-CAM structure? I guess that it is not and in that case, these empty grouping pathways do not have a useful linking role.

This is a great question that has now set me thinking. The Reactome hierarchy is not preserved because of the inability to discriminate is_a from part_of (another thing that I think we could brainstorm about at a face-to-face; I think you had some good ideas about this). However, let's say that there is a Reactome pathway that has no reactions, but only pathways as children. If the parent pathway has an asserted GO BP term mapped and none of the children do, it would be safe to put the parent pathway's BP term on the generic children. It doesn't matter if the child is a subclass or a part of the parent because we won't represent that. The parent BP will just go to the new top node of the model. I'm not sure how many of these exist, but I think I've seen some.
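The condition described above could be sketched roughly as follows. The function name `inherited_bp_term`, its parameters, and the `GO:EXAMPLE` placeholder are assumptions made for illustration only; this is not how pathways2GO currently behaves:

```python
from typing import List, Optional


def inherited_bp_term(
    parent_bp: Optional[str],
    parent_reactions: List[str],
    child_bps: List[Optional[str]],
) -> Optional[str]:
    """Return the parent's BP term if it may be placed on the children, else None.

    The term is passed down only when the parent is grouping-only (no direct
    reactions), has an asserted BP term, and no child has a BP term of its own.
    """
    grouping_only = not parent_reactions
    has_children = len(child_bps) > 0
    no_child_has_bp = all(bp is None for bp in child_bps)
    if grouping_only and parent_bp is not None and has_children and no_child_has_bp:
        return parent_bp
    return None


# Placeholder term; None stands for "no BP term asserted on that child".
print(inherited_bp_term("GO:EXAMPLE", [], [None, None]))
print(inherited_bp_term("GO:EXAMPLE", [], [None, "GO:OTHER"]))
```

The sketch deliberately ignores whether a child is a subclass or a part of the parent, since that distinction is not represented in the exported model.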

To follow up, Peter sent an e-mail to Guanming:

On the pathways2GO side, it would be really useful to make this distinction – for example, “Glucose metabolism” is_a “Metabolism of carbohydrates”, and “Glycolysis” is_a “Glucose metabolism”, but both the pathway “Regulation of Glucokinase by Glucokinase Regulatory Protein” and the reactionlikeEvent “HK1,2,3,GCK phosphorylate Glc to form G6P” are part_of “Glycolysis”. Right now all pathways are connected to their contained events by a single relation, hasEvent. At the level of the data model, how hard or dangerous would it be to replace this single relation with two, so that pathways can have either is_a or part_of relationships or both to their contained events? If this change at the level of the data model seems OK, then we can begin to think about how to handle the legacy clean-up of existing pathways. This will certainly be a very big job and if the data model change is OK, then David and I can work with curators to look for ways to make it as easy as possible. I guess / hope that most pathways will contain only one kind of children (is_a or part_of) but we will need to look very carefully.

who replied:

If you recall, an is_a relationship existed originally in our old data model, probably about 15 years ago or longer. At a certain point, in order to keep our model simpler, we basically lumped both the has_a (called hasComponent, if I remember correctly) and is_a (isMember?) relationships into this hasEvent slot. Now hasEvent is overloaded with both meanings. It is doable to spin off hasEvent into another isEvent relationship for some containing pathways. However, this may bring a lot of headaches for both visualization (e.g. showing isA pathways and the reactions contained there differently from hasA pathways) and data analysis (e.g. pathway enrichment analysis: how to split isA pathways from the others). So it is really a can of worms.

to which Peter replied:

One idea, not really worked through, to mention and add to the GitHub ticket for future discussion before I forget. The first suggestion was “top-down”: annotate a parent event to indicate kinds of children. That breaks current Reactome web displays and data mining as Guanming said. A fairly clunky alternative might be “bottom-up”: an event has an optional slot to indicate the pathways of which it is an instance and another to indicate the pathways of which it is a part.
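A rough sketch of what the bottom-up idea might look like, assuming two optional slots on each event alongside the existing top-down hasEvent list. The `Event` class, the slot names `instance_of` and `part_of`, and `classify_links` are hypothetical and do not exist in the Reactome data model; the example names come from the glucose metabolism example earlier in this thread:

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical event record; slot names are illustrative only."""
    name: str
    has_event: list = field(default_factory=list)  # existing top-down slot (child names)
    instance_of: set = field(default_factory=set)  # assumed bottom-up is_a slot
    part_of: set = field(default_factory=set)      # assumed bottom-up part_of slot


def classify_links(events):
    """Label every hasEvent link using the child's optional bottom-up slots."""
    by_name = {event.name: event for event in events}
    links = {}
    for parent in events:
        for child_name in parent.has_event:
            child = by_name[child_name]
            # If a child lists the parent in both slots, is_a is reported first.
            if parent.name in child.instance_of:
                links[(parent.name, child_name)] = "is_a"
            elif parent.name in child.part_of:
                links[(parent.name, child_name)] = "part_of"
            else:
                links[(parent.name, child_name)] = "unspecified"
    return links


reaction_name = "HK1,2,3,GCK phosphorylate Glc to form G6P"
events = [
    Event("Glucose metabolism", has_event=["Glycolysis"]),
    Event("Glycolysis", has_event=[reaction_name], instance_of={"Glucose metabolism"}),
    Event(reaction_name, part_of={"Glycolysis"}),
]

for (parent_name, child_name), relation in classify_links(events).items():
    print(f"{child_name} {relation} {parent_name}")
```

Tools that only read `has_event` would see the same hierarchy as before, since the extra slots are purely additive in this sketch.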

Guanming:

The bottom-up approach may still leave us quite a lot to handle on the Reactome side from the perspective of software tools: how to visualize newly added is_a pathways on the web site, how to exclude or include them for data analysis (e.g. gene set enrichment analysis), and how to export them in other formats (e.g. gene sets for MSigDB, NCBI, BioPAX, etc.).

Peter:

I’m imagining that the bottom-up annotations of reactions would be in addition to the top-down “hasEvent” annotations of pathways, and data analysis and web layout tools could ignore them. We would need to capture them in BioPAX, though.

Guanming:

To the best of my knowledge, BioPAX does not support an is_a relationship or make a distinction between has_part and is_a. One thing we may try is to use GO’s is_a and part_of relationships by overlaying them onto Reactome’s events.

deustp01 commented 1 year ago

Now deferred from the list of things to do in connection with the GO-CAM build from Reactome 82 - make this a headache for another day.

ukemi commented 1 year ago

QA for @ukemi and @deustp01. Once this is done, it should only eliminate the grouping pathways that have no reactions associated with them.