geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Test load of Mouse and Fly Reactome Pathways #268

Open ukemi opened 1 year ago

ukemi commented 1 year ago
kltm commented 1 year ago

@ukemi About how many new models are we talking about?

ukemi commented 1 year ago

Good question. No more than twice the number of existing ones. Perhaps @dustine32 could give a more precise answer.

deustp01 commented 1 year ago

A strict upper bound is the same number for each species as we now generate for human. The actual number will be smaller wherever our script can't find any model organism proteins at all to try to infer the counterpart of a human model, but my best guess is that not many models will be lost that way.

kltm commented 1 year ago

@dustine32 Can you check my numbers? We have 43k models. Eyeballing, about 2k of them are Reactome models. With a 2x upper bound, we are talking about on the order of 4k new models. That's on the order of a 10% increase, which, while unlikely to put us over any limits (the only way to find out is live testing), is also non-trivial and adds drag to pipeline processing and Noctua. I believe we should be fine here (proof will be in the testing), but with 10% here and 10% there, we're going to start running into issues eventually.

ukemi commented 1 year ago

1879 models from Reactome on production. So at most 3758 models added.
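As a rough check of the figures quoted in this thread (1879 Reactome models on production, two new species, and the eyeballed total of about 43k models), the arithmetic can be sketched as below. The function name is made up for illustration, and the 43k total is an approximation taken from the comment above, not a measured count.

```python
def projected_increase(reactome_models, new_species, total_models):
    """Return (upper bound on added models, fractional increase over the current total)."""
    # Upper bound: one inferred model per existing Reactome model, per new species.
    added = reactome_models * new_species
    return added, added / total_models


# 43_000 is the approximate total quoted in the thread (an assumption, not an exact count).
added, fraction = projected_increase(1879, 2, 43_000)
print(f"at most {added} new models, about {fraction:.1%} more than today")
```

With these inputs the upper bound works out to a bit under 9%, consistent with the "order 10%" estimate.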

deustp01 commented 1 year ago

The megalomaniacal projection is one set of human-derived GO-CAMs for every Alliance model organism. In the real world, it will take a while to get there (more like years than months) and many of the GO-CAMs, especially for species distant from human, will always be incomplete (as this has been defined for the Reactome-derived human GO-CAMs).

A reasonable expectation is that the major use of these models will be as templates that will be reviewed and edited by curators to generate GO-CAMs suitable for public release.

dustine32 commented 1 year ago

@kltm Your numbers sound right. ~2k human models so ~2k each for fly and mouse unless there are fun exceptions that split or multiply pathways for these organisms, which there probably are.

deustp01 commented 1 year ago

> unless there are fun exceptions that split or multiply pathways

That shouldn't happen. Reactome does the inference by taking a human pathway, asking whether there are model organism orthologs of the human proteins associated with the pathway and, if so, creating a version of the pathway in which the human proteins are replaced with their model organism counterparts (or with a gap when there is no counterpart).
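A minimal sketch of that substitution step, assuming a pathway is simply a list of human protein identifiers and orthology is a plain lookup table. The function name and the identifiers are placeholders, and this is not Reactome's actual code; it only illustrates why one human pathway yields at most one inferred pathway.

```python
def infer_pathway(human_proteins, orthologs):
    """Replace each human protein with its ortholog, or None (a gap) if there is none.

    Returns None when no protein has an ortholog, meaning no model is produced.
    """
    inferred = [orthologs.get(protein) for protein in human_proteins]
    if all(protein is None for protein in inferred):
        return None
    return inferred


# Placeholder identifiers, not real accessions.
human_pathway = ["HUMAN_A", "HUMAN_B", "HUMAN_C"]
mouse_orthologs = {"HUMAN_A": "MOUSE_A", "HUMAN_C": "MOUSE_C"}
print(infer_pathway(human_pathway, mouse_orthologs))
```

Each input pathway maps to one output or to none, so the count per species cannot exceed the human count under this scheme.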

The splitting would come in later, when an expert curator (looking at you @ukemi) looks at the template and decides that the biology of the model organism is better represented by breaking the one human-derived GO-CAM into several.