geneontology / neo

noctua entity ontology

Some Reactome Identifiers are not resolving #91

Open ukemi opened 2 years ago

ukemi commented 2 years ago

I noticed this morning that in the Reactome models, some identifiers are being displayed as IRIs instead of strings. This is a new issue, and I assume that it was introduced with last night's NEO update? For example, in model http://noctua.geneontology.org/editor/graph/gomodel:R-HSA-196741 the input for 'Endosomal GIF:Cbl translocates to lysosome' is 'obo:go/extensions/reacto.owl#REACTO_R-HSA-3000295'. I am certain that this used to have a label. This is also a pathway that has been substantially modified in Reactome with the new release. Is it possible that the NEO update took in information about entities that have changed recently in Reactome, while the rest of the model wasn't updated, and that this is causing problems? If this is the case, then it points to us needing an SOP for large-scale data changes to models imported from external resources into the GOC framework. Perhaps these kinds of changes should be coordinated with a complete refresh of the import data. ping @deustp01

deustp01 commented 2 years ago

> This is also a pathway that has been substantially modified in Reactome with the new release.

And this very instance no longer appears in the latest version of the pathway annotated in Reactome. Could an early-2022 change in a Reactome instance propagate back into a 2021 GO-CAM model?

ukemi commented 2 years ago

Yes. I suspect that REO was updated to reflect the latest Reactome data as part of the NEO rebuild, but the models weren't updated. The result was not finding the entities in REO and reverting to the IRIs in the existing out-of-date models.

vanaukenk commented 2 years ago

Not sure how this relates to the work that happened here: https://github.com/geneontology/neo/issues/82. Either way, we need to understand better how NEO and REACTO interact, how this happened, and how soon we can get some feedback or a report on missing entities.
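A report on missing entities could start as a simple set difference between the REACTO identifiers referenced in the models and the identifiers present in the current REACTO build. The following is only a minimal sketch: the regular expression, the function name `missing_entities`, and the placeholder identifier `R-HSA-0000001` are assumptions for illustration, not part of any existing NEO or REACTO tooling.

```python
import re

# Assumed pattern for REACTO class IRIs in the compact form quoted in this issue,
# e.g. obo:go/extensions/reacto.owl#REACTO_R-HSA-3000295
REACTO_ID = re.compile(r"REACTO_(R-[A-Z]{3}-\d+)")


def missing_entities(model_text, reacto_ids):
    """Return the sorted Reactome IDs used in model_text that are absent from reacto_ids."""
    used = set(REACTO_ID.findall(model_text))
    return sorted(used - set(reacto_ids))


# Toy model text: one ID taken from this issue, one made-up placeholder.
model = (
    "input obo:go/extensions/reacto.owl#REACTO_R-HSA-3000295\n"
    "input obo:go/extensions/reacto.owl#REACTO_R-HSA-0000001\n"
)
# Pretend the current REACTO build contains only the placeholder.
current_build = {"R-HSA-0000001"}

print(missing_entities(model, current_build))
```

Run against every model after each NEO rebuild, a check like this would list the identifiers that will fall back to bare IRIs in Noctua.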

balhoff commented 2 years ago

> Yes. I suspect that REO was updated to reflect the latest Reactome data as part of the NEO rebuild, but the models weren't updated. The result was not finding the entities in REO and reverting to the IRIs in the existing out-of-date models.

@ukemi I think you're right, and that REACTO is built from the current Reactome data each time, so that if an identifier is removed from Reactome it will be removed from REACTO, rather than kept and obsoleted. Should we add something to the REACTO build that retains any missing IDs?

deustp01 commented 2 years ago

> Should we add something to the REACTO build that retains any missing IDs?

At the level of grand strategy, I suspect the answer is "no" - we should be dropping old sets of Reactome-derived GO-CAMs and reloading new sets regularly (e.g., every 3 months in synchrony with new Reactome releases), and the function of a checking tool would be to flag any discrepancies and report them back to Reactome to be fixed there, not patched on the fly in the Reactome-derived GO-CAMs. Anyway, that's how I understood our discussions.

kltm commented 2 years ago

Okay, so the way this seems to currently stand, the issue is "no label" and the fix is either

Practically speaking though, right now, I'm not seeing an action here to be taken as part of this project. While users coming in to view the model might be a little confused as to the lack of a label (there do not seem to me to be too many of those at the moment), the data is "correct" as it currently stands? I'm not sure what the implications are for identifier destruction for us--I usually assume that doesn't happen. I've added this to the agenda for this week's technical call.

ukemi commented 2 years ago

I think that @balhoff brings up an interesting point here. We are creating an ontology from something that is not an ontology. Good practice dictates, I think, that classes never just go missing. They should be obsoleted. I suspect this will extend beyond Reactome entities to other gene and protein objects as well. Since the entities used to build the ontology are all imported from either GPIs or, in this case, the Reactome BioPAX, do we want to take on the job at the NEO end of 'obsoleting' a class if it is no longer present in an import? What if they come back in a future load? Can we resurrect them?
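One way to picture the obsolete-and-resurrect idea is a registry that remembers every identifier ever imported and recomputes the obsolete flag on each load. This is a sketch under assumptions only: the registry structure, the function name `update_registry`, and the `ID:1`/`ID:2` identifiers are hypothetical, and a real build would write OWL deprecation axioms rather than a Python dict.

```python
def update_registry(registry, imported_ids):
    """Return a new registry covering every ID ever seen.

    An ID is flagged obsolete when it is missing from the current import,
    and the flag is cleared again if the ID returns in a later load.
    """
    imported = set(imported_ids)
    all_ids = sorted(set(registry) | imported)
    return {entity_id: {"obsolete": entity_id not in imported} for entity_id in all_ids}


registry = {}
registry = update_registry(registry, ["ID:1", "ID:2"])  # first load: both present
registry = update_registry(registry, ["ID:1"])          # ID:2 dropped from the source
print(registry["ID:2"])                                 # flagged obsolete, not deleted
registry = update_registry(registry, ["ID:1", "ID:2"])  # ID:2 comes back
print(registry["ID:2"])                                 # resurrected
```

The design choice here is that nothing is ever deleted from the registry, so a model that still points at a dropped identifier would find an obsolete class with a label rather than nothing at all.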

deustp01 commented 2 years ago

@ukemi @balhoff Despite what I said above about obsolete instances simply disappearing, within the Reactome data structure we track obsoletions of instances of the event and entity classes, so when one is obsoleted a "deleted" record is created to record the fact of deletion, a one-word reason (obsoleted, merged, replaced, ...) and where appropriate the dbID of the replacement instance. I don't know how much of this information gets into the BioPAX export, but that would be something to investigate.

But the whole list of every instance whose deletion has been annotated in this way is visible here. For each instance, its "(deletedInstance)" attribute points to its replacement, if any. This list has gaps where deletions and obsoletions were done without proper annotation. Current practice is better.
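If those deletion records were available to the build, following them to a live replacement might look like the sketch below. The record layout (a dict mapping a deleted dbID to a reason and an optional replacement dbID) and the function name `resolve_replacement` are assumptions made for illustration; they do not reflect the actual Reactome schema or the BioPAX export.

```python
def resolve_replacement(db_id, deleted):
    """Follow deletion records from db_id to a current instance.

    deleted maps a deleted dbID to (reason, replacement_db_id_or_None).
    Returns the current dbID, or None if the chain ends without a
    replacement or loops back on itself.
    """
    seen = set()
    current = db_id
    while current in deleted:
        if current in seen:
            return None  # cyclic records, cannot resolve
        seen.add(current)
        _reason, replacement = deleted[current]
        if replacement is None:
            return None  # deleted with no annotated replacement
        current = replacement
    return current


# Hypothetical records: 1 was replaced by 2, which was later merged into 3.
deleted = {
    "1": ("replaced", "2"),
    "2": ("merged", "3"),
    "4": ("obsoleted", None),
}
print(resolve_replacement("1", deleted))
print(resolve_replacement("4", deleted))
```

An instance that was deleted without proper annotation would simply not appear in `deleted`, so it would be returned unchanged; telling those apart from live instances would still need a check against the current release.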

vanaukenk commented 2 years ago

What about extending the GPI2.0 file format to capture things like gene model merges, e.g. a new column 'replaced by' or 'merged into'? We would have that information in WB, as I suspect other groups do as well, so in theory that could be included in the GPI2.0 file and used to update models, if desired.
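To make the proposal concrete, here is a sketch of how a consumer might read such a column. It uses a deliberately simplified tab-separated layout with a header row and a hypothetical `replaced_by` column; this is not the actual GPI 2.0 column layout, and the `EX:` identifiers are invented.

```python
import csv
import io


def replacement_map(gpi_text):
    """Map each replaced ID to its successor, skipping rows with an empty replaced_by."""
    reader = csv.DictReader(io.StringIO(gpi_text), delimiter="\t")
    return {row["id"]: row["replaced_by"] for row in reader if row["replaced_by"]}


# Simplified stand-in for a GPI file with the proposed extra column.
sample = (
    "id\tsymbol\treplaced_by\n"
    "EX:0001\tgene-a\t\n"
    "EX:0002\tgene-b\tEX:0001\n"
)
print(replacement_map(sample))
```

A model update step could then rewrite any use of a key in this map to the corresponding value, if that is the desired behaviour.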

nataled commented 2 years ago

Can PRO help here? I'm wondering how many of the Reactome identifiers used in the current set of GO-CAMs are already represented in PRO. Can someone send a list of these? I'll return that list with a mapping to PRO so we can get a handle on where we're at.

deustp01 commented 2 years ago

@nataled Right now, probably not, because the problem appears to be that some recent edits in Reactome instances put them out of synch with the June 2021 versions of those instances that are in the GO-CAM models, and that disconnect is messing things up. Once we get frequent re-builds of the GO-CAMs and with PRO IDs in use, opportunities for this kind of disconnect should mostly be eliminated.