Closed: goodb closed this issue 5 years ago
In contrast to the above, I strongly think we should continue using the full spectrum of entity classes from Reactome in the GO-CAM models displayed in Noctua while maintaining access to mappings from these classes to more generic UniProt identifiers for use in e.g. GPAD conversion.
The overall goal of the Pathways2GO project is to faithfully convert the knowledge in the Reactome kb into knowledge that is represented according to the GO-CAM model and is thus useful within the software infrastructure of the GOC (Noctua, Minerva, RDF database, GPAD export). If successful, this will have a number of positive benefits:
With that reminder of context, the problem driving this particular ticket is that the Reactome knowledge base uses concepts that are neither defined in any OBO ontology nor present in any of the non-ontological concept identifier systems used by the GOC (e.g. GPIs, UniProt). In particular, it refers constantly to complexes that are not defined anywhere outside of the Reactome knowledge base. It also refers to modified forms of proteins that are sometimes present as isoforms in UniProt and sometimes not, and it treats proteins and other physical entities as different classes depending on their localization. This violates a central premise of the software stack: all concepts or classes used in a GO-CAM model must be present in the collection of ontologies organized in, and referred to as, go-lego.owl. Technically, this file is processed when the Minerva/Noctua stack is started and provides the complete collection of identifiers, labels, and logical definitions used to run the system.
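To make the premise concrete, here is a minimal sketch of the kind of check it implies: every class used in a model has to be found among the identifiers loaded from go-lego. The function name and all identifiers below are made-up placeholders for illustration; this is not how Minerva actually performs the check.

```python
def find_unknown_classes(model_classes, known_classes):
    """Return the class identifiers used in a model that are missing from the loaded ontology set."""
    return sorted(set(model_classes) - set(known_classes))


# Placeholder identifiers, not real records.
known = {"GO:EXAMPLE_ACTIVITY", "GO:EXAMPLE_PROCESS", "UniProtKB:EXAMPLE1"}
model = ["GO:EXAMPLE_ACTIVITY", "UniProtKB:EXAMPLE1", "Reactome:EXAMPLE_COMPLEX"]

unknown = find_unknown_classes(model, known)
print(unknown)  # the Reactome-only class is the one the stack would not recognise
```

Under this view, an imported Reactome model fails the check whenever it uses a complex or modified protein form that has no entry in the loaded ontologies.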
In facing the challenge of importing knowledge that uses concepts that are not present in go-lego I see two main options (and have tried both in code):
Based on my work on this project over the past 20 months or so, I strongly feel that we need to expand go-lego to reach our goals. The entities used in and defined by Reactome, in particular complexes, represent valid biological classes that are useful for building models of gene function. Even granting that “we are the gene ontology”, complexes are still useful entities for establishing the contextual information that makes for rich causal models of individual gene function. I note that one of the core demonstrations used for the project, the Wnt pathway model, requires the use of a complex. The Reactome physical entity collection represents an excellent starting point for defining an expanded view of entity classes within go-lego.
By creating a new ontology (REO) and adding it into the go-lego collection:
ping @ukemi @deustp01 @thomaspd @cmungall for comment.
@goodb this is a very well laid-out argument and makes a lot of sense to me. This does open up the possibility of importing a variety of 'annotation objects' in the future. One point of the argument that hits home with me is the import of Reactome as it stands, with no transformations of the entities they represent. One strong advantage to keeping the Reactome entities as they are, in addition to your points above, is that in the future these models will be used as a basis for dialog and co-curation between GO and Reactome. In giving feedback to Reactome, it would be easier for a GO curator to converse with a Reactome curator without having to go through a translation from the model. In my mind the distinction between the curation environment and the 'product' keeps coming to the forefront. In the curation environment, we want to keep the data pure with respect to exactly what is being curated (at whatever level). At the product level, it would be nice to serve up the data in a number of ways, such as translation to conventional files, as Ben outlines above. That's the killer step here: the ability to deliver a conventional product from the Reactome models. In the future, I see the products from GO-CAMs evolving to where there is a visualization interface separate from Noctua, which would strictly be a curation tool. In those visualizations it would be ideal if we could display entities at whatever level a user desires: gene level, gene product level, complex level.
@deustp01 @thomaspd @cmungall @kltm
I agree with the plan.
The only thing we have to keep an eye on is ensuring that it is easy for developers to computationally traverse from an activity to the UniProt ID. We can make this easier in a number of ways; I just want to be sure we don't end up with multiple ad-hoc mechanisms.
Noting planning issue assuming above argument is settled: https://github.com/geneontology/pathways2GO/issues/71
Closing. Moving on to implementation details in #71.
It has been expressed that GO desires uniprot ids to be used within the reactome models when that is possible.
As it stands, the reactome entity ontology contains one record for each reactome entity. This means that we have a different reactome entity (class) for each different version of a protein (where version is determined by location and modifications). Hence we may have 6 reactome identifiers that all map onto one uniprot identifier, despite referring to slightly different concepts ontologically.
The proposal is to replace the use of these (e.g.) 6 different reactome identifiers with the corresponding, more general, uniprot identifier. This is currently done during the conversion of the default reactome-centric gpad export into uniprot-centric gpad export.
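As a rough sketch of what that replacement amounts to, the snippet below collapses several Reactome entity identifiers onto one UniProt identifier and drops the duplicate rows that result. The row layout is a simplified three-field tuple, not the real GPAD column layout, and the function name, the mapping, and every identifier are hypothetical placeholders. Keeping unmapped entities (e.g. complexes) unchanged is an assumption made here, not something settled in this thread.

```python
def to_uniprot_centric(rows, reactome_to_uniprot):
    """Replace Reactome entity ids with mapped UniProt ids and drop duplicate rows.

    rows: iterable of (entity_id, relation, go_term) tuples (simplified, not real GPAD).
    reactome_to_uniprot: dict from Reactome entity id to UniProt id.
    Entities without a mapping are passed through unchanged (an assumption).
    """
    seen = set()
    out = []
    for entity_id, relation, go_term in rows:
        mapped = reactome_to_uniprot.get(entity_id, entity_id)
        row = (mapped, relation, go_term)
        if row not in seen:  # several location/modification variants can collapse to one row
            seen.add(row)
            out.append(row)
    return out


# Placeholder data: two variants of the same protein plus one complex.
mapping = {
    "Reactome:PROTEIN_X_cytosol": "UniProtKB:EXAMPLE1",
    "Reactome:PROTEIN_X_nucleus_phospho": "UniProtKB:EXAMPLE1",
}
reactome_rows = [
    ("Reactome:PROTEIN_X_cytosol", "enables", "GO:EXAMPLE_ACTIVITY"),
    ("Reactome:PROTEIN_X_nucleus_phospho", "enables", "GO:EXAMPLE_ACTIVITY"),
    ("Reactome:COMPLEX_Y", "enables", "GO:EXAMPLE_ACTIVITY"),
]

uniprot_rows = to_uniprot_centric(reactome_rows, mapping)
for row in uniprot_rows:
    print(row)
```

Note what is lost in the collapse: the two input rows differ in location and modification state, and after mapping they are indistinguishable. That is the trade-off between the Reactome-centric and UniProt-centric views discussed above.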