UniProt versus reactome entity id issue

goodb commented 5 years ago

It has been expressed that GO desires uniprot ids to be used within the reactome models when that is possible.

As it stands, the reactome entity ontology contains one record for each reactome entity. This means that we have a different reactome entity (class) for each different version of a protein (where version is determined by location and modifications). Hence we may have 6 reactome identifiers that all map onto one uniprot identifier, despite referring to slightly different concepts ontologically.

The proposal is to replace the use of these (e.g.) 6 different reactome identifiers with the corresponding, more general, uniprot identifier. This is currently done during the conversion of the default reactome-centric gpad export into uniprot-centric gpad export.
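
A minimal sketch of that collapsing step, for illustration only. It assumes a hypothetical lookup table from Reactome entity ids to UniProt accessions and GPAD-like tab-separated rows whose first two columns hold the database prefix and the object id. The ids, `REACTOME_TO_UNIPROT`, `to_uniprot_row`, and `convert` are all made up here and are not the actual pathways2GO export code.

```python
# Hypothetical mapping: several Reactome entity ids (different locations or
# modifications of one protein) map to a single UniProt accession.
REACTOME_TO_UNIPROT = {
    "R-HSA-0000001": "P00001",  # made-up ids for illustration
    "R-HSA-0000002": "P00001",
    "R-HSA-0000003": "P00002",
}


def to_uniprot_row(row, mapping=REACTOME_TO_UNIPROT):
    """Rewrite one tab-separated GPAD-like row to use a UniProt id.

    Assumes column 0 is the database prefix and column 1 the object id.
    Rows with no known mapping are returned unchanged.
    """
    cols = row.split("\t")
    if cols[0] == "Reactome" and cols[1] in mapping:
        cols[0] = "UniProtKB"
        cols[1] = mapping[cols[1]]
    return "\t".join(cols)


def convert(rows):
    """Convert rows and drop exact duplicates created by the collapse."""
    seen = set()
    out = []
    for row in rows:
        new_row = to_uniprot_row(row)
        if new_row not in seen:
            seen.add(new_row)
            out.append(new_row)
    return out
```

Note that the collapse is lossy: two rows that differed only in which Reactome entity they named become a single UniProt-level row.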

goodb commented 5 years ago

In contrast to the above, I strongly think we should continue using the full spectrum of entity classes from Reactome in the GO-CAM models displayed in Noctua while maintaining access to mappings from these classes to more generic UniProt identifiers for use in e.g. GPAD conversion.

The overall goal of the Pathways2GO project is to faithfully convert the knowledge in the Reactome kb into knowledge that is represented according to the GO-CAM model and is thus useful within the software infrastructure of the GOC (Noctua, Minerva, RDF database, GPAD export). If successful, this will have a number of positive benefits:

1. The Reactome knowledge can be made available to consumers of the GO-CAM model collection in a coherent manner that does not require them to process it any differently than the rest of the models. This will effectively double the size of the GO-CAM collection, providing much greater value to the community of e.g. next generation gene set enrichment tool developers and their users.
2. The Reactome models can be made available to GO curators:
   - As starting points or templates for building new models (e.g. the same pathway but in a new species.)
   - As building blocks for creating new models (e.g. re-using well known functions or processes)
   - As examples for how the leadership of the GOC wants certain kinds of things structured
3. The process for completing this conversion can be used to inform imports from many other pathway knowledge sources (e.g. OpenCyc, WikiPathways, Pathway Commons, etc.)

With that reminder of context, the problem driving this particular ticket is that the Reactome knowledge base uses concepts that are neither defined in any OBO ontology nor present in any of the non-ontological concept identifier systems used by the GOC (e.g. GPIs, UniProt). In particular, they refer constantly to complexes that are not defined anywhere outside of their knowledge base. They also refer to modified forms of proteins that are sometimes present as isoforms in UniProt and sometimes not, and they treat proteins and other physical entities as different classes depending on their localization. This violates a central premise of the software stack: all concepts or classes used in a GO-CAM model must be present in the collection of ontologies referred to as, and organized in, go-lego.owl. Technically, this file is processed when the Minerva/Noctua stack is started and provides the complete collection of identifiers, labels, and logical definitions used to run the system.

In facing the challenge of importing knowledge that uses concepts that are not present in go-lego I see two main options (and have tried both in code):

1. Attempt to represent the concepts at the level of the instance graph itself. For example, we can try to represent the concept of a Reactome complex as an instance of the generic GO:Protein-containing complex and add some knowledge of that concept into the model by adding has_part relationships between the complex node and the proteins and other molecules that make it up.
   - After many, many iterations, we found this process to be unsatisfactory because there were too many cases that simply were not a natural match to instance graph modeling. The final straw was the challenge of handling the idea of a Set in Reactome. In a Reactome set, any member of the set could be used to fill the role in the model. Without creating new instances for each member of each set, and thus dramatically expanding the size and decreasing the coherency of the models, we could not capture this information in the instance graph. See #61 for the discussion and the decision to seek a different solution.
2. Import the missing concepts into the go-lego ontological universe. As described in #61, this is the current approach taken within the Reactome to GO-CAM project. By creating a new OWL ontology (let's call it REO) that accurately and completely captures the missing concepts from Reactome and adding it to go-lego, we can construct models that work perfectly with the existing software stack including reasoner, editor, viewer and databases. There are still problems with this approach:
   - The new ontology (here REO) needs to be accepted by the GOC as a full member of go-lego. If not, the products of existing software (e.g. the GPAD export from Noctua, the RDF database) will contain unapproved concepts.
   - Even if REO gains acceptance, we likely want to work out simplified mappings to a generic, commonly accepted gene identifier space to ease use of the data at the GPAD level.
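
To give a feel for why sets strained the instance-graph option, here is a rough sketch. It assumes, purely for illustration, that a complex can contain several sets and that every combination of members would need its own concrete instance, in which case the count grows as the product of the set sizes. The component names and the `expand_complex` helper are hypothetical, not project code.

```python
from itertools import product


def expand_complex(components):
    """Enumerate every concrete complex implied by a complex with sets.

    `components` is a list where each item is either a single entity id
    (a string) or a set of interchangeable member ids (a list of strings).
    """
    # Treat a plain entity as a set with exactly one member.
    choices = [c if isinstance(c, list) else [c] for c in components]
    return [tuple(combo) for combo in product(*choices)]


# Made-up example: one fixed protein plus two sets of interchangeable members.
complex_with_sets = ["proteinA", ["setB1", "setB2", "setB3"], ["setC1", "setC2"]]
variants = expand_complex(complex_with_sets)
print(len(variants))  # 1 * 3 * 2 = 6 concrete complexes
```

Representing the set once as a class in the ontology avoids materializing each of these variants in every model that uses the complex.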

Based on my work on this project over the past 20 months or so, I strongly feel that we need to expand go-lego to reach our goals. The entities used in and defined by Reactome, in particular complexes, represent valid biological classes that are useful for building models of gene function. Even granting that “we are the gene ontology”, complexes are still useful entities for establishing the contextual information that makes for rich causal models of individual gene function. I note that one of the core demonstrations used for the project, the Wnt pathway model, requires the use of a complex. The Reactome physical entity collection represents an excellent starting point for defining an expanded view of entity classes within go-lego.

By creating a new ontology (REO) and adding it into the go-lego collection:

- we can use expressive OWL logic to capture the information in Reactome in a lossless manner
- we can use the same logic as well as existing mappings to align the Reactome entities with other OBO ontologies like PRO and databases like UniProt and Complex Portal
- we can define the ontology such that tools with standard OWL inference capabilities can provide access to information in GO-CAM models at different levels of granularity. As an example in the gene space, REO has classes for specific genes like “Reactome:R-HSA-947607” that are defined as subclasses of more generic entities from UniProt, e.g. uniprot:A5LHX3. When a model states that an instance of Reactome:R-HSA-947607 enables some function, an inference-capable computing environment will recognize that that instance is also an instance of uniprot:A5LHX3. This means that queries for annotations at the generic level of uniprot:A5LHX3 would yield annotations to the more specific representations subclassed to it. This pattern, as opposed to immediately converting the Reactome gene id into the more generic UniProt id, allows us to answer the same query for annotations in the same way while we maintain the additional information (e.g. modifications, localization, mappings to other databases) on the Reactome-specific subclass in the ontology.
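
The subclass pattern can be sketched without an OWL reasoner, as a plain transitive walk over subclass axioms. This is only an illustration of the idea: the second Reactome id, the GO terms attached to each entity, and the helper names `ancestors` and `annotations_for` are assumptions made for the example, and a real deployment would rely on the reasoner in the Minerva/Noctua stack instead.

```python
# Subclass axioms as a child -> parents map. The first entry comes from the
# example above; the second id is made up for illustration.
SUBCLASS_OF = {
    "Reactome:R-HSA-947607": ["uniprot:A5LHX3"],
    "Reactome:R-HSA-0000000": ["uniprot:A5LHX3"],  # hypothetical sibling form
}

# Annotations as (entity class, relation, GO term). The GO terms are
# placeholders chosen for the example, not real annotations.
ANNOTATIONS = [
    ("Reactome:R-HSA-947607", "enables", "GO:0003674"),
    ("Reactome:R-HSA-0000000", "enables", "GO:0003824"),
]


def ancestors(cls, subclass_of=SUBCLASS_OF):
    """Return cls plus every superclass reachable through subclass axioms."""
    seen = {cls}
    stack = [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotations_for(query_cls, annotations=ANNOTATIONS):
    """Find annotations whose entity is query_cls or any subclass of it."""
    return [a for a in annotations if query_cls in ancestors(a[0])]
```

Querying `annotations_for("uniprot:A5LHX3")` returns both annotations, while querying the specific Reactome class returns only its own, so the generic view is available without discarding the specific one.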

ping @ukemi @deustp01 @thomaspd @cmungall for comment.

ukemi commented 5 years ago

@goodb this is a very well-laid-out argument and makes a lot of sense to me. This does open up the possibility of importing a variety of 'annotation objects' in the future. One point of the argument that rings home with me is the import of Reactome as it stands, with no transformations of the entities they represent. One strong advantage to keeping the Reactome entities as they are, in addition to your points above, is that in the future these models will be used as a dialog for co-curation between GO and Reactome. In giving feedback to Reactome, it would be easier for a GO curator to converse with a Reactome curator without having to go through a translation from the model. In my mind the distinction between the curation environment and the 'product' keeps coming to the forefront. In the curation environment, we want to keep the data pure with respect to exactly what is being curated (at whatever level). At the product level, it would be nice to serve up the data in a number of ways, including translation to conventional files as Ben outlines above. That's the killer step here: the ability to deliver a conventional product from the Reactome models. In the future, I see the products from GO-CAMs evolving to where there is a visualization interface separate from Noctua, which would strictly be a curation tool. In those visualizations it would be ideal if we could display entities at whatever level a user desires: gene level, gene product level, complex level.

@deustp01 @thomaspd @cmungall @kltm

cmungall commented 5 years ago

I agree with the plan.

The only thing we have to keep an eye on is ensuring that it is easy for developers to computationally traverse from an activity to the UniProt ID. We can make this easier in a number of ways; I just want to be sure we don't end up with multiple ad-hoc mechanisms.

goodb commented 5 years ago

Noting planning issue assuming above argument is settled: https://github.com/geneontology/pathways2GO/issues/71

goodb commented 5 years ago

Closing. Moving on to implementation details in #71.

geneontology / pathways2GO

UniProt versus reactome entity id issue #70