Representation of an unknown enabler in Reactome and GO-CAM

huaiyumi commented 7 months ago

There is an email trail among Huaiyu, Peter and David discussing this. Here is a summary.

In the Reactome reaction R-HSA-1980118, the "catalyst" has an xref to a ChEBI ID, CHEBI:36080, which is a protein. This is because the catalyst has not been identified, and UniProt doesn't have a term (or ID) for an unknown protein. As a workaround, Reactome assigns a ChEBI ID to it. In the GO-CAM conversion, "Ubiquitin ligase" is used as the enabler. That is the label of the catalyst in Reactome, but the actual identity of the protein is unknown, so the enabler label in GO-CAM is misleading. The correct way is probably to leave the enabler blank. In the GO-CAM spec, the cardinality of enabler can be 0, meaning unknown.
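The rule proposed above could be sketched roughly as follows. This is only an illustration, not the pathways2GO implementation: the `Catalyst` class, the `enabler_for` function, and the `UniProtKB:EXAMPLE` identifier are hypothetical names made up for the example; only CHEBI:36080 comes from the reaction discussed here.

```python
from dataclasses import dataclass
from typing import Optional

# ChEBI ID that Reactome uses as a stand-in when the catalyst protein is unidentified
GENERIC_PROTEIN = "CHEBI:36080"


@dataclass
class Catalyst:
    """Hypothetical, simplified view of a Reactome catalyst."""
    label: str
    xref: str  # CURIE of the referenced entity


def enabler_for(catalyst: Optional[Catalyst]) -> Optional[str]:
    """Return the CURIE to use as the GO-CAM enabler, or None to leave it blank.

    None corresponds to an enabler cardinality of 0, i.e. "unknown".
    """
    if catalyst is None or catalyst.xref == GENERIC_PROTEIN:
        return None
    return catalyst.xref


# The catalyst from this issue: labelled, but with no known identity
unknown = Catalyst(label="Ubiquitin ligase", xref="CHEBI:36080")
# A made-up identified catalyst (placeholder identifier, not a real accession)
known = Catalyst(label="some identified enzyme", xref="UniProtKB:EXAMPLE")

print(enabler_for(unknown))
print(enabler_for(known))
```

With this rule the Reactome label is simply not carried over as an enabler when the xref is the generic protein term, so no misleading identity ends up in the model.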

Here is the full e-mail trail, including digressions to SMBL notation and a similar problem with generic versions of other genome-encoded entities (various RNAs and DNAs):

unusual reaction.docx

ukemi commented 7 months ago

Interesting. In other places in the imports, I've seen enablers that are labeled something like 'unknown ubiquitin ligase'.

nataled commented 7 months ago

@ukemi at least that restricts the possibilities! Just having 'protein' as enabler, seems to me, would be most accurately interpreted as "any protein will do".

deustp01 commented 7 months ago

But to restrict in a consistent way, we need an ontology structure and curators and users trained in its use. It's probably easier to train curators and users to understand that "unknown protein" means just that: the activity has an enabler whose identity has not yet been discovered, and not that anything can enable it.

ukemi commented 7 months ago

Yes, and in the OWL instance world I believe that having protein there means some protein, not all proteins.

ukemi commented 7 months ago

@deustp01 If a reaction uses a protein that has isoforms, but you don't know which specific isoform is being used, do you annotate to the generic protein identifier?

deustp01 commented 7 months ago

@ukemi by default we always annotate to the canonical / default isoform specified by UniProt unless there is experimental evidence that specifies the use of a different isoform. So, yes, in effect we are annotating to the generic identifier, because we haven't really examined the possibility of isoform usage.

This is also our rationale for not routinely making sets of all of the isoforms of a UniProt entry that should be able to enable a particular function and using that set as the enabler, instead of a single EWAS instance corresponding to the canonical UniProt isoform. There is also a biology issue here like the one for paralogs. We assume that all paralogs / all isoforms are equally competent enablers, by default ignoring differences in tissue- or state-specific expression of these variant forms that might point to real differences in function (as in the case of the sets of glycolytic enzymes where all set members have the same catalytic activity but are expressed in different tissues and subject to different regulators of their activity).

This leads to problems when UniProt re-edits a SwissProt entry to change which isoform is the canonical / default one: our numbering of positions in the protein sequence (e.g. start and end coordinates, and coordinates of specific modified residues) can then be thrown off. @nataled's QA tests have enabled us to clean up (almost) all of the 20-year legacy mess caused by these UniProt - Reactome branchings, and we have just introduced a new QA check whereby any change in the checksum of a UniProt entry (which should be triggered by any of these changes in the identity of the canonical sequence) causes all EWASs / proteoforms that refer to the changed UniProt entry to be flagged for manual review.
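For readers unfamiliar with this kind of check, here is a minimal sketch of the idea, assuming checksums are available as plain strings keyed by accession. It is not Reactome's actual QA code: the function name, the dictionary layout, and the `ACC_*` / `EWAS_*` identifiers are all hypothetical.

```python
from typing import Dict, List


def ewas_to_review(
    old_checksums: Dict[str, str],
    new_checksums: Dict[str, str],
    ewas_by_uniprot: Dict[str, List[str]],
) -> List[str]:
    """List EWAS IDs whose UniProt entry checksum differs between two releases."""
    flagged: List[str] = []
    for accession, new_sum in new_checksums.items():
        old_sum = old_checksums.get(accession)
        # Entries not seen before have no stored checksum and are skipped here
        if old_sum is not None and old_sum != new_sum:
            flagged.extend(ewas_by_uniprot.get(accession, []))
    return sorted(flagged)


# Made-up data: the checksum of ACC_A changed, ACC_B is unchanged, ACC_C is new
old = {"ACC_A": "aaaa", "ACC_B": "bbbb"}
new = {"ACC_A": "a1a1", "ACC_B": "bbbb", "ACC_C": "cccc"}
ewas = {"ACC_A": ["EWAS_2", "EWAS_1"], "ACC_B": ["EWAS_3"], "ACC_C": ["EWAS_4"]}

review_list = ewas_to_review(old, new, ewas)
print(review_list)
```

The sketch only flags; deciding whether coordinates actually need renumbering is left to the manual review described above.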

geneontology / pathways2GO

Representation of an unknown enabler in Reactome and GO-CAM #303