feat: Improve reaction, metabolite, and gene annotations

JonathanRob commented 3 years ago

Description of the issue:

Human-GEM reactions, metabolites, and genes are associated with many external identifiers from databases such as KEGG, BiGG, ChEBI, and MetaNetX. However, many associations are outdated or missing. It would be great to not only update old IDs, add missing IDs, and correct any erroneous IDs, but to do so in an automated manner such that we can simply re-run the pipeline again in the future to ensure that the IDs remain current.

Expected feature/value/output:

An automated or semi-automated pipeline/script that updates old/missing/incorrect reaction, metabolite, and gene identifiers present in the reactions.tsv, metabolites.tsv, and genes.tsv annotation files. It may be beneficial to develop this in python so it can be implemented as a GitHub action.

Current feature/value/output:

Many model components are missing annotation information and/or may have outdated or incorrect annotation information.

I hereby confirm that I have:

[X] Checked that a similar issue does not exist already

haowang-bioinfo commented 3 years ago

This is certainly an important issue.

However, there appears to be no readily accessible pipeline tuned for GEM curation. We thus welcome contributions and collaborations from the research community toward such a (semi-)automated pipeline, which would also serve that community.

mihai-sysbio commented 3 years ago

update old IDs, add missing IDs, and correct any erroneous IDs

Intuitively, these look like different problems requiring independent solutions. Based on how the curation has progressed so far, it looks to me like adding missing IDs has to be done manually. A script (GH Action) could be set up to cross-reference IDs and highlight the erroneous ones, but resolving the conflicts manually might be more straightforward, at least in the near future.

re-run the pipeline again in the future to ensure that the IDs remain current

Once an ID becomes outdated, perhaps a new one can be provided by the reference database.

jorgemlferreira commented 3 years ago

Hello. I don't know if this is the right way to do this, but since this thread is about annotations: both m01778c (elaidate) and m02646c (oleic acid, oleate) share the same pubchem.compound ID (445639), whereas m01778c should be 5461071 and m02646c should be 445639. If it is necessary to open a new thread, I'll do it. It'd be a good idea to open an issue where people could report this kind of problem when detected, so that the reports can be integrated into the pipeline.

Best regards
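A check for this kind of problem could be sketched as below. This is only a minimal illustration, not an existing Human-GEM script: the column names `metsNoComp` and `metPubChemID` are assumptions about the layout of metabolites.tsv, and the rows are toy data built from the two metabolites mentioned above.

```python
from collections import defaultdict

def find_shared_ids(rows, met_col="metsNoComp", id_col="metPubChemID"):
    """Return {external_id: sorted metabolite IDs} for IDs used by more than one metabolite.

    Column names are assumptions; adjust them to the actual annotation file.
    """
    users = defaultdict(set)
    for row in rows:
        ext_id = row.get(id_col, "").strip()
        if ext_id:  # skip metabolites without this annotation
            users[ext_id].add(row[met_col])
    return {ext: sorted(mets) for ext, mets in users.items() if len(mets) > 1}

# Toy rows; the compartment suffix is dropped so that the same metabolite
# in two compartments is not reported as a conflict.
rows = [
    {"metsNoComp": "m01778", "metPubChemID": "445639"},
    {"metsNoComp": "m02646", "metPubChemID": "445639"},
    {"metsNoComp": "m02646", "metPubChemID": "445639"},
    {"metsNoComp": "m00001", "metPubChemID": ""},
]
conflicts = find_shared_ids(rows)
```

Such a check only flags candidates; deciding which metabolite keeps the ID still needs manual review.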

haowang-bioinfo commented 3 years ago

@jorgemlferreira please open an issue for this; the contribution is very welcome!

jorgemlferreira commented 3 years ago

I'll do it, thanks and keep up the good work!

pecholleyc commented 3 years ago

In order to improve the annotations, Hao and I have mapped Human-GEM reactions to Rhea. The mapping was done using 2 methods:

1) Using UniProt. Only reactions with a single Ensembl gene ID associated with only one UniProt ID were considered. From that list of UniProt IDs, only entries having a single catalytic reaction were selected. The Rhea ID provided by UniProt was then mapped to a MetaNetX ID using the MetaNetX (4.1) xref file. An association is made if the MNX ID retrieved is identical to the MNX ID stored in the reactions.tsv annotation file.

2) Using the equation. All Rhea reactions that have a direction (i.e. the undirected equations with the '=' sign were excluded) were mapped to the reaction equations of Human-GEM by comparing the metabolite names, the stoichiometry and the equation sign (<=, =>, <=>). Only perfect hits were retained. To improve the mapping, a few manually verified metabolite synonyms were considered by the algorithm: met_synonyms.txt.
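The chain of filters in the UniProt-based method can be sketched roughly as follows. This is not the actual script: all input dictionaries are hypothetical, pre-parsed lookups, and the IDs in the usage example are placeholders.

```python
def map_via_uniprot(rxn_genes, gene_uniprot, uniprot_rhea, rhea_mnx, rxn_mnx):
    """Return {reaction: rhea_id} for reactions that pass every uniqueness filter.

    All arguments are hypothetical lookups:
      rxn_genes:    reaction ID -> list of Ensembl gene IDs
      gene_uniprot: Ensembl gene ID -> list of UniProt IDs
      uniprot_rhea: UniProt ID -> list of Rhea IDs (catalytic reactions)
      rhea_mnx:     Rhea ID -> MetaNetX reaction ID (from the xref file)
      rxn_mnx:      reaction ID -> MetaNetX ID stored in reactions.tsv
    """
    mapped = {}
    for rxn, genes in rxn_genes.items():
        if len(genes) != 1:
            continue  # reaction must have exactly one gene
        proteins = gene_uniprot.get(genes[0], [])
        if len(proteins) != 1:
            continue  # gene must map to exactly one UniProt entry
        rheas = uniprot_rhea.get(proteins[0], [])
        if len(rheas) != 1:
            continue  # entry must have exactly one catalytic reaction
        mnx = rhea_mnx.get(rheas[0])
        if mnx is not None and mnx == rxn_mnx.get(rxn):
            mapped[rxn] = rheas[0]  # MNX IDs agree, so keep the association
    return mapped

# Placeholder IDs, for illustration only.
result = map_via_uniprot(
    rxn_genes={"R1": ["G1"], "R2": ["G1", "G2"], "R3": ["G3"]},
    gene_uniprot={"G1": ["P1"], "G2": ["P2"], "G3": ["P3"]},
    uniprot_rhea={"P1": ["RHEA:1"], "P3": ["RHEA:3"]},
    rhea_mnx={"RHEA:1": "MNXR1", "RHEA:3": "MNXR9"},
    rxn_mnx={"R1": "MNXR1", "R3": "MNXR3"},
)
```

Here only R1 survives: R2 has two genes, and R3 fails the final MNX ID comparison.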

The IDs mapped and the corresponding reaction equations are described here.

Only 570 Rhea IDs could be added, but the aim was to avoid false-positive hits.
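For readers who want to experiment with the equation-based comparison, a minimal sketch is given below. It is not the script used for the mapping: the equation syntax (terms separated by " + ", an optional numeric coefficient before each name) is an assumption, and the equations, reaction IDs and synonym table are toy examples.

```python
import re

# Directed equation signs only; an undirected '=' does not match and is skipped.
SIGN_RE = re.compile(r"\s(<=>|=>|<=)\s")

def parse_side(side, synonyms):
    """Turn 'ATP + 2 H2O' into a frozenset of (name, coefficient) pairs."""
    terms = {}
    for term in side.split(" + "):
        term = term.strip()
        if not term:
            continue
        head, _, rest = term.partition(" ")
        coef, name = 1.0, term
        if rest:
            try:
                coef, name = float(head), rest
            except ValueError:
                pass  # first word is part of the name, not a coefficient
        name = name.strip().lower()
        name = synonyms.get(name, name)  # synonym keys are expected in lower case
        terms[name] = terms.get(name, 0.0) + coef
    return frozenset(terms.items())

def canonical(equation, synonyms):
    """Return an order-independent key, or None for undirected equations."""
    match = SIGN_RE.search(equation)
    if match is None:
        return None
    left = parse_side(equation[:match.start()], synonyms)
    right = parse_side(equation[match.end():], synonyms)
    return (left, match.group(1), right)

def match_equations(gem_eqs, rhea_eqs, synonyms):
    """Return {gem_reaction: [rhea_ids]} for exact matches only."""
    index = {}
    for rxn_id, eq in gem_eqs.items():
        key = canonical(eq, synonyms)
        if key is not None:
            index.setdefault(key, []).append(rxn_id)
    hits = {}
    for rhea_id, eq in rhea_eqs.items():
        for rxn_id in index.get(canonical(eq, synonyms), []):
            hits.setdefault(rxn_id, []).append(rhea_id)
    return hits

# Toy data, for illustration only.
hits = match_equations(
    {"rxnA": "ATP + H2O => ADP + Pi"},
    {"rheaX": "H2O + ATP => ADP + phosphate",
     "rheaY": "ATP + H2O <=> ADP + phosphate",
     "rheaZ": "ATP + H2O = ADP + phosphate"},
    {"pi": "phosphate"},
)
```

Requiring an exact match on names, stoichiometry and sign keeps false positives low at the cost of coverage, which is consistent with the small number of IDs that could be added. The sketch does not try to recognise an equation written in the reverse direction.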

haowang-bioinfo commented 3 years ago

@pecholleyc nice addition to reaction annotation. The provided equation comparison looks promising and enables easy reviewing. Excellent work!

Some comments:

Rhea is a reliable curated source with built-in UniProt links. However, it always assigns each reaction four IDs: a master one with undefined direction and three others (<=, =>, <=>). So the mapping to Rhea is actually an association from one reaction to a group of 4, while the members are often further mapped to different databases (e.g. one to KEGG, another to Reactome) via the master ID. It might therefore be good to also include another column, rxnRheaMasterID, through which the advantages of Rhea can be fully utilised, toward more comprehensive annotation coverage.
Please also include met_synonyms.txt in this branch, but in tsv format. It hosts useful information on metabolite name mapping, which seems to be something we should start to build up.
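Filling a rxnRheaMasterID column from any member of such a group could look roughly like the sketch below. The column headers are an assumption modelled on a Rhea directions table and should be checked against the actual download; the single data row is illustrative only.

```python
import csv
import io

# Assumed column headers; verify against the real Rhea directions file.
COLUMNS = ("RHEA_ID_MASTER", "RHEA_ID_LR", "RHEA_ID_RL", "RHEA_ID_BI")

def build_master_lookup(tsv_text):
    """Map every Rhea ID in a group of four to the group's master ID."""
    lookup = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        master = row["RHEA_ID_MASTER"]
        for col in COLUMNS:
            lookup[row[col]] = master
    return lookup

# Illustrative sample with one group of four IDs.
SAMPLE = (
    "RHEA_ID_MASTER\tRHEA_ID_LR\tRHEA_ID_RL\tRHEA_ID_BI\n"
    "10000\t10001\t10002\t10003\n"
)
master_of = build_master_lookup(SAMPLE)
```

With such a lookup, a directed Rhea ID stored in rxnRheaID can be translated to its master ID before following cross-references to other databases.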

mihai-sysbio commented 3 years ago

Expected feature/value/output:

An automated or semi-automated pipeline/script that updates old/missing/incorrect reaction, metabolite, and gene identifiers present in the reactions.tsv, metabolites.tsv, and genes.tsv annotation files. It may be beneficial to develop this in python so it can be implemented as a GitHub action.

@pecholleyc if there is code to share as well that would be great.

edkerk commented 3 years ago

Regarding MetaNetX, note that since version 4.0 there is a dataset with deprecated MetaNetX IDs and their replacements (e.g. chem_depr.tsv):

#deprecated_ID  ID      version
MNXM1000    MNXM8962    3.*

while the xref dataset (e.g. chem_xref.tsv) now contains entries in the description column stating "secondary/obsolete/fantasy identifier", for instance as:

biggR:EX_sel_e      MNXR104327  Selenate exchange||1 biggM:sel@BOUNDARY = 1 biggM:sel@biggC:e
biggR:R_EX_sel_e    MNXR104327  secondary/obsolete/fantasy identifier

Note, however, that the MetaNetX data sources might not be the newest versions.
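A minimal sketch of how these two datasets could be used is shown below. It assumes tab-separated files with the columns shown in the excerpts above; the function names are hypothetical, and a real deprecation file may list several replacements for one ID, which this sketch does not handle (the last one wins).

```python
OBSOLETE_NOTE = "secondary/obsolete/fantasy identifier"

def load_deprecations(text):
    """Parse deprecated -> replacement pairs from chem_depr.tsv-style text."""
    replacements = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header/comment lines
        fields = line.split("\t")
        if len(fields) >= 2:
            replacements[fields[0]] = fields[1]
    return replacements

def resolve(mnx_id, replacements):
    """Follow replacements until a current ID is reached; stops on cycles."""
    seen = set()
    while mnx_id in replacements and mnx_id not in seen:
        seen.add(mnx_id)
        mnx_id = replacements[mnx_id]
    return mnx_id

def primary_xrefs(text):
    """Return (external_id, mnx_id) pairs, dropping secondary/obsolete entries."""
    pairs = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2].strip() != OBSOLETE_NOTE:
            pairs.append((fields[0], fields[1]))
    return pairs

# Sample lines taken from the excerpts above, with tabs as separators.
DEPR = "#deprecated_ID\tID\tversion\nMNXM1000\tMNXM8962\t3.*\n"
XREF = (
    "biggR:EX_sel_e\tMNXR104327\tSelenate exchange||1 biggM:sel@BOUNDARY = 1 biggM:sel@biggC:e\n"
    "biggR:R_EX_sel_e\tMNXR104327\tsecondary/obsolete/fantasy identifier\n"
)
current_id = resolve("MNXM1000", load_deprecations(DEPR))
kept = primary_xrefs(XREF)
```

Following the replacements in a loop matters when an ID has been deprecated more than once across MetaNetX versions.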

pecholleyc commented 3 years ago

@Hao-Chalmers, good suggestions, I have added this information in the branch. Note that for Rhea IDs mapped using only the UniProt strategy, rxnRheaID = rxnRheaMasterID.

@mihai-sysbio, unfortunately there is no working pipeline yet to close this issue, but we are aiming in that direction. I will put the 'pipeline' (it is more a collection of Python scripts) in the Sandbox repository in the following days. Maybe it can then be used as a precursor to build the pipeline. Actually, I think it would be beneficial to discuss the content of this pipeline in detail.

@edkerk Indeed, the secondary/obsolete/fantasy identifiers were discarded when parsing the xref file. But I understand that relying on MNX IDs to map back to reaction IDs in Human-GEM is a weak point in the algorithm. External databases might bring invalid information/mappings, so I would imagine the pipeline cross-validating each current or new identifier against the existing ones. I would also imagine the pipeline storing a list of manually curated IDs that would resolve any contradictions between the databases.
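One way such a cross-validation step with a curated override list might look is sketched below. This is only an illustration of the idea, not an existing component: the function name, the source label "sourceA" and the data layout are all hypothetical, while the IDs come from the PubChem example reported earlier in this thread.

```python
def reconcile(current, candidates, curated):
    """Cross-validate identifiers and apply manual overrides.

    current:    {entity: id} as stored in the annotation file
    candidates: {source_name: {entity: id}} as proposed by external databases
    curated:    {entity: id} manually verified, always takes precedence
    Returns (resolved, conflicts), where conflicts maps each entity to the
    disagreeing values per source so that a curator can review them.
    """
    entities = set(current) | set(curated)
    for ids in candidates.values():
        entities |= set(ids)

    resolved, conflicts = {}, {}
    for entity in sorted(entities):
        if entity in curated:
            resolved[entity] = curated[entity]  # manual curation wins
            continue
        votes = {}
        if current.get(entity):
            votes["current"] = current[entity]
        for source, ids in candidates.items():
            if ids.get(entity):
                votes[source] = ids[entity]
        distinct = set(votes.values())
        if len(distinct) == 1:
            resolved[entity] = distinct.pop()  # all sources agree
        elif distinct:
            conflicts[entity] = votes  # disagreement, needs review
    return resolved, conflicts

current = {"m01778": "445639", "m02646": "445639"}
candidates = {"sourceA": {"m01778": "5461071", "m02646": "445639"}}

resolved, conflicts = reconcile(current, candidates, curated={})
fixed, remaining = reconcile(current, candidates, curated={"m01778": "5461071"})
```

In the first call m01778 is reported as a conflict; once the curated entry is added, the second call resolves it without touching the external data.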

haowang-bioinfo commented 3 years ago

@pecholleyc look forward to the PR

mihai-sysbio commented 3 years ago

A few PRs have been merged since the last comment - what is the status of this issue?

haowang-bioinfo commented 3 years ago

A few PRs have been merged since the last comment - what is the status of this issue?

My view is that this is still an ongoing process.

mihai-sysbio commented 3 years ago

Indeed, but I don't see immediate actionable points. Perhaps it's time to turn this into a discussion?

haowang-bioinfo commented 3 years ago

agree!

mihai-sysbio commented 3 years ago

Closing this issue now. The discussion should continue under https://github.com/SysBioChalmers/Human-GEM/discussions/253, or maybe a new thread.
