SysBioChalmers / Human-GEM

The generic genome-scale metabolic model of Homo sapiens
https://sysbiochalmers.github.io/Human-GEM-guide/
Creative Commons Attribution 4.0 International

missing reaction names #181

Open pecholleyc opened 4 years ago

pecholleyc commented 4 years ago

Description of the issue:

A large number of reactions in the model do not have a descriptive name.

Expected feature/value/output:

More reactions with descriptive names in the model.

Current feature/value/output:

8200+ of 13400+ reactions have no name.

Reproducing these results:

search for - name: ""\n - metabolites in the .yml
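The search above can be sketched as a small script. This is only an illustration: the YAML layout in `sample` is an assumption inferred from the search string (an empty `name` entry immediately followed by a `metabolites` entry), and `count_unnamed_reactions` is a hypothetical helper, not something that exists in the repository.

```python
import re

# Matches an empty reaction name directly followed by the metabolites entry.
# Assumption: metabolite records also have a "name" key, but it is not
# followed by "- metabolites", so they are not counted.
EMPTY_NAME = re.compile(r'- name: ""[ \t]*\r?\n[ \t]*- metabolites')


def count_unnamed_reactions(yaml_text: str) -> int:
    """Count reactions whose name field is an empty string."""
    return len(EMPTY_NAME.findall(yaml_text))


# Illustrative snippet only; identifiers and layout are made up.
sample = '''\
- reactions:
  - !!omap
    - id: "R1"
    - name: ""
    - metabolites: !!omap
      - m1: -1
  - !!omap
    - id: "R2"
    - name: "ethanol:NAD+ oxidoreductase"
    - metabolites: !!omap
      - m1: -1
  - !!omap
    - id: "R3"
    - name: ""
    - metabolites: !!omap
      - m2: 1
'''

print(count_unnamed_reactions(sample))
```

To run it on the real model, read the .yml file into a string and pass it to the function.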

Most of the current reaction names in the model are identical to the BiGG or Recon3D annotation. But using the BiGG / Recon3D, KEGG and Reactome external identifiers, I estimate that 3500+ additional reaction names could be imported into the model (based on v1.3).

I think names could then also be curated or auto-generated by considering the equation and/or the EC codes of the enzymes associated with the reactions.


haowang-bioinfo commented 4 years ago

@pecholleyc nice to have this issue.

This should be a long-term thing and it will take some time to fully resolve the reaction names. Might be good to begin with 1-2 external id groups for importing the names.

mihai-sysbio commented 2 years ago

While I fully support the idea behind this issue, I don't have a straightforward suggestion here. It feels like there is no "ground truth" database to be used for the reaction names in a way that would resolve a majority of the empty names.

My opinion is that, if possible, this should be scripted in a way that it can be run repeatedly.

haowang-bioinfo commented 2 years ago

Can't agree more

mihai-sysbio commented 2 years ago

I'm at the point where I think any names would be better than the 8000+ reactions with no names.

One way to do this would be to fetch the names in KEGG (example). Alternatively, the names can be fetched based on the E.C. code (example). That sounds trickier, since over 7600 reactions have an empty eccode field and other entries have multiple E.C. codes.
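A minimal sketch of the KEGG route, under several assumptions: that the KEGG REST endpoint `https://rest.kegg.jp/list/reaction` returns one tab-separated line per reaction with the name(s) and equation separated by semicolons, that the first semicolon-separated part is a usable name, and that each model reaction is represented as a dict with hypothetical `name` and `kegg_id` keys. The sample lines use made-up identifiers.

```python
import urllib.request

# Assumed endpoint; check the KEGG REST documentation and usage terms first.
KEGG_LIST_URL = "https://rest.kegg.jp/list/reaction"


def parse_kegg_reaction_list(text):
    """Map KEGG reaction id -> first listed name from a tab-separated listing."""
    names = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        entry, description = line.split("\t", 1)
        rxn_id = entry.split(":")[-1].strip()  # tolerate an "rn:" style prefix
        names[rxn_id] = description.split(";", 1)[0].strip()
    return names


def fetch_kegg_reaction_names(url=KEGG_LIST_URL):
    """Download the listing and parse it (network access required)."""
    with urllib.request.urlopen(url) as response:
        return parse_kegg_reaction_list(response.read().decode("utf-8"))


def fill_empty_names(reactions, kegg_names):
    """Fill empty names in place; return how many reactions were updated."""
    filled = 0
    for rxn in reactions:
        if not rxn["name"] and rxn.get("kegg_id") in kegg_names:
            rxn["name"] = kegg_names[rxn["kegg_id"]]
            filled += 1
    return filled


# Illustrative input only; ids and names are hypothetical.
sample_listing = (
    "R90001\tfirst example name; A + B <=> C\n"
    "rn:R90002\tsecond example name\n"
)
sample_reactions = [
    {"id": "rxnA", "name": "", "kegg_id": "R90001"},
    {"id": "rxnB", "name": "kept name", "kegg_id": "R90002"},
    {"id": "rxnC", "name": "", "kegg_id": ""},
]
kegg_names = parse_kegg_reaction_list(sample_listing)
print(fill_empty_names(sample_reactions, kegg_names))
```

Existing names are never overwritten, so the script can be re-run after the KEGG mapping is extended.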

Any thoughts?

mihai-sysbio commented 2 years ago

I hope it's okay to ping @haowang-bioinfo and @JonathanRob to discuss the idea mentioned above: reactions in KEGG have names. Would it make sense to programmatically use KEGG as a source for reaction names?

haowang-bioinfo commented 2 years ago

@mihai-sysbio do you have other suggested sources besides KEGG?

mihai-sysbio commented 2 years ago

@mihai-sysbio do you have other suggested sources besides KEGG?

If we were to use the E.C., there should definitely be other sources (above, I linked to BRENDA). Personally I like the E.C.-based names more since they are more generic in a way (shorter, thus easier to read). However, I believe this should follow only after a curation of the E.C. codes. Moreover, over half of the reactions do not have such codes, and some have multiple. Because of this, I think the approach taken in #367 by using KEGG-provided names is the most reasonable solution we can adopt at the moment.

JonathanRob commented 2 years ago

@mihai-sysbio I'm hesitant about using an E.C.-based approach, since the E.C. number does not necessarily specify the reaction substrates. So in many cases you can have an E.C. that represents a type of reaction, in which many different substrates can participate. If an E.C.-based naming approach was applied to the model, my guess is that it would result in many reactions being assigned similar names.

mihai-sysbio commented 2 years ago

many reactions being assigned similar names

Interesting - do reaction names really need to be unique? I was counting on the uniqueness of the identifiers for that, and the names would be just a more readable/user-friendly string.

JonathanRob commented 2 years ago

They do not need to be unique, but they also should not be super general (to the point where hundreds of reactions have the same name - I'm thinking this is something that may happen with cholesterol or lipid metabolism, for example). But then again, maybe many identical reaction names is still better than no name at all?
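Whether a naming scheme ends up too general could be checked with a simple count of how often each name is reused. This is a sketch; `overused_names` and the threshold are hypothetical choices, and the names below are made up.

```python
from collections import Counter


def overused_names(names, threshold):
    """Return names shared by at least `threshold` reactions (empty names ignored)."""
    counts = Counter(name for name in names if name)
    return {name: n for name, n in counts.items() if n >= threshold}


# Illustrative list of candidate reaction names.
candidate_names = ["acyltransferase", "acyltransferase", "acyltransferase",
                   "hexokinase", "", ""]
print(overused_names(candidate_names, threshold=3))
```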

haowang-bioinfo commented 2 years ago

using KEGG-provided names is the most reasonable solution we can adopt at the moment

agree and have the same feeling that many identical reaction names is better than no name at all - KEGG reaction names are not very general.

another advantage is that this can be programmatically implemented

mihai-sysbio commented 2 years ago

There are only 2423 KEGG ids in reactions.tsv - perhaps it would make more sense to extend the coverage via the MNX ids before mapping the names?

edit: with an updated KEGG mapping it might be more tempting to retrieve updated EC codes in addition to reaction names also via KEGG, thus dealing with #366

haowang-bioinfo commented 1 year ago

Came up with an idea to move this long-term goal one step further:

The plan is to first locate reactions that are catalyzed by only one gene (i.e. single-gene reactions), then go through these reactions and fill in empty reaction names using the gene names extracted from the genes.tsv file, which is based on Ensembl annotation.
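The plan could look roughly like the sketch below. It assumes that gene associations are available as rule strings using `and`/`or` and parentheses, and that a gene-id-to-name lookup has already been built from genes.tsv. The dict keys (`id`, `name`, `rule`) and the helper names are hypothetical, as are the sample ids.

```python
import re


def genes_in_rule(rule):
    """Return the set of gene ids mentioned in a gene-reaction rule string."""
    tokens = re.findall(r"[^\s()]+", rule)
    return {t for t in tokens if t.lower() not in ("and", "or")}


def name_single_gene_reactions(reactions, gene_names):
    """Propose names for unnamed reactions associated with exactly one gene."""
    proposals = {}
    for rxn in reactions:
        if rxn["name"]:
            continue  # never overwrite an existing name
        genes = genes_in_rule(rxn["rule"])
        if len(genes) == 1:
            (gene,) = genes
            if gene in gene_names:
                proposals[rxn["id"]] = gene_names[gene]
    return proposals


# Illustrative data only.
gene_names = {"G1": "example kinase 1", "G2": "example transporter 2"}
reactions = [
    {"id": "rxn1", "name": "", "rule": "G1"},
    {"id": "rxn2", "name": "", "rule": "(G1 and G2)"},
    {"id": "rxn3", "name": "already named", "rule": "G2"},
    {"id": "rxn4", "name": "", "rule": ""},
]
print(name_single_gene_reactions(reactions, gene_names))
```

Returning proposals instead of editing the model directly leaves room for a manual review step before names are written back.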

feiranl commented 1 year ago

So where do those reactions come from? Is there no reaction name in their origin?

haowang-bioinfo commented 1 year ago

So where do those reactions come from? Is there no reaction name in their origin?

they were inherited from HMR2 where reactions have no names originally

JonathanRob commented 1 year ago

Earlier it was suggested that we should have some scripted way to do this so that it could be run repeatedly. I've thought about it and don't think that this is necessary. The name of a reaction is not really something that needs to be updated very often, if at all. So even a one-shot, fairly manual approach to filling in the reaction names should be sufficient.

feiranl commented 1 year ago

We can try to map all external IDs to get as many reaction names as possible. For exchange and pseudo reactions, we can just assign reaction names such as "Exchange glucose", "transport glucose from c to m", or "pseudo reaction". May I know the coverage of reactions with at least one external database ID such as KEGG/MetaNetX?
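Template-based names for exchange and simple transport reactions could be generated along these lines. This sketch assumes each reaction is given as a list of (metabolite name, compartment, coefficient) tuples, which is a hypothetical representation, and it only handles the two simplest cases; everything else is left unnamed.

```python
def auto_name(stoich):
    """Return a template name for exchange / single-metabolite transport, else ''."""
    if len(stoich) == 1:
        # A single metabolite with no counterpart is treated as an exchange.
        return f"Exchange {stoich[0][0]}"
    substrates = [(m, c) for m, c, k in stoich if k < 0]
    products = [(m, c) for m, c, k in stoich if k > 0]
    if len(substrates) == 1 and len(products) == 1:
        (s_name, s_comp), (p_name, p_comp) = substrates[0], products[0]
        # Same metabolite on both sides, different compartments: transport.
        if s_name == p_name and s_comp != p_comp:
            return f"Transport {s_name} from {s_comp} to {p_comp}"
    return ""


print(auto_name([("glucose", "e", -1)]))
print(auto_name([("glucose", "c", -1), ("glucose", "m", 1)]))
```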

mihai-sysbio commented 1 year ago

I guess that's a quick pandas/Excel question - 5885 reactions have no KEGG/MetaNetX/Rhea id mapped in reactions.tsv.
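For reference, the same count can be reproduced without pandas. The column names below (`rxns`, `rxnKEGGID`, `rxnMetaNetXID`, `rxnRheaID`) are assumptions about the reactions.tsv header and should be checked against the actual file; the sample rows are made up.

```python
import csv
import io

# Assumed column names; verify against the header of reactions.tsv.
ID_COLUMNS = ("rxnKEGGID", "rxnMetaNetXID", "rxnRheaID")


def reactions_without_ids(tsv_text, id_columns=ID_COLUMNS):
    """Return ids of reactions where all listed external-id columns are empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["rxns"]
        for row in reader
        if not any((row.get(col) or "").strip() for col in id_columns)
    ]


# Illustrative table only.
header = ["rxns", "rxnKEGGID", "rxnMetaNetXID", "rxnRheaID"]
rows = [
    ["rxn1", "R90001", "MNXR90001", ""],
    ["rxn2", "", "", ""],
    ["rxn3", "", "", "99999"],
]
sample_tsv = "\n".join("\t".join(r) for r in [header] + rows)
print(reactions_without_ids(sample_tsv))
```

For the real file, pass the contents of reactions.tsv and take `len()` of the result.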

haowang-bioinfo commented 1 year ago

5885 reactions have no KEGG/MetaNetX/Rhea id mapped in reactions.tsv.

Among these 5800+ reactions, 1700+ are single-gene reactions, so their names could be assigned via their gene names.