compatibility of identifiers with BIGG database

snmendoz commented 6 years ago

Description of the issue:

Dear all,

I would like to use yeast-GEM as a template for reconstructing other yeast metabolic networks. For example, I could use RAVEN, AuReMe or MetaDraft to create draft models for my yeast from yeast-GEM. The problem is that if I wanted to use other templates in addition to yeast-GEM (for example, models from the BiGG database), the draft networks produced from each template could not be merged automatically because their identifiers are incompatible.

For me, it would be of great help if the model were written with BiGG identifiers, but I can imagine a series of problems:

It would be time-consuming to change all the identifiers.
Some metabolites in the model may have different charges from the ones in the BiGG database, so strictly speaking they are not the same because they would have a different chemical formula.
A one-to-one metabolite mapping may not always be possible, because two metabolites (alpha-D-glucose, beta-D-glucose) are sometimes represented as one in other databases (D-glucose).

@BenjaSanchez already mentioned two additional problems:

Backward compatibility: People are already using the model as it is, so new identifiers would mean that their current code won't work anymore. Possible solution: I could make a script with a dictionary, so people could convert one format to another with this script.
Missing IDs: what should we do with identifiers that are not in the BiGG database? Possible solution: create new identifiers in a systematic and automatic way.

What do you think? I already have some scripts to change identifiers, so I think it wouldn't be so problematic for me to make this change. If I change the identifiers, would it be worthwhile for other people too?

I hereby confirm that I have:

[X] Tested my code with all requirements for running the model
[X] Done this analysis in the master branch of the repository
[X] Checked that a similar issue does not exist already
[X] If needed, asked first in the Gitter chat room about the issue

BenjaSanchez commented 6 years ago

@snmendoz thanks for opening this issue! Including BiGG ids would for sure be a great contribution, and if you already have some work on it, even better. See if you can complement your matching with the MetaNetX ids we recently added for rxns and mets in https://github.com/SysBioChalmers/yeast-GEM/pull/167.

I could make a script with a dictionary, so people could convert one format to another with this script

I would say that would be the best for now; you could even add that conversion as an option in loadYeastModel.py, and store the dictionary in /ComplementaryData/databases. Feel free to try it out in a fork and let me know if you need any help.

create new identifiers in a systematic and automatic way.

I would suggest just leaving the current rxn/met id if there is no match to BIGG. Perhaps remove the first _ character, as it creates problems with memote, i.e.:

For an unmatched reaction: r_XXXX -> rXXXX
For an unmatched metabolite: s_XXXX_c -> sXXXX_c
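
A minimal sketch of that fallback rule, assuming the dictionary is a plain old-id to BiGG-id mapping with missing or empty entries for unmatched ids. The function names and the `BIGG_RXN_PLACEHOLDER` value are hypothetical, not part of the repository:

```python
import re


def fallback_id(old_id):
    """Drop the first underscore of an unmatched r_XXXX / s_XXXX id."""
    # Only ids that start with "r_" or "s_" are rewritten; anything else is kept.
    if re.match(r"^[rs]_", old_id):
        return old_id.replace("_", "", 1)
    return old_id


def convert_id(old_id, bigg_map):
    """Use the BiGG id when one is known, otherwise apply the fallback rule."""
    return bigg_map.get(old_id) or fallback_id(old_id)


# Hypothetical dictionary: one matched reaction, everything else unmatched.
example_map = {"r_0002": "BIGG_RXN_PLACEHOLDER"}
converted = [convert_id(i, example_map) for i in ["r_0001", "r_0002", "s_0001_c"]]
```

Only the first underscore is removed, so the compartment suffix of a metabolite id stays separated from the rest of the id.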

different charges, not 1-1 relationships in the databases, etc...

Maybe we can look into this later depending on how often it happens. In case of doubt, the solution of leaving the previous id could also be used.

As a final thought, note that if you add said dictionary to the repo and the coverage is good, we could 1) also implement a conversion for loadYeastModel.m, and 2) switch the default storage of the model in the repo to BiGG ids. That way we would remain backward compatible for users of the old ids, but have all model files with BiGG ids, better connecting the model to memote, cobrapy, escher, etc.

Looking forward then to your dictionary! :)

snmendoz commented 5 years ago

@BenjaSanchez would you please assign this task to me to avoid duplicated effort? I am almost done with the translation to BiGG identifiers :)

snmendoz commented 5 years ago

Hi @BenjaSanchez @hongzhonglu @feiranl

I will summarize here the modifications I made to the model, so that I can get your approval. I would appreciate your feedback. As soon as I get your approval I will make a pull request. Otherwise, I am happy to make corrections according to your suggestions.

1) I changed the compartment identifiers to match those in the BiGG database (http://bigg.ucsd.edu/compartments). In particular, "er" was changed to "r" and "p" to "x".

2) I created four fields, called "metBiGGIDs", "rxnBiGGIDs", "metSIDs" and "rxnRIDs". These are to store BiGG identifiers and the current identifiers of the model (e.g. s_0001, r_0001), respectively. This is needed to transform one version into another in an easier way.

3) I filled cells in metBiGGIDs when a synonym was found in the BiGG database. Otherwise, I kept an empty cell. I did the mapping with MNX and manual curation. I did the mapping in a way that all the metabolites were mapped to non-redundant identifiers, so the length of the array of metabolites was kept constant.

4) I developed two scripts:
a) one to match reaction equations based on reaction properties (metabolites, stoichiometry). This was based on the script https://github.com/SystemsBioinformatics/pub-data/blob/master/reconstruction-tools-assessment/pipeline/code/analysis/reactionFormulaInModel.m
b) one to find reactions with the same structure but in different compartments. Although this script was not used to map reactions to the BiGG database, it was useful to create new IDs with the same structure as others in the BiGG database, and to get metadata.

5) I used the first script to fill the cells in rxnBiGGIDs. A manual revision was performed when two or more reactions were found in the BiGG database for the same reaction in the consensus model. The mapping was done in a way that reactions were mapped to non-redundant identifiers. Duplicated reactions were kept in the model (issue #187). When a duplicated pair matched the BiGG database, one of the duplicated reactions received the BiGG ID and the other got a new ID. In addition, we used the second script to suggest IDs for some reactions that have the same structure as reactions in BiGG but occur in other compartments, as I mentioned before.

6) New identifiers were assigned to unmapped metabolites and reactions. For the moment, these metabolites were not included in the model but an Excel file was created.

7) I created a script to transform from the current identifiers to BiGG identifiers. This function has two modes of operation. In the first mode, it transforms just the metabolites and reactions which have a translation to BiGG identifiers. In the second mode, it transforms the entire model using the Excel files created in step 6. This second mode of operation could be useful to create a model with new, BiGG-like identifiers, so that the generated model can be submitted to the BiGG database. I strongly recommend submitting the model to the BiGG database. It would increase the size of the database by ~10%, which is quite high. This would strengthen the importance of the consensus model, because BiGG would grow significantly as a knowledgebase.
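
To make the two modes of step 7 concrete, here is a rough sketch that assumes both the BiGG dictionary and the table of newly created ids are available as plain Python dicts. All names and placeholder ids are hypothetical and this is not the actual script:

```python
def translate_ids(ids, bigg_map, new_id_map=None):
    """Translate a list of ids.

    Mode 1 (new_id_map is None): only ids with a BiGG translation change.
    Mode 2 (new_id_map given): ids without a BiGG translation get a new id.
    """
    translated = []
    for old_id in ids:
        new_id = bigg_map.get(old_id)
        if not new_id and new_id_map is not None:
            new_id = new_id_map.get(old_id)
        # Ids with no translation at all are left unchanged.
        translated.append(new_id or old_id)
    return translated


def invert(mapping):
    """Build the reverse dictionary; refuse mappings that are not one-to-one."""
    filled = {old: new for old, new in mapping.items() if new}
    inverse = {new: old for old, new in filled.items()}
    if len(inverse) != len(filled):
        raise ValueError("mapping is not one-to-one; cannot invert it")
    return inverse


# Hypothetical data: one reaction with a BiGG match, one without.
ids = ["r_0001", "r_0002"]
bigg_map = {"r_0001": "BIGG_A", "r_0002": ""}
new_id_map = {"r_0002": "NEW_B"}

mode1 = translate_ids(ids, bigg_map)
mode2 = translate_ids(ids, bigg_map, new_id_map)
round_trip = translate_ids(mode2, invert({"r_0001": "BIGG_A", "r_0002": "NEW_B"}))
```

Because the mapping is kept non-redundant, the same function can run in the reverse direction with the inverted dictionary, which is what makes the "vice-versa" conversion possible.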

In general terms, I created a script to make the dictionary and another to transform the model from the current identifiers to BiGG, and vice-versa.
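
For the dictionary-building side (steps 4a and 5), the idea of matching reactions by metabolites and stoichiometry, then checking that the result is non-redundant, could be sketched as follows. This is not the linked MATLAB script: it assumes reactions are given as `{metabolite: coefficient}` dicts that already share one metabolite namespace, and every name here is hypothetical:

```python
from collections import defaultdict


def signature(stoich):
    """Direction-independent key for a reaction {metabolite: coefficient}."""
    forward = frozenset(stoich.items())
    backward = frozenset((met, -coef) for met, coef in stoich.items())
    # Pick one orientation deterministically so A -> B and B -> A share a key.
    return min(forward, backward, key=sorted)


def build_index(reference):
    """Group reference reaction ids by their stoichiometric signature."""
    index = defaultdict(list)
    for rxn_id, stoich in reference.items():
        index[signature(stoich)].append(rxn_id)
    return index


def find_matches(stoich, index):
    """Return all reference ids with the same metabolites and stoichiometry."""
    return list(index.get(signature(stoich), []))


def redundant_targets(mapping):
    """Return target ids that were assigned to more than one model id."""
    by_target = defaultdict(list)
    for old_id, new_id in mapping.items():
        if new_id:  # empty means "no match found"
            by_target[new_id].append(old_id)
    return {t: olds for t, olds in by_target.items() if len(olds) > 1}


# Hypothetical reference database and one model reaction written in reverse.
reference = {"REF1": {"a_c": -1, "b_c": 1}, "REF2": {"a_c": -1, "c_c": 1}}
index = build_index(reference)
matches = find_matches({"b_c": -1, "a_c": 1}, index)
clashes = redundant_targets({"r_1": "REF1", "r_2": "REF1", "r_3": "", "r_4": "REF2"})
```

A query that returns more than one match, or a target that shows up in `redundant_targets`, corresponds to the cases that were resolved by manual revision.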

This is a group contribution. Contributions are listed according to the CRediT taxonomy: Sebastian Mendoza: Data Curation, Software. Bas Teusink: Resources, Supervision.

hongzhonglu commented 5 years ago

@snmendoz very nice work! One comment here,

Regarding 'For the moment, these metabolites were not included in the model': these metabolites could also affect the model function, so how did you evaluate that?

snmendoz commented 5 years ago

@hongzhonglu sorry. What I meant was that the new BiGG identifiers (for those metabolites which do not have a translation to the BiGG database) were not included in the model. The metabolites in the model are the same as before.

BenjaSanchez commented 4 years ago

update: BiGG ids were added in #188 and an option for switching ids in #224. This issue will be closed on the next release.

SysBioChalmers / yeast-GEM

compatibility of identifiers with BIGG database #172
