add reactions based on KEGG and MetaCyc annotations

cheng-yu-zhang commented 2 years ago

Main improvements in this PR:


First, construct two draft models using RAVEN Toolbox. The model files are in Saccharomyces_cerevisiae_draftmodel_kegg and Saccharomyces_cerevisiae_draftmodel_metacyc.
Then compare the draft models with yeast8 to find the new reactions.
For the new reactions from step 2, check whether they are reasonable using MetaCyc, YeastCyc, UniProt, SGD and KEGG.
Get the final new reactions.
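
The comparison in step 2 can be sketched roughly as follows. This is a minimal illustration, not the script used in this PR: it assumes each draft reaction carries a database reaction ID and that reference-model reactions are annotated with such IDs. The function name and the IDs (DB_RXN_1 etc.) are hypothetical.

```python
from typing import Dict, List, Set


def candidate_new_reactions(
    draft_ids: Set[str],
    reference_annotations: Dict[str, Set[str]],
) -> List[str]:
    """Return draft reaction IDs that no reference reaction is annotated with.

    draft_ids: database reaction IDs found in a draft model (hypothetical).
    reference_annotations: reference reaction ID -> annotated database IDs.
    """
    # Collect every database ID that the reference model already covers.
    covered = set().union(*reference_annotations.values())
    return sorted(draft_ids - covered)


# Toy data, purely illustrative.
draft = {"DB_RXN_1", "DB_RXN_2", "DB_RXN_3"}
reference = {"r_a": {"DB_RXN_1"}, "r_b": {"DB_RXN_3", "DB_RXN_9"}}
candidates = candidate_new_reactions(draft, reference)
```

An ID-based comparison only yields candidates. A reference reaction without an annotation would show up as "new" even if it is already in the model, so the manual check in step 3 is still needed.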

I hereby confirm that I have:

[x] Tested my code with all requirements for running the model
[x] Selected develop as a target branch (top left drop-down menu)
[x] If needed, asked first in the Gitter chat room about this PR

edkerk commented 2 years ago

@cheng-yu-zhang could you please explain a bit what exactly was done in this PR (and the other two that you opened)? For example, where did you get the information from, why did you make these changes, and were there any special cases or considerations? What solved the growth problem that you encountered?

cheng-yu-zhang commented 2 years ago

@edkerk that's my fault, I will add more detailed information.

edkerk commented 2 years ago

I have reorganized the data, to fit #302.

[x] Make sure that there are no more duplicated metabolites (3f1646d)
[x] Make sure that there are no unused metabolites (26b55d8)
[x] There should be much more metadata provided in DBnewRxnsMets.tsv, DBnewRxnsRxns.tsv and now also DBnewRxnsGenes.tsv
[x] Many of the subsystems are very unique, and this does not make sense in accordance to #11 and #307.
[x] How is the compartmental localization determined?

But overall, I'm not convinced whether all these reactions should be included. What criteria were used to include them? What experimental evidence is there to support them? [To facilitate this, I changed the layout of the yeast-GEM.txt file (cb966bc, using exportForGit), which makes for easier diff-ing in 25b724b.]

Some examples:

rxnID	reaction equation	grRule
r_4855	oxygen[c] + Melatonin[c] => Formyl-N-acetyl-5-methoxykynurenamine[c]	YJR078W
r_4810	oxygen[c] + Serotonin[c] => Formyl-5-hydroxykynurenamine[c]	YJR078W

These are probably not correct. The breakdown of melatonin and serotonin, which are not yeast metabolites, has the same EC number as the reaction from tryptophan to N-formyl-kynurenine, which is a reaction in NAD biosynthesis. Actually, there are four reactions in this map with the same EC number, but only one of these is part of a functional pathway.

There are more examples like this, also based on MetaCyc. So how were these reactions selected?
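
One way to surface such cases for manual review is to flag reactions that introduce metabolites the model does not yet contain. The sketch below is only an illustration under assumptions: the `known` set is a hypothetical stand-in for the model's metabolite list, and the parser assumes the equation layout used in the tables of this thread.

```python
import re
from typing import List, Set


def metabolites(equation: str) -> Set[str]:
    """Extract metabolite names, without coefficient or compartment tag."""
    names = set()
    for side in re.split(r"\s*<?=>\s*", equation):
        for term in side.split(" + "):
            term = term.strip()
            term = re.sub(r"^\d+(\.\d+)?\s+", "", term)  # drop leading coefficient
            term = re.sub(r"\[\w+\]$", "", term)         # drop compartment tag
            if term:
                names.add(term)
    return names


def unknown_metabolites(equation: str, known: Set[str]) -> List[str]:
    """Return metabolites of the equation that are absent from `known`."""
    known_lower = {k.lower() for k in known}
    return sorted(m for m in metabolites(equation) if m.lower() not in known_lower)


# Hypothetical subset of existing model metabolites.
known = {"oxygen", "L-tryptophan", "N-formyl-kynurenine"}
flagged = unknown_metabolites(
    "oxygen[c] + Melatonin[c] => Formyl-N-acetyl-5-methoxykynurenamine[c]", known
)
```

A flagged reaction is not necessarily wrong, but every metabolite it introduces deserves a check of whether it is a plausible yeast metabolite.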

Then there are also other problematic reactions. The following two reactions are modifying proteins, which is outside the scope of a metabolic network. Moreover, they are actually half-reactions of pyruvate dehydrogenase and alpha-ketoglutarate dehydrogenase (both already in the model, and associated with the same genes). So no need to include these:

rxnID	reaction equation	grRule
r_4833	coenzyme A[m] + pyruvate-dehydrogenase-acetylDHlipoyl[m] => acetyl-CoA[m] + pyruvate-dehydrogenase-dihydrolipoate[m]	YNL071W
r_4834	succinyl-CoA[m] + N6-dihydrolipoyl-L-lysine[m] <=> coenzyme A[m] + N6-S-succinyldihydrolipoyl-L-lysine[m]	YDR148C

There are other reactions that act on non-specific substrates:

rxnID	reaction equation	grRule
r_4755	2 H+[c] + H2O[c] + L-Selenocystathionine[c] => ammonium[c] + pyruvate[c] + Selenohomocysteine[c]	YGL184C or YHR112C or YFR055W
r_4835	H2O[c] + S-Substituted-L-Cysteines[c] => ammonium[c] + pyruvate[c] + Thiols[c]	YGL184C or YFR055W

There has been some discussion about including non-specific substrates (#219), but these genes are already associated with existing reactions (r_0308), so there is no value in including them as non-specific reactions.

There are also examples of fluorinated and chlorinated compounds that would not occur in S. cerevisiae.

Overall: The list of new reactions should be carefully curated, to make sure that the models that are added make sense. More reactions is not per se better, even if it would not directly affect some of the model metrics (predicted growth rate, gene essentiality etc.).

cheng-yu-zhang commented 2 years ago

@edkerk Is there any issue about the new reactions that I need to fix?

edkerk commented 2 years ago

I have refactored the script and location of datafiles to match the generic curation format introduced in #313. See code/modelCuration/v8_6_1.m for how the model curation is performed.

I reiterate the last sentence of the previous comment: The list of new reactions should be carefully curated, to make sure that the models that are added make sense. More reactions is not per se better, even if it would not directly affect some of the model metrics (predicted growth rate, gene essentiality etc.).

So you should go through the list of reactions 1-by-1 and manually check whether they make sense. You uploaded draft models from KEGG and MetaCyc, but there is no explanation given which reactions are then included and why. I quickly looked through the new reactions, and found some more issues:

Double check that it is not duplicate of an existing reaction.

rxnID	reaction equation	grRule
r_0916	ATP[c] + ribose-5-phosphate[c] => AMP[c] + H+[c] + PRPP[c]	(YKL181W and YER099C) or (YKL181W and YHL011C) or (YKL181W and YBL068W) or (YER099C and YOL061W) or (YBL068W and YOL061W)
r_4723	ATP[c] + D-ribose 5-phosphate[c] <=> AMP[c] + H+[c] + 5-Phospho-alpha-D-ribose 1-diphosphate[c]	YBL068W or YHL011C or YER099C or YOL061W or YKL181W

The first reaction was already present. The second reaction has different metabolite names, but it represents the same reaction. This also highlights that there are duplicate metabolites; without them, the duplicate reaction would have been easier to spot.

Double check that there are no duplicate metabolites.

See above, even if the reaction would not have been duplicate, then ribose-5-phosphate and D-ribose 5-phosphate are highly likely the same metabolite, so make sure there is only one of them present.
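
A duplicate check of this kind could look like the sketch below. It assumes a hand-made synonym table (here containing only the two name pairs from the r_0916 / r_4723 example) and represents each reaction as a metabolite-to-coefficient mapping. Compartments are ignored for brevity, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# Assumed synonym table: alternative name -> name already used in the model.
SYNONYMS = {
    "D-ribose 5-phosphate": "ribose-5-phosphate",
    "5-Phospho-alpha-D-ribose 1-diphosphate": "PRPP",
}


def canonical(rxn: Dict[str, int]) -> FrozenSet[Tuple[str, int]]:
    """Signature that ignores metabolite synonyms and reaction direction."""
    renamed = {SYNONYMS.get(m, m): c for m, c in rxn.items()}
    mirrored = {m: -c for m, c in renamed.items()}
    # Pick one of the two orientations deterministically.
    return frozenset(min(sorted(renamed.items()), sorted(mirrored.items())))


def is_duplicate(a: Dict[str, int], b: Dict[str, int]) -> bool:
    return canonical(a) == canonical(b)


r_0916 = {"ATP": -1, "ribose-5-phosphate": -1, "AMP": 1, "H+": 1, "PRPP": 1}
r_4723 = {
    "ATP": -1,
    "D-ribose 5-phosphate": -1,
    "AMP": 1,
    "H+": 1,
    "5-Phospho-alpha-D-ribose 1-diphosphate": 1,
}
same = is_duplicate(r_0916, r_4723)
```

The quality of such a check depends entirely on the synonym table, which is why merging duplicate metabolites first makes duplicate reactions much easier to find.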

Double check whether the reaction is likely to be present in S. cerevisiae

rxnID	reaction equation	grRule
r_0481	glutathione disulfide[c] + H+[c] + NADPH[c] => 2 glutathione[c] + NADP(+)[c]	(YCL035C and YPL091W) or (YDR098C and YPL091W) or (YDR513W and YPL091W) or (YER174C and YPL091W)
r_4711	2 glutathione[c] + NAD[c] <=> glutathione disulfide[c] + H+[c] + NADH[c]	YPL091W

The first reaction is how glutathione oxidoreductase is widely accepted to function. The new reaction is reversible, uses NADH and has a much simplified gene association. What strong evidence is there to include the second one?

Double check the gene associations

See both examples above, the new reactions have much simplified gene associations, while the old reactions indicate complexes with subunits. What strong evidence is there to have the simplified gene association?
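
To make such differences visible across many reactions, the gene sets of the old and new grRules can be compared. The following is a minimal sketch, assuming grRules are written with "and", "or" and parentheses as in the tables above; the helper names are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokens(gr_rule: str) -> List[str]:
    """Split a grRule into words, dropping parentheses and whitespace."""
    return re.findall(r"[^\s()]+", gr_rule)


def genes(gr_rule: str) -> Set[str]:
    """Gene identifiers are all tokens except the logical operators."""
    return {t for t in tokens(gr_rule) if t not in {"and", "or"}}


def compare_rules(old: str, new: str) -> Dict[str, object]:
    """Summarize what a new grRule drops or adds relative to an old one."""
    return {
        "dropped": sorted(genes(old) - genes(new)),
        "added": sorted(genes(new) - genes(old)),
        # An "and" in the old rule that is gone in the new rule means
        # complex information was lost.
        "complex_lost": "and" in tokens(old) and "and" not in tokens(new),
    }


old_rule = (
    "(YCL035C and YPL091W) or (YDR098C and YPL091W) or "
    "(YDR513W and YPL091W) or (YER174C and YPL091W)"
)
report = compare_rules(old_rule, "YPL091W")
```

For the r_0481 / r_4711 pair this reports four dropped genes and a lost complex, which is the kind of change that needs explicit evidence.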

See previous comment

But it's worthwhile to have another look at the previous comment as well, as these issues are not fully resolved. How is the localization determined? Be very careful with reactions predicted by MetaCyc, it can quickly draw in non-native substrates.

cheng-yu-zhang commented 1 year ago

@edkerk Hi, Ed. I have encountered a problem: running deletion = cobra.flux_analysis.deletion.double_gene_deletion(model, gene_list1=pair1, gene_list2=pair2) in Python fails with yeast-GEM from both the main and develop branches. Changing the cobrapy version does not solve it, so I am wondering whether saveYeastModel.m has changed. The error is below:

Traceback (most recent call last):
  File "D:\Anaconda\envs\python38\lib\site-packages\IPython\core\interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 47, in
    deletion = double_gene_deletion(model,
  File "D:\Anaconda\envs\python38\lib\site-packages\cobra\flux_analysis\deletion.py", line 393, in double_gene_deletion
    return _multi_deletion(
  File "D:\Anaconda\envs\python38\lib\site-packages\cobra\flux_analysis\deletion.py", line 144, in _multi_deletion
    with ProcessPool(
  File "D:\Anaconda\envs\python38\lib\site-packages\cobra\util\process_pool.py", line 56, in __init__
    pickle.dump((initializer,) + initargs, handle)
TypeError: cannot pickle 'SwigPyObject' object
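
The traceback ends in pickle.dump inside the process pool, so some object reachable from the arguments cannot be pickled. A generic way to narrow this down is to try pickling each attribute separately. The sketch below is not cobrapy-specific: FakeModel is a hypothetical stand-in, and a threading lock plays the role of the unpicklable wrapped C object.

```python
import pickle
import threading
from typing import Dict


def unpicklable_attributes(obj: object) -> Dict[str, str]:
    """Map attribute name -> error message for attributes pickle rejects."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # TypeError, PicklingError, ...
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


class FakeModel:
    """Hypothetical stand-in for a model that holds a native handle."""

    def __init__(self) -> None:
        self.reactions = ["r_0001", "r_0002"]
        self.solver_handle = threading.Lock()  # locks cannot be pickled


failures = unpicklable_attributes(FakeModel())
```

cobrapy's deletion functions also accept a processes argument. Whether running with a single process avoids the pool, and with it the pickling step, in the installed version is an assumption that would need to be tested.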

cheng-yu-zhang commented 1 year ago

@edkerk @hongzhonglu Are there any methods to solve the above problem?

edkerk commented 1 year ago

Hmm, even if saveYeastModel is changed, it would still produce a valid SBML file that cobrapy should be able to import without issues. Just to confirm that it is really a problem with the model itself, have you tried running it on another model (not yeast-GEM, maybe E. coli)?

cheng-yu-zhang commented 1 year ago

@edkerk @hongzhonglu double_gene_deletion and single_gene_deletion run without problems on iML1515 and yeast-GEM 8.5. But with the latest yeast-GEM, something goes wrong.

However, in MATLAB the double gene deletion runs and the problem is solvable. I am working on it.

edkerk commented 1 year ago

I went through all suggested reactions and checked them one-by-one. With the quality of the current yeast-GEM, one should be careful about including new reactions; there should be more evidence than a reaction appearing in KEGG. I checked with the following strategy:

Check if the reaction is not a partial reaction, which is already represented in the model as the complete reaction.
Compare the new reaction with existing reactions annotated to the same gene: if there is a difference (in e.g. substrate or co-factor), find evidence in literature if the new reaction is supported and/or likely to be present. Not only guided by KEGG or UniProt, but search for more solid evidence.
If the above are true, then see if the reactants and/or products connect to existing metabolites. If so, then include the reaction in that compartment, but do not add it to other compartments. This should rather be addressed by a thorough curation of all reaction compartmentalizations. If the reaction does not connect to the existing metabolic network, then just add it to whatever compartment is suggested.
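
The strategy above can be summarized as a small decision function. This is a sketch under assumptions, not the curation script itself: the three boolean fields stand for the outcome of manual checks, and the rule for choosing among several connecting compartments (prefer the suggested one, otherwise the first in sorted order) is an assumption that the strategy does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Candidate:
    rxn_id: str
    metabolites: Set[str]
    suggested_compartment: str
    is_partial_of_existing: bool       # step 1, decided manually
    differs_from_same_gene_rxns: bool  # step 2, decided manually
    has_literature_support: bool       # step 2, decided manually


def decide(c: Candidate, model_mets: Dict[str, Set[str]]) -> Optional[str]:
    """Return the compartment to add the reaction to, or None to reject it."""
    if c.is_partial_of_existing:
        return None
    if c.differs_from_same_gene_rxns and not c.has_literature_support:
        return None
    # Step 3: add to one compartment where it connects to the network.
    connecting = sorted(k for k, mets in model_mets.items() if c.metabolites & mets)
    if not connecting:
        return c.suggested_compartment
    if c.suggested_compartment in connecting:
        return c.suggested_compartment
    return connecting[0]


model_mets = {"c": {"ATP", "AMP"}, "m": {"pyruvate"}}
partial = Candidate("x1", {"ATP"}, "c", True, False, False)
unsupported = Candidate("x2", {"ATP"}, "c", False, True, False)
connected = Candidate("x3", {"pyruvate", "X"}, "c", False, True, True)
isolated = Candidate("x4", {"Q"}, "p", False, False, False)
```

Keeping the manual judgments as explicit fields makes it possible to record, per reaction, why it was accepted or rejected.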

cheng-yu-zhang commented 1 year ago

I agree with the detailed strategy. With a standard workflow, we can add new reactions more efficiently and credibly.

SysBioChalmers / yeast-GEM