SysBioChalmers / yeast-GEM

The consensus GEM for Saccharomyces cerevisiae
http://sysbiochalmers.github.io/yeast-GEM/
Creative Commons Attribution 4.0 International
95 stars 45 forks source link

fix: subSystems field #11

Closed BenjaSanchez closed 2 years ago

BenjaSanchez commented 6 years ago

Currently the fields rxnECNumbers & subSystems are retrieved from KEGG & Swissprot, using an automatic script. However, this leads to many cases in which there is more than one match, undesired in the case of subsystems. At some previous version of the Yeast model, this information was indeed present, but someone deleted those fields. @hongzhonglu could you look further into this? The desired data should be available in the original sourceForge repository

BenjaSanchez commented 6 years ago

@demilappa maybe you can shed some light here? I remember you told me you managed to do this once, if so which version of Yeast did you use for it?

demilappa commented 6 years ago

@BenjaSanchez I have added the rxnECNumbers & subSystems to other GEMs of bacterial species with semi-automated scripts, not the yeast one. Unfortunately, for multiple EC Numbers per reaction it involved manual input in case the Subsytems were different.

In my case the source of reconstruction was http://kbase.us/metabolic-modeling-in-kbase/. In their Git repository they already had a preliminary mapping among rxns and external DBs, where for the vast majority of the cases they had soecified compartments. As for the EC Numbers, you can still have multiple ones in your structure.

BenjaSanchez commented 6 years ago

@demilappa thanks for the info! @hongzhonglu as you mentioned that you checked previous versions of the model with no luck, I will close this issue.

hongzhonglu commented 6 years ago

@BenjaSanchez, about two weeks ago, I did some mapping according to the reaction name to obtain the subsystem for each reaction. The model used in the mapping are iMM904, iTO980 and the latest human model Recon3D. All the transport reaction are clarified as transport+compartment. So based on such mapping and the present subsystem in the yeast7.6, I hope each reaction can has a unique subsystem, just like the latest human and E.coli GEMs.

BenjaSanchez commented 6 years ago

@hongzhonglu interesting! how much coverage did you achieve with said mapping? which pathway names are you getting back? from KEGG or another database? I have now re-opened this issue with a changed goal: to eliminate the duplicates in subsystems. Looking forward to your contribution!

hongzhonglu commented 6 years ago

@BenjaSanchez I have updated the subsystem. You can check it now!

BenjaSanchez commented 6 years ago

as discussed in person, a new branch curation/reactions was created for adding these changes

edkerk commented 6 years ago

Discussed today:

edkerk commented 4 years ago

I would like to reignite this issue.

In yeast-GEM, reactions can be annotated to multiple subsystems, and these are predominantly KEGG pathways. Is this desired? This was discussed offline (reported in https://github.com/SysBioChalmers/yeast-GEM/issues/11#issuecomment-399115270), but arguments for this were not recorded.

From the comments above, it seemed there was consensus that single subsystems are preferred, but we ended up with multiple subystems anyway.

A few points to consider:

Having single subSystems and rxnKEGGpathway annotations is the best of both worlds.

Expected feature/value/output:

Single subSystems, e.g. in the case of r_0103, acetyl-CoA C-acetyltransferase; Fatty acid degradation

Current feature/value/output:

Many reactions have multiple subSystems, e.g. in the case of r_0103, acetyl-CoA C-acetyltransferase; sce00071  Fatty acid degradation;sce00072  Synthesis and degradation of ketone bodies;sce00280  Valine, leucine and isoleucine degradation;sce00310  Lysine degradation;sce00380  Tryptophan metabolism;sce00620  Pyruvate metabolism;sce00630  Glyoxylate and dicarboxylate metabolism;sce00640  Propanoate metabolism;sce00650  Butanoate metabolism;sce00900  Terpenoid backbone biosynthesis;sce01110  Biosynthesis of secondary metabolites;sce01130  Biosynthesis of antibiotics;sce01200  Carbon metabolism;sce01212  Fatty acid metabolism

hongzhonglu commented 4 years ago

Hi Ed, @edkerk actually, we now already has single subsystem for each rxns. I like your idea to have "both single subSystems and rxnKEGGpathway annotations".

edkerk commented 4 years ago

@hongzhonglu, that's great, but where can I find those single subsystems? Both in master and devel reactions still have multiple subsystems.

hongzhonglu commented 4 years ago

Hi Ed, @edkerk, i don't yet upload them as we did not know how to handle it before.

BenjaSanchez commented 4 years ago

@edkerk completely agree with moving the current subSystems to a rxnKEGGpathway field, that way we don't loose any information while having better subsystem definition :)

mihai-sysbio commented 4 years ago

Is there any timeline for this change, at least with regards to having this up on a feature branch?

hongzhonglu commented 4 years ago

Hello, @mihai-sysbio I will try to update it this month.

hongzhonglu commented 4 years ago

Hi, @edkerk @BenjaSanchez @feiranl Now I add the manually curated subsystem to the data file of our repo. https://github.com/SysBioChalmers/yeast-GEM/blob/unique_subsystem_of_rxn/ComplementaryData/modelCuration/Rxn_unique_subsystem.tsv.

I think we still need to refine it before we merge it into the model in the followed issues:

  1. Now 99 reactions belong to types of “none” or “other”. They can be put under “Unassigned” like the latest E.coli GEMs.

  2. Another issue is the “slime reaction” in the lipid metablism. Seems strange in the definition as it is not recorded from literature.

BenjaSanchez commented 4 years ago

In order to merge to devel we should first propose to cobratoolbox the addition of rxnKEGGpathway as a new field, @edkerk is it enough to add a new line to src/base/io/definitions/COBRA_structure_fields.csv? I can take care of it, maybe also adding some of the fields discussed in https://github.com/SysBioChalmers/RAVEN/pull/285#issuecomment-624828770

@hongzhonglu answering to your other questions:

  1. Why create a name at all? We could instead leave an empty string for all those cases (AFAIK not all reactions need to be a part of a subsystem).

  2. "SLIME reaction" is added by the SLIMEr formalism, and its definition is available at https://doi.org/10.1186/s12918-018-0673-8.

edkerk commented 4 years ago

@BenjaSanchez It is indeed sufficient to just modify the COBRA_structure_fields.csv file, as done here: https://github.com/opencobra/cobratoolbox/pull/1591.

mihai-sysbio commented 4 years ago

Here are some very specific questions and observations - please excuse my ignorance and silly questions:

edkerk commented 4 years ago

@mihai-sysbio

Here are some very specific questions and observations - please excuse my ignorance and silly questions:

  • the tsv is inconsistent in its use of quotations "

Seems like strings with a comma are quoted. Not sure what was used to write this file.

  • is it okay to have 2 KEGG pathways, ie glycerolipid metabolism / glycerophospholipid metabolism ( sce00561,sce00564 ) for some reactions?

Ideally one.

  • should exchange reaction be exchange reactions, in its plural form?

Either way fine by me

  • is growth a good subsystem name?

"Biomass is perhaps better?"

  • does the transport subsystem really need to be divided ? (eg transport [cytoplasm, golgi_membrane], transport [cytoplasm, golgi] and so on)

Either way fine by me

hongzhonglu commented 4 years ago

I will update it based on all your comments.

BenjaSanchez commented 3 years ago

Update: I have opened https://github.com/SysBioChalmers/yeast-GEM/pull/253 moving the KEGG pathway ids to the proper field. After it is merged, model.subSystems will be ready to get populated with the new group classifications :)

hongzhonglu commented 3 years ago

Nice! I will try to further update the group classifications.

edkerk commented 3 years ago

What is the status of this? I was about to make a few curations to Hongzhong's proposed list of subsytems, but then I noticed that MetabolicAtlas also has subsystems defined . @mihai-sysbio should we try to adhere to the MetabolicAtlas maps whenever possible/suitable? Or will these just be re-drawn when we here have settled the subsystems in yeast-GEM?

edkerk commented 3 years ago

I now noticed the maps on https://github.com/SysBioChalmers/Yeast-maps/, should we try to adhere to these? How have they been defined?

mihai-sysbio commented 3 years ago

What is the status of this?

This is a bit of a thorny issue. I don't think I can remember all the details accurately (please correct me @edkerk @pecholleyc @hongzhonglu):

@mihai-sysbio should we try to adhere to the MetabolicAtlas maps whenever possible/suitable?

With all of the above in mind my suggestion going forward is to add 1 subsystem for each reaction in the yml file. The next step would then be to decide which SVG is best suited as a map for the respective subsystem.

mihai-sysbio commented 3 years ago

I should have mentioned this, but Metabolic Atlas is relying on the subsystem field in the yml populate the reaction-subsystem association. We will likely have to postpone updating the model until this gets sorted out.

edkerk commented 3 years ago

I should have mentioned this, but Metabolic Atlas is relying on the subsystem field in the yml populate the reaction-subsystem association. We will likely have to postpone updating the model until this gets sorted out.

But it makes sense that this information is sourced from subsystem in the yml? And this doesn't affect the maps, right, as they are already defined? I'm not sure what changes there should be made for "this gets sorted out", and how this would affect the subsystem definition?

hongzhonglu commented 3 years ago

Hi all, I am sorry that in past months, i have no time to further curate subsystem used for our model and maps. But I will try to update it in following months. The maps are drawn based on the subsystems, so if the subsystems were updated, the map will be need to be updated also. @edkerk

mihai-sysbio commented 3 years ago

@edkerk I'm following up on your questions below.

But it makes sense that this information is sourced from subsystem in the yml?

I guess we can also change how the reaction to subsytem association is parsed from the yml file, but that's how things work like for the rest of the models on Metabolic Atlas.

And this doesn't affect the maps, right, as they are already defined?

Indeed - the SVG maps are already created and they won't be changed. They would need to be updated, as @hongzhonglu says above, but it will take a long time until that happens, so for the near future we should think that they won't be changed.

I'm not sure what changes there should be made for "this gets sorted out", and how this would affect the subsystem definition?

What is expected is that in the yml file for each reaction to have 1 subsystem, for example:

- reactions:
    - !!omap
      - id: "MAR03905"
      ...
      - subsystem:
          - "Glycolysis / Gluconeogenesis"

To my knowledge, this is part of the normal export to yaml that has been recently included in Raven.

edkerk commented 3 years ago

But it makes sense that this information is sourced from subsystem in the yml?

I guess we can also change how the reaction to subsytem association is parsed from the yml file, but that's how things work like for the rest of the models on Metabolic Atlas.

My intonation probably didn't come across as I intended, I do think it makes sense to do it that way, so I do no propose any changes.

My confusion was based on whether subsystem assignment is actively used to assign reactions to different maps, so that if subsystems would be changed this would directly alter the maps. But I now understand that this is not the case, the maps were once assigned based on subsystems that were specified in that version of the yeast model. Any changes to the subsystems in new model versions do not automatically translate in changes in maps on Metabolic Atlas (the same reactions will still be linked) but if we'd want to have the maps truly reflect the subsystems then it would have to be done manually.

Alright, so just to make sure I understand it correctly, we should:

  1. Assign 1 subsystem per reaction.
  2. If possible/feasible, follow the existing assignments to Metabolic Atlas maps. It will not break anything at Metabolic Atlas if this is not matched, but trying to adhere to Metabolic Atlas maps would make it easier to modify those maps if desired in the future.

And to see the existing mapping of reactions to Metabolic Atlas maps, one should look at the yml of yeast-GEM 8.4.2, and sort out subsytems without a map (for instance prefixed with # here)?

hongzhonglu commented 3 years ago

I agree with the two points @edkerk

mihai-sysbio commented 3 years ago

Very well laid out @edkerk!

modify those maps if desired in the future

On Metabolic Atlas will will look into automatic generation of 2D maps, so perhaps don't put too much weight on the current assignment, as we intend to find a way of always having fully accurate maps.

And to see the existing mapping of reactions to Metabolic Atlas maps, one should look at the yml of yeast-GEM 8.4.2, and sort out subsytems without a map (for instance prefixed with # here)?

That's right. In the case of some existing maps not mapping well with a definition of a subsystem (ie many reactions from other subsystems), there is already support for Custom maps defined in customSVG.tsv. These are maps that will not be matched to a subsystem in the yml file.

hongzhonglu commented 2 years ago

@edkerk Hi Ed, now I added miss subsystem information for the remaining rxns. It will be better to combine this into dev version for further update. In the following, the newly added reactions should have a unique subsystem definition.

edkerk commented 2 years ago

I rebased it to the devel9 branch, is there a function that should apply these changes?

For the subsystems that are based on KEGG pathways maps, this should be removed from the end of the subsystem name (citrate cycle (tca cycle) ( sce00020 ) should be citrate cycle (tca cycle)), as the KEGG pathway IDs (e.g. sce00020) are already be included as kegg.pathway annotation in the model.