SysBioChalmers / Human-GEM

The generic genome-scale metabolic model of Homo sapiens
https://sysbiochalmers.github.io/Human-GEM-guide/
Creative Commons Attribution 4.0 International
97 stars 41 forks source link

feat: new sets of id for reactions and metabolites #174

Closed haowang-bioinfo closed 3 years ago

haowang-bioinfo commented 4 years ago

Description of the issue:

Planned implementation:

I hereby confirm that I have:

JonathanRob commented 4 years ago

Is the plan to eventually assign an MA_ ID for non-HMR_ reactions as well?

haowang-bioinfo commented 4 years ago

Yes @JonathanRob, this plan aims toward assigning every rxns id with a unique MA_ identifier eventually.

Given that the whole process may take some time, it might be better to begin with a simple implementation, just replacing HMR_ with MA_ at first.

mihai-sysbio commented 3 years ago

Seeing that this change has been in (slow) progress for some time, I'm wondering when it would be a good time to make this change in the entire model. I think the merging of PR #203 would make it easier to implement the adoption of new identifiers. Another question outside the scope of this issue is if metabolites should also use MA_ identifiers.

haowang-bioinfo commented 3 years ago

@mihai-sysbio honestly, myself look forward to a quick adaption to entire model.

mihai-sysbio commented 3 years ago

I can see some of the benefits of this transition, but I'm having a hard time spotting downsides besides the obvious backwards compatibility, which can be preserved by moving previous identifiers to the annotation. What other downsides are there?

haowang-bioinfo commented 3 years ago

After discussing with @mihai-sysbio and @JonathanRob, the following consensus was reached regarding this transition:

Any comments/suggestions to this proposed naming convention are welcome.

mihai-sysbio commented 3 years ago

The letters M and R are in line with the BiGG recommendations, which is good. I think it's preferable to not have any _ in the identifier though. Should we ignore the special naming scheme for pseudoreactions? Are there any thoughts of the compartmentalized metabolites, which is the pairing of a compartment and a metabolite?

haowang-bioinfo commented 3 years ago

@mihai-sysbio agree to take away underscore _.

Now it's probably better to ignore the special naming for pseudoreactions.

I'm not certain about the exact usage of compartment id, which can be implemented "slowly" than the other two.

haowang-bioinfo commented 3 years ago

Due to the breaking of backwards compatibility from this transition, its implementation should be well-planned, i.e. probably associating with a major release.

haowang-bioinfo commented 3 years ago

The current implementation of rxnMAID is incomplete - not all reactions have such a id. This caused confusion during the transition and alredy impeded the downstream development in Metabolic Atlas.

Here is a proposal to systemically assign an rxnMAID for each reaction in Human-GEM:

  1. Format change: from MA_DDDDD to MARDDDDD, where D represents one digit
  2. Retain the existing ids and follow the new format (replacing underscore with R)
  3. Add values to empty ones by picking ids in the range from MAR00001 to MAR09999
  4. An easy implementation is to make an orderly filling and avoid duplication
JonathanRob commented 3 years ago

@Hao-Chalmers From you comment above, I thought we had agreed to MARDDDDD for reactions?

The prefix are: MAR for reactions, MAM for metabolites, MAC for compartments

haowang-bioinfo commented 3 years ago

@JonathanRob yes, the proposal was updated as agreed.

mihai-sysbio commented 3 years ago

Thinking about these IDs, they appear to be encouraging a cross-model approach (ie the same identifier in multiple models). However, this creates the difficulty of maintaining consistency of these IDs between models, which shouldn't be taken lightly. Another approach would be to create IDs within the scope of a model (ie MA:Human-GEM.R0001). The downside here is that mapping across models is more difficult. I have some thoughts on how we can ensure the consistency so that we can pursue the first option with confidence, but I would value a discussion with you before starting anything.

mihai-sysbio commented 3 years ago

Specifically about the ID format, I would encourage the separation of the MA part from the rest of the ID. Also, having looked around on identifiers.org, the _ separator seems a bit uncommon. My preference at the moment is a ., which would result in something like MA.R00001.

haowang-bioinfo commented 3 years ago

IMHO, the cross-model approach (using the same identifiers in multiple models) would actually facilitate in maintaining consistency between models. While the approach of creating IDs within the scope of a model does make the mapping a difficult task.

haowang-bioinfo commented 3 years ago

It seems we've had a consensus on the MA rxn id format as MARDDDDD. I plan to implement as such. @mihai-sysbio @JonathanRob what do you think?

mihai-sysbio commented 3 years ago

I've had another look on identifiers.org at BiGG reactions, VMH reactions, MNX reactions and MetaCyc reactions. It seems okay, and very much in line with MNX, to follow the format MARDDDDD proposed above by @Hao-Chalmers, even though my preference is still to separate the domain space like MA.RDDDDD.

haowang-bioinfo commented 3 years ago

@mihai-sysbio what would be the usage of separating an id into domains?

mihai-sysbio commented 3 years ago

To me, separating the domain space makes for an easier recognition of what the letters mean, especially down the line with MAM for metabolites and who knows what else (say MAG for gene reaction rules or MAC for complexes).

haowang-bioinfo commented 3 years ago

The advantage of an additional dot is not so clear, because they are mainly recognised by code. In addition, this additional delimiter may cause unexpected effect since these ids could be processed by various packages that treat ids with different regular expression patterns (. is a special character).

mihai-sysbio commented 3 years ago

The other option is to completely skip the MA letters in the identifier name, since identifiers.org has a dedicated prefix. The result would be something like https://identifiers.org/metabolic.atlas:R00001 and other website would refer to it as Metabolic Atlas ID: R00001.

haowang-bioinfo commented 3 years ago

it seems KEGG took this R00001 format already.

JonathanRob commented 3 years ago

Yes, to avoid confusion with other database IDs, I would keep the MA prefix within the identifier.

As for the period (MA.R vs. MAR), I think that it makes it more clear by separating the MA prefix from the rest of the identifier, but to be honest I don't have a strong preference either way. I don't believe it should cause parsing issues to have the . present, since many ID types already include special characters.

cfrainay commented 3 years ago

Dear all,

Thanks for bringing this up, having a model-specific prefix to avoid confusion and conflicts between models built from same sources sounds like a very good idea.

Regarding the special characters, since the "." is interpreted in regex and sh commands and can add confusion when present in a URI, I guess it might have more inconvenient side effects than _ or nothing.

Sure, it can easily be escaped, but doesn't seem that necessary. The alphabetic characters prefix followed by digit only suffix might already be enough to separate both, if that is ever needed.

mihai-sysbio commented 3 years ago

Thank you all for the valuable input. For the looks I still prefer . but even though, like @JonathanRob said, with proper escaping it shouldn't be a problem, let's remove the possibility of situations as described by @cfrainay.

Let's go for MA-R0001.

The separation is meant increased legibility and to avoid potential confusion (if we imagine a contrived example of a website called Metabolic Atlas Reactants then R in a reaction with id MAR0001 can have double meaning: is it Reactants or reactions?).

haowang-bioinfo commented 3 years ago

Okay, we end up with the format of MA-RDDDDD, where D represents a digit.

mihai-sysbio commented 3 years ago

Similarly, I would propose MA-MDDDDD for the metabolite IDs.

haowang-bioinfo commented 3 years ago

It does appear to be reasonable to standardize MA met ids as such.

To make a convenient transition from the previous HMR format, I would recommend to continue appending compartment id as suffix: MA-MDDDDDC, where C represent the abbreviation of corresponding compartment. @mihai-sysbio @JonathanRob what do you think?

mihai-sysbio commented 3 years ago

appending compartment id as suffix: MA-MDDDDDC, where C represent the abbreviation of corresponding compartment

I see that for reactions there is no such last letter C for the implemented reaction IDs:

rxns rxnKEGGID rxnBiGGID rxnEHMNID rxnHepatoNET1ID rxnREACTOMEID rxnRecon3DID rxnMetaNetXID rxnHMR2ID rxnRatconID rxnTCDBID spontaneous rxnMAID
"FAOXC101C8m" "" "FAOXC101C8m" "" "" "" "FAOXC101C8m" "" "" "" "" 0 "MA-R04967"
"FAOXC101C8x" "" "FAOXC101C8x" "" "" "" "FAOXC101C8x" "" "" "" "" 0 "MA-R04968"
haowang-bioinfo commented 3 years ago

@mihai-sysbio having compartment id as suffix is a common practice for naming met ids, such as in BiGG (ATP). Another advantage is to retain the consistency and inheritance (at least to some extent) with HMR ids as below.

rxns rxnMAID
"HMR_1967" "MA-R01967"
mets metMAID
"m01967c" "MA-M01967c"
mihai-sysbio commented 3 years ago

True! The advantage of the other approach is to enable a section like Present in compartments as here https://identifiers.org/vmhmetabolite:h2o .

haowang-bioinfo commented 3 years ago

Implementation of MA met ids (cdd63f36dd7c55810f89d5695095e1adaeec814c):

@mihai-sysbio @JonathanRob what do you think about this plan?

JonathanRob commented 3 years ago

@mihai-sysbio I just wanted to comment on your note that compartment abbreviations are used in metabolite IDs but not reaction IDs.

Ideally, the reaction and metabolite IDs should not be embedded with compartment information because it implicitly enforces a specific set of compartment abbreviations (i.e., "s" is extracellular, which differs from the COBRA style of using "e" for extracellular). In practice, however, it's more complicated.

I think it makes sense to use different IDs for the same reaction in different compartments (rather than using the same base ID with different compartment abbreviations) because some of the reaction properties may differ. For example, the same reaction may be catalyzed by a different enzyme depending on the compartment, meaning that the gene-reaction rule will differ between the two cases. Similarly, the same reaction taking place in a different species will be catalyzed by a different enzyme (unless it's spontaneous). Finally, the environment of a different compartment (or species) may be such that the reaction reversibility is affected, so the associated lower and/or upper bounds would be affected.

For a metabolite, however, it is the same compound with the same annotations, properties, and associations regardless of what compartment or species it's in. It may change its protonation state due to a pH difference, but then it's treated as a different metabolite with a different ID. So unlike a reaction, a metabolite can be treated as (effectively) identical everywhere it exists. In fact, a metabolite should have the same exact ID even in different compartments, but since GEM standards require that all metabolites in a GEM must have a unique ID, we are forced to differentiate them somehow.

So this is my long way of saying that I'm not really happy with the idea of embedding the compartment abbreviation within the metabolite ID, but to me it still seems (very, very slightly) better than using a completely different ID for different compartment versions of a metabolite.

mihai-sysbio commented 3 years ago

Thank you both for the ideas and explanations.

In Metabolic Atlas we do distinguish between a Metabolite and a CompartmentalizedMetabolite. The ID format suggested by @Hao-Chalmers would then map directly to the CompartmentalizedMetabolite. I'm still thinking of ways to associate the two.

haowang-bioinfo commented 3 years ago

@JonathanRob nice comments that are very good considerations in identifier formatting, which has been and will be a long-term process.

On the other hand, to have a short-term progress, my feeling is we've reached a consensus with the proposed plan. Am I right? @mihai-sysbio

mihai-sysbio commented 3 years ago

a consensus with proposed plan

@Hao-Chalmers indeed!

I was wondering if it would make sense to have another column in the metabolites.tsv that maps to the Metabolite, so that each CompartmentalizedMetabolite row is obviously mapped to the Metabolite. In our case, the compartmentalized ID will be with the compartment letter as you said @Hao-Chalmers (even though it is non-standard as @JonathanRob mentioned), and the non-compartmentalized one will simply be missing that letter. It looks really redundant, but it's also very clear and would require no custom regex parsing.

haowang-bioinfo commented 3 years ago

if it would make sense to have another column with non-compartmentalized ids that will simply be missing that compartment letter

@mihai-sysbio in metabolites.tsv there is an existing column "metsNoComp" for serving this purpose.

mihai-sysbio commented 3 years ago

@Hao-Chalmers is the plan to place these new IDs in:

haowang-bioinfo commented 3 years ago

for now, let's focus on adding metMAID column first.

mihai-sysbio commented 3 years ago

I see, so the metsNoComp then isn't directly useful for the purpose of mapping non-compartmentalized MA ids.

mihai-sysbio commented 3 years ago

@Hao-Chalmers what are your thoughts on another column metsNoCompMAID?

haowang-bioinfo commented 3 years ago

@Hao-Chalmers what are your thoughts on another column metsNoCompMAID?

My thoughts are:

mihai-sysbio commented 3 years ago

avoid implementations for uncertain requirements (Occam's Razor)

The requirement is for a person to link to all compartmentalized metabolites. This cannot be done at the moment, and it would need a metabolite id, similar to metsNoComp.

prioritize to certain things

I'm not sure what this means.

metsNoCompMAID is super easy to obtain (just trimming the last character)

Parsing/regex on identifiers is not a good practice, even if it's easy.

metsNoCompMAID would end up as identical column to metsNoComp once MA ids are used

Totally agree. Let's then implement this in 2 stages - only mets now, and automatically obtain metsNoComp later.

haowang-bioinfo commented 3 years ago

prioritize to certain things

I'm not sure what this means.

This means that the consensus to metMAID is certain and can be prioritized.

mihai-sysbio commented 3 years ago

I believe this issue can be closed.