Closed haowang-bioinfo closed 3 years ago
Is the plan to eventually assign an MA_
ID for non-HMR_
reactions as well?
Yes @JonathanRob, this plan aims toward assigning every rxns
id with a unique MA_
identifier eventually.
Given that the whole process may take some time, it might be better to begin with a simple implementation, just replacing HMR_
with MA_
at first.
Seeing that this change has been in (slow) progress for some time, I'm wondering when it would be a good time to make this change in the entire model. I think the merging of PR #203 would make it easier to implement the adoption of new identifiers.
Another question outside the scope of this issue is if metabolites should also use MA_
identifiers.
@mihai-sysbio honestly, myself look forward to a quick adaption to entire model.
I can see some of the benefits of this transition, but I'm having a hard time spotting downsides besides the obvious backwards compatibility, which can be preserved by moving previous identifiers to the annotation. What other downsides are there?
After discussing with @mihai-sysbio and @JonathanRob, the following consensus was reached regarding this transition:
MA
ids from reactions to metabolites, and probably also compartmentsMA
id is formatted as concatenation of two parts: prefix letters and suffix digitsMAR
for reactions, MAM
for metabolites, MAC
for compartmentsAny comments/suggestions to this proposed naming convention are welcome.
The letters M
and R
are in line with the BiGG recommendations, which is good. I think it's preferable to not have any _
in the identifier though. Should we ignore the special naming scheme for pseudoreactions?
Are there any thoughts of the compartmentalized metabolites, which is the pairing of a compartment and a metabolite?
@mihai-sysbio agree to take away underscore _
.
Now it's probably better to ignore the special naming for pseudoreactions.
I'm not certain about the exact usage of compartment id, which can be implemented "slowly" than the other two.
Due to the breaking of backwards compatibility from this transition, its implementation should be well-planned, i.e. probably associating with a major release.
The current implementation of rxnMAID
is incomplete - not all reactions have such a id. This caused confusion during the transition and alredy impeded the downstream development in Metabolic Atlas.
Here is a proposal to systemically assign an rxnMAID
for each reaction in Human-GEM
:
MA_DDDDD
to MARDDDDD
, where D
represents one digitR
)MAR00001
to MAR09999
@Hao-Chalmers From you comment above, I thought we had agreed to MARDDDDD
for reactions?
The prefix are:
MAR
for reactions,MAM
for metabolites,MAC
for compartments
@JonathanRob yes, the proposal was updated as agreed.
Thinking about these IDs, they appear to be encouraging a cross-model approach (ie the same identifier in multiple models). However, this creates the difficulty of maintaining consistency of these IDs between models, which shouldn't be taken lightly.
Another approach would be to create IDs within the scope of a model (ie MA:Human-GEM.R0001
). The downside here is that mapping across models is more difficult.
I have some thoughts on how we can ensure the consistency so that we can pursue the first option with confidence, but I would value a discussion with you before starting anything.
Specifically about the ID format, I would encourage the separation of the MA
part from the rest of the ID. Also, having looked around on identifiers.org, the _
separator seems a bit uncommon. My preference at the moment is a .
, which would result in something like MA.R00001
.
IMHO, the cross-model approach (using the same identifiers in multiple models) would actually facilitate in maintaining consistency between models. While the approach of creating IDs within the scope of a model does make the mapping a difficult task.
It seems we've had a consensus on the MA rxn id format as MARDDDDD
. I plan to implement as such. @mihai-sysbio @JonathanRob what do you think?
I've had another look on identifiers.org at BiGG reactions, VMH reactions, MNX reactions and MetaCyc reactions. It seems okay, and very much in line with MNX, to follow the format MARDDDDD
proposed above by @Hao-Chalmers, even though my preference is still to separate the domain space like MA.RDDDDD
.
@mihai-sysbio what would be the usage of separating an id into domains?
To me, separating the domain space makes for an easier recognition of what the letters mean, especially down the line with MAM
for metabolites and who knows what else (say MAG
for gene reaction rules or MAC
for complexes).
The advantage of an additional dot is not so clear, because they are mainly recognised by code. In addition, this additional delimiter may cause unexpected effect since these ids could be processed by various packages that treat ids with different regular expression patterns (.
is a special character).
The other option is to completely skip the MA
letters in the identifier name, since identifiers.org has a dedicated prefix. The result would be something like https://identifiers.org/metabolic.atlas:R00001 and other website would refer to it as Metabolic Atlas ID: R00001
.
it seems KEGG took this R00001 format already.
Yes, to avoid confusion with other database IDs, I would keep the MA
prefix within the identifier.
As for the period (MA.R
vs. MAR
), I think that it makes it more clear by separating the MA
prefix from the rest of the identifier, but to be honest I don't have a strong preference either way. I don't believe it should cause parsing issues to have the .
present, since many ID types already include special characters.
Dear all,
Thanks for bringing this up, having a model-specific prefix to avoid confusion and conflicts between models built from same sources sounds like a very good idea.
Regarding the special characters, since the "." is interpreted in regex and sh commands and can add confusion when present in a URI, I guess it might have more inconvenient side effects than _ or nothing.
Sure, it can easily be escaped, but doesn't seem that necessary. The alphabetic characters prefix followed by digit only suffix might already be enough to separate both, if that is ever needed.
Thank you all for the valuable input. For the looks I still prefer .
but even though, like @JonathanRob said, with proper escaping it shouldn't be a problem, let's remove the possibility of situations as described by @cfrainay.
Let's go for MA-R0001
.
The separation is meant increased legibility and to avoid potential confusion (if we imagine a contrived example of a website called Metabolic Atlas Reactants
then R
in a reaction with id MAR0001
can have double meaning: is it Reactants
or reactions
?).
Okay, we end up with the format of MA-RDDDDD
, where D
represents a digit.
Similarly, I would propose MA-MDDDDD
for the metabolite IDs.
It does appear to be reasonable to standardize MA met ids as such.
To make a convenient transition from the previous HMR format, I would recommend to continue appending compartment id as suffix: MA-MDDDDDC
, where C
represent the abbreviation of corresponding compartment. @mihai-sysbio @JonathanRob what do you think?
appending compartment id as suffix: MA-MDDDDDC, where C represent the abbreviation of corresponding compartment
I see that for reactions there is no such last letter C
for the implemented reaction IDs:
rxns | rxnKEGGID | rxnBiGGID | rxnEHMNID | rxnHepatoNET1ID | rxnREACTOMEID | rxnRecon3DID | rxnMetaNetXID | rxnHMR2ID | rxnRatconID | rxnTCDBID | spontaneous | rxnMAID |
---|---|---|---|---|---|---|---|---|---|---|---|---|
"FAOXC101C8m" | "" | "FAOXC101C8m" | "" | "" | "" | "FAOXC101C8m" | "" | "" | "" | "" | 0 | "MA-R04967" |
"FAOXC101C8x" | "" | "FAOXC101C8x" | "" | "" | "" | "FAOXC101C8x" | "" | "" | "" | "" | 0 | "MA-R04968" |
@mihai-sysbio having compartment id as suffix is a common practice for naming met ids, such as in BiGG (ATP). Another advantage is to retain the consistency and inheritance (at least to some extent) with HMR ids as below.
rxns | rxnMAID |
---|---|
"HMR_1967" | "MA-R01967" |
mets | metMAID |
---|---|
"m01967c" | "MA-M01967c" |
True!
The advantage of the other approach is to enable a section like Present in compartments
as here https://identifiers.org/vmhmetabolite:h2o .
Implementation of MA met ids (cdd63f36dd7c55810f89d5695095e1adaeec814c):
m01967c
to MA-M01967c
).@mihai-sysbio @JonathanRob what do you think about this plan?
@mihai-sysbio I just wanted to comment on your note that compartment abbreviations are used in metabolite IDs but not reaction IDs.
Ideally, the reaction and metabolite IDs should not be embedded with compartment information because it implicitly enforces a specific set of compartment abbreviations (i.e., "s" is extracellular, which differs from the COBRA style of using "e" for extracellular). In practice, however, it's more complicated.
I think it makes sense to use different IDs for the same reaction in different compartments (rather than using the same base ID with different compartment abbreviations) because some of the reaction properties may differ. For example, the same reaction may be catalyzed by a different enzyme depending on the compartment, meaning that the gene-reaction rule will differ between the two cases. Similarly, the same reaction taking place in a different species will be catalyzed by a different enzyme (unless it's spontaneous). Finally, the environment of a different compartment (or species) may be such that the reaction reversibility is affected, so the associated lower and/or upper bounds would be affected.
For a metabolite, however, it is the same compound with the same annotations, properties, and associations regardless of what compartment or species it's in. It may change its protonation state due to a pH difference, but then it's treated as a different metabolite with a different ID. So unlike a reaction, a metabolite can be treated as (effectively) identical everywhere it exists. In fact, a metabolite should have the same exact ID even in different compartments, but since GEM standards require that all metabolites in a GEM must have a unique ID, we are forced to differentiate them somehow.
So this is my long way of saying that I'm not really happy with the idea of embedding the compartment abbreviation within the metabolite ID, but to me it still seems (very, very slightly) better than using a completely different ID for different compartment versions of a metabolite.
Thank you both for the ideas and explanations.
In Metabolic Atlas we do distinguish between a Metabolite and a CompartmentalizedMetabolite. The ID format suggested by @Hao-Chalmers would then map directly to the CompartmentalizedMetabolite. I'm still thinking of ways to associate the two.
@JonathanRob nice comments that are very good considerations in identifier formatting, which has been and will be a long-term process.
On the other hand, to have a short-term progress, my feeling is we've reached a consensus with the proposed plan. Am I right? @mihai-sysbio
a consensus with proposed plan
@Hao-Chalmers indeed!
I was wondering if it would make sense to have another column in the metabolites.tsv
that maps to the Metabolite, so that each CompartmentalizedMetabolite row is obviously mapped to the Metabolite. In our case, the compartmentalized ID will be with the compartment letter as you said @Hao-Chalmers (even though it is non-standard as @JonathanRob mentioned), and the non-compartmentalized one will simply be missing that letter. It looks really redundant, but it's also very clear and would require no custom regex parsing.
if it would make sense to have another column with non-compartmentalized ids that will simply be missing that compartment letter
@mihai-sysbio in metabolites.tsv
there is an existing column "metsNoComp" for serving this purpose.
@Hao-Chalmers is the plan to place these new IDs in:
mets
and metsNoComp
columns, ormetMAID
column, ormetMAID
and metNoCompMAID
columns?for now, let's focus on adding metMAID
column first.
I see, so the metsNoComp
then isn't directly useful for the purpose of mapping non-compartmentalized MA ids.
@Hao-Chalmers what are your thoughts on another column metsNoCompMAID
?
@Hao-Chalmers what are your thoughts on another column
metsNoCompMAID
?
My thoughts are:
metsNoCompMAID
is super easy to obtain (just trimming the last character)metsNoCompMAID
would end up as identical column to metsNoComp
once MA ids are used avoid implementations for uncertain requirements (Occam's Razor)
The requirement is for a person to link to all compartmentalized metabolites. This cannot be done at the moment, and it would need a metabolite id, similar to metsNoComp
.
prioritize to certain things
I'm not sure what this means.
metsNoCompMAID
is super easy to obtain (just trimming the last character)
Parsing/regex on identifiers is not a good practice, even if it's easy.
metsNoCompMAID
would end up as identical column tometsNoComp
once MA ids are used
Totally agree. Let's then implement this in 2 stages - only mets
now, and automatically obtain metsNoComp
later.
prioritize to certain things
I'm not sure what this means.
This means that the consensus to metMAID
is certain and can be prioritized.
I believe this issue can be closed.
Description of the issue:
Human-GEM
as a template for generic usagehumanGEMRxnAssoc.JSON
, without affecting the modelPlanned implementation:
rxnMAID
to ~humanGEMRxnAssoc.JSON
~reactions.tsv
rxnMAID
will be the same as those inrxns
, except that the prefixHMR_
will be systematically replaced withMAR
(abbreviation ofMetabolic Atlas reactions
)metMAID
tometabolites.tsv
I hereby confirm that I have:
master
branch of the repository