Open haowang-bioinfo opened 3 years ago
In principle I support this, but please note the following:
importModel
is mentioning the "yeast consensus network model", in essence it is about compatibility with COBRA toolbox. These prefixes used to be part of their "model format" (AFAIK), but they still use it if an ID starts with a number.letter ::= ’a’..’z’,’A’..’Z’
digit ::= ’0’..’9’
idChar ::= letter | digit | ’_’
SId ::= ( letter | ’_’ ) idChar*
It appears that the COBRA function writeSBML
adds prefixes only when ID starts with number. While the exportModel
in RAVEN applies prefixes to all IDs regardless of their format. Therefore, the prefixes added by COBRA is to avoid downstream errors, and it doesn't change any ids if they are valid. This means that COBRA has no special behaviour in formatting ids but error fixing.
Yes, but when COBRA then reads such a model, it removes the prefixes. So when loaded in COBRA there are never such prefixes, but the SBML file can have such prefixes if required. With the proposed changes in importModel
, you break this compatibility.
Scenario 1:
Model in MATLAB contains a metabolite with ID 3pg[c]
. When exported with COBRA's writeCbModel
, the SBML file uses the ID M_3pg__91__c__93__
. When exported with RAVEN's exportModel
using the behaviour proposed in your first post, this ends up with an SBML file with 3pg[c]
which is not allowed end results in an invalid SBML file.
Scenario 2:
If the SBML file with M_3pg__91__c__93__
is loaded with COBRA's readCbModel
, the ID in the MATLAB model would again be 3pg_c
, while loading the model with RAVEN's importModel
using the behaviour proposed in your first post, the ID in the MATLAB model would be M_3pg__91__c__93__
.
While scenario 2 breaks compatibility between COBRA and RAVEN by design, as RAVEN will more strictly adhere to the libSBML desire not to use such prefixes as you rightly mentioned in your first page. ravenCobraWrapper
could then also be coded to convert these differences between the two model structures. The point 2 I raised earlier, whether keeping the option to force the importModel
and exportModel
functions to practice the old behaviour, I can see the arguments for both enabling this and not enabling this, what is your opinion on this specifically?
Scenario 1 however, there you want to prevent writing the invalid SBML file in the first case. While something like validateSBML
could be used to check the SBML after writing it, it would make more sense to have a quick check whether the IDs are valid before writing the actual SBML file. So the point 4 I raised, that such a check of IDs should be implemented, still stands. This should not be complex, just a few regular expressions that can be run on all IDs, it will only take a few lines of code.
@edkerk thanks for the detail explanations! I don't have an immediate solution the compatibility issue, and hope to get an ideal way through further discussion.
IMHO:
libSBML
and the definition may change with timelibSBML
for error reportingWhile scenario 2 breaks compatibility between COBRA and RAVEN by design, as RAVEN will more strictly adhere to the libSBML desire not to use such prefixes as you rightly mentioned in your first page. ravenCobraWrapper could then also be coded to convert these differences between the two model structures.
This sounds a good idea.
The point 2 I raised earlier, whether keeping the option to force the importModel and exportModel functions to practice the old behaviour, I can see the arguments for both enabling this and not enabling this, what is your opinion on this specifically?
My opinion leans toward not enabling the old behaviour (i.e. break the compatibility).
Scenario 1 however, there you want to prevent writing the invalid SBML file in the first case. While something like validateSBML could be used to check the SBML after writing it, it would make more sense to have a quick check whether the IDs are valid before writing the actual SBML file. So the point 4 I raised, that such a check of IDs should be implemented, still stands. This should not be complex, just a few regular expressions that can be run on all IDs, it will only take a few lines of code.
The idea of having a few regular expressions check whether the IDs are valid
appears to be a pragmatic solution for now.
To keep the discussion mostly here, I have the following comment on #356, where currently only exportModel
is changed:
What about importModel
? A round of I/O (importModel
-> exportModel
) of a COBRA-style SBML file will now change all IDs:
ID in SBML file ---[ current importModel ]---> ID in MATLAB ---[ new exportModel ]---> ID in SBML file
R_hxk ---[ current importModel ]---> hxk ---[ new exportModel ]---> hxk
Above (#353), it was discussed to also deprecate the behaviour of importModel
that currently removes R_
and similar prefixes. This would result in:
R_hxk ---[ new importModel ]---> R_hxk ---[ new exportModel ]---> R_hxk
but also
hxk ---[ new importModel ]---> hxk ---[ new exportModel ]---> hxk
So changing both importModel
and exportModel
are required, according to:
- the import/export functions should neither modify/reformat ids nor correct errors
In addition, it would be good to have a checkIDvalidity
function (or other name?) that runs a few regexp on model.mets
, model.rxns
and model.comps
, to ensure that they adhere to the SBML specification SId ::= ( letter | ’_’ ) idChar*
.
In addition, it would be good to have a
checkIDvalidity
function (or other name?) that runs a few regexp onmodel.mets
,model.rxns
andmodel.comps
, to ensure that they adhere to the SBML specificationSId ::= ( letter | ’_’ ) idChar*
.
Yes!
My 2 cents: Cobra's design choice to modify the identifiers is not optimal and is affecting everyone, not just us. Rather than worrying about maintaining Raven's compatibility with Cobra, Cobra's design choice should be reported and corrected at source.
My 2 cents: Cobra's design choice to modify the identifiers is not optimal and is affecting everyone, not just us. Rather than worrying about maintaining Raven's compatibility with Cobra, Cobra's design choice should be reported and corrected at source.
I agree, so let's be consistent then and deprecate the identifier modifications by importModel
as well.
Agree to deprecate the identifier modifications both in importModel
and exportModel
In addition, it would be good to have a checkIDvalidity function (or other name?) that runs a few regexp on model.mets, model.rxns and model.comps, to ensure that they adhere to the SBML specification SId ::= ( letter | ’_’ ) idChar*.
Besides model.mets
, model.rxns
, and model.comps
, what else fields should also be checked? Should the check be made by a new function (e.g. checkIDValidity
) or included into the existing checkModelStruct.m
function? @edkerk @simas232
We should also check model.genes
as I saw some rare examples a couple of years ago of genes having "G_" in the beginning.
This functionality seems to be a little bit out of scope from checkModelStruct
function, but I suggest we do not create any new functions anyway.
To my knowledge that would indeed cover all cases, model.mets
, model.rxns
, model.comps
and model.genes
(I think COBRA Toolbox still adds the G_
prefix?).
I like the idea of incorporating it in checkModelStruct
, just make sure that exportModel
then always runs checkModelStruct
, as this is not the default currently:
https://github.com/SysBioChalmers/RAVEN/blob/53fe2b700f86dce9b46be70f097cc7f898b509e9/io/exportModel.m#L11-L12 https://github.com/SysBioChalmers/RAVEN/blob/53fe2b700f86dce9b46be70f097cc7f898b509e9/io/exportModel.m#L64-L66
The idea of having a few regular expressions check whether the IDs are valid appears to be a pragmatic solution for now.
A section for validating id format with regular expressions is added to checkModelStruct
by commit 66f5a0ee7c3d40fb9e26c09c3b0f72c2612b672c.
Next is to enable checkModelStruct
by default in exportModel
, probably also in importModel
? @edkerk @simas232
Not sure if it is necessary for importModel
, as invalid SBML would have resulted in an error when running importModel
.
Will try out the new commit soon.
A more recent concern I'm having is that many models use BiGG IDs that can start with digits. Considering that this PR would follow the same philosphy of cobrapy (do not change identifiers), how are metabolite IDs represented in cobrapy if the model uses BiGG IDs as metabolite identifiers? Or actually, how are BiGG IDs represented in those SBML files? They must have some prefix.
edit: cobrapy also uses prefixes with I/O of SBML files (although not with yml or JSON!). The prefix is really a COBRA-style SBML "feature".
Now, if one would load a YML whose identifiers are BiGG IDs (e.g. 6pgc), this model cannot be exported to SBML as 6pgc
is not a valid identifier and exportModel
does not add a prefix. I/O of YML is something we want to support, as part of a YML-based curation pipeline as advocated by human-GEM.
So, what if SBML files are still I/O in agreement with the COBRA-style prefixes? Meanwhile, I/O to yml and export to .txt or .xlsx do not add/remove prefixes. Now, the model in MATLAB has the same identifiers as in the yml, txt and xlsx files, while the SBML file is in COBRA-style. Now, the SBML file is handled identically by COBRA and cobrapy.
There are many different options to handle the exceptions introduced by SBML. For me, uniformity and absence of surprises are important. This is why I really like that:
I/O to yml and export to .txt or .xlsx do not add/remove prefixes
the model in MATLAB has the same identifiers as in the yml, txt and xlsx files
We already know from MetaNetX that prefixed IDs in SBML when they are not needed is bad. Moving forward we would at least have to make this conditional.
I'm tempted to go as far as completely dropping support for prefixing model components. For the cases when users need to have prefixes like in the aforementioned BiGG models, they could a) do this themselves or b) do the SBML export with cobratoolbox.
I'm tempted to go as far as completely dropping support for prefixing model components. For the cases when users need to have prefixes like in the aforementioned BiGG models, they could a) do this themselves or b) do the SBML export with cobratoolbox.
We can then have default I/O behaviour to leave identifiers untouched, but include a COBRAstyle
flag to replicate COBRA's behaviour of prefix addition/removal.
Regardless, if we change the default behaviour of importModel
and exportModel
, this might still be best handled as major release due to breaking backwards compatibility.
Instead of COBRAstyle
as a parameter, how about the following scenario: exporting an incompatible model crashes exportModel
with a message asking the user to run a function called makeIdsSBMLFriendly
, which renames all the broken ids in the model. This would leave the import/export untouched, keeping the functions modular. Imho an even better approach would be to trigger the same warning at the import stage as well.
Instead of
COBRAstyle
as a parameter, how about the following scenario: exporting an incompatible model crashesexportModel
with a message asking the user to run a function calledmakeIdsSBMLFriendly
, which renames all the broken ids in the model.
But how will it rename those identifiers to make them valid SBML? Will it add the prefixes? In that case, wouldn't it make more sense to then add these prefixes all metabolites/reactions/genes, and basically follow COBRA-style? Would be weird to end up with some hybrid version.
What if checkModelStruct
(that runs in the beginning of exportModel
) not only throws a error when it finds invalid identifiers, but let it also suggest to either fix the identifiers manually, or alternatively save the model in COBRA format run addIdentifierPrefix
, which will add M_
, R_
and G_
to metabolites, reactions and genes? I'm not a fan of that either, because then you'll end up with a model in MATLAB with those prefixes attached, but we only want those prefixes to appear in the SBML. Adding a COBRAstyle
flag to importModel
/exportModel
seems like a sensible way to do this? The prefix removal and addition should then not be done as it was previously, but just at either the beginning (exportModel) or end (importModel) of the function.
What if checkModelStruct (that runs in the beginning of exportModel) not only throws a error when it finds invalid identifiers, but let it also suggest to either fix the identifiers manually, or alternatively save the model in COBRA format run addIdentifierPrefix, which will add M, R and G_ to metabolites, reactions and genes? I'm not a fan of that either, because then you'll end up with a model in MATLAB with those prefixes attached, but we only want those prefixes to appear in the SBML. Adding a COBRAstyle flag to importModel/exportModel seems like a sensible way to do this? The prefix removal and addition should then not be done as it was previously, but just at either the beginning (exportModel) or end (importModel) of the function.
I think this is a pragmatic solution to the long-hanging #360.
The occurrence of IDs starting with number (e.g. BiGG met ids) is rather common, and has to be adapted to fit SBML scheme in order to improve the usability of exportModel
. Look forward to the implementation as you proposed @edkerk
Description of the issue:
M_
andR_
are added byexprotModel
function while exporting, and removed byimportModel
while importing. This was initially designed to follow the "yeast consensus network model", which however had been replaced byyeast-GEM
that is no longer exactly using these prefixes.I hereby confirm that I have: