refactor: remove prefix 'M_' and 'R_' in exprotModel/importModel functions

haowang-bioinfo commented 3 years ago

Description of the issue:

Prefixes M_ and R_ are added by exprotModel function while exporting, and removed by importModel while importing. This was initially designed to follow the "yeast consensus network model", which however had been replaced by yeast-GEM that is no longer exactly using these prefixes.
Here it is proposed to remove these prefixes from both functions so that the original identifiers in a model can be presented in exported SBML file and avoid confusion.
In addition, these prefixes are not encouraged by libSBML scheme.

I hereby confirm that I have:

[X] Followed the guidelines to install RAVEN.
[X] Checked that a similar issue does not already exist
[ ] If suitable, needed, asked first in the Gitter chat room about the issue

edkerk commented 3 years ago

In principle I support this, but please note the following:

While documentation in importModel is mentioning the "yeast consensus network model", in essence it is about compatibility with COBRA toolbox. These prefixes used to be part of their "model format" (AFAIK), but they still use it if an ID starts with a number.
With the above in mind, to keep compatibility with COBRA one might want to have the option to keep the current behaviour? So by default follow your suggestion here, but keep the option to I/O as COBRA model. What is your opinion?
The above code also raises the issue that identifiers that are defined in the model might not be valid SBML IDs. The SBML specification states the following (page 12), which basically means that an ID starts with either a letter or underscore, and the rest of the ID can be a combination of only letters, numbers and underscores, no other characters are allowed:

letter ::= ’a’..’z’,’A’..’Z’
digit ::= ’0’..’9’
idChar ::= letter | digit | ’_’
SId ::= ( letter | ’_’ ) idChar*

With the above in mind, it would be good to write a function that checks whether illegal IDs exist in the model when loaded in MATLAB. This function should run when the model is exported, and should then throw an error. Might be good to have it as standalone function so you can also run it without exporting the model.

haowang-bioinfo commented 3 years ago

It appears that the COBRA function writeSBML adds prefixes only when ID starts with number. While the exportModel in RAVEN applies prefixes to all IDs regardless of their format. Therefore, the prefixes added by COBRA is to avoid downstream errors, and it doesn't change any ids if they are valid. This means that COBRA has no special behaviour in formatting ids but error fixing.

edkerk commented 3 years ago

Yes, but when COBRA then reads such a model, it removes the prefixes. So when loaded in COBRA there are never such prefixes, but the SBML file can have such prefixes if required. With the proposed changes in importModel, you break this compatibility.

Scenario 1: Model in MATLAB contains a metabolite with ID 3pg[c]. When exported with COBRA's writeCbModel, the SBML file uses the ID M_3pg__91__c__93__. When exported with RAVEN's exportModel using the behaviour proposed in your first post, this ends up with an SBML file with 3pg[c] which is not allowed end results in an invalid SBML file.

Scenario 2: If the SBML file with M_3pg__91__c__93__ is loaded with COBRA's readCbModel, the ID in the MATLAB model would again be 3pg_c, while loading the model with RAVEN's importModel using the behaviour proposed in your first post, the ID in the MATLAB model would be M_3pg__91__c__93__.

While scenario 2 breaks compatibility between COBRA and RAVEN by design, as RAVEN will more strictly adhere to the libSBML desire not to use such prefixes as you rightly mentioned in your first page. ravenCobraWrapper could then also be coded to convert these differences between the two model structures. The point 2 I raised earlier, whether keeping the option to force the importModel and exportModel functions to practice the old behaviour, I can see the arguments for both enabling this and not enabling this, what is your opinion on this specifically?

Scenario 1 however, there you want to prevent writing the invalid SBML file in the first case. While something like validateSBML could be used to check the SBML after writing it, it would make more sense to have a quick check whether the IDs are valid before writing the actual SBML file. So the point 4 I raised, that such a check of IDs should be implemented, still stands. This should not be complex, just a few regular expressions that can be run on all IDs, it will only take a few lines of code.

haowang-bioinfo commented 3 years ago

@edkerk thanks for the detail explanations! I don't have an immediate solution the compatibility issue, and hope to get an ideal way through further discussion.

IMHO:

the import/export functions should neither modify/reformat ids nor correct errors
these functions should report errors whenever possible
it is the job of modellers to keep the IDs correctly formatted
the validity of IDs is defined by libSBML and the definition may change with time
it is preferred to use consistency check provided by libSBML for error reporting

haowang-bioinfo commented 3 years ago

While scenario 2 breaks compatibility between COBRA and RAVEN by design, as RAVEN will more strictly adhere to the libSBML desire not to use such prefixes as you rightly mentioned in your first page. ravenCobraWrapper could then also be coded to convert these differences between the two model structures.

This sounds a good idea.

The point 2 I raised earlier, whether keeping the option to force the importModel and exportModel functions to practice the old behaviour, I can see the arguments for both enabling this and not enabling this, what is your opinion on this specifically?

My opinion leans toward not enabling the old behaviour (i.e. break the compatibility).

Scenario 1 however, there you want to prevent writing the invalid SBML file in the first case. While something like validateSBML could be used to check the SBML after writing it, it would make more sense to have a quick check whether the IDs are valid before writing the actual SBML file. So the point 4 I raised, that such a check of IDs should be implemented, still stands. This should not be complex, just a few regular expressions that can be run on all IDs, it will only take a few lines of code.

The idea of having a few regular expressions check whether the IDs are valid appears to be a pragmatic solution for now.

edkerk commented 3 years ago

To keep the discussion mostly here, I have the following comment on #356, where currently only exportModel is changed:

What about importModel? A round of I/O (importModel -> exportModel) of a COBRA-style SBML file will now change all IDs:

ID in SBML file ---[ current importModel ]---> ID in MATLAB ---[ new exportModel ]---> ID in SBML file
R_hxk           ---[ current importModel ]---> hxk          ---[ new exportModel ]---> hxk

Above (#353), it was discussed to also deprecate the behaviour of importModel that currently removes R_ and similar prefixes. This would result in:

R_hxk           ---[ new importModel ]---> R_hxk        ---[ new exportModel ]---> R_hxk

but also

hxk             ---[ new importModel ]---> hxk          ---[ new exportModel ]---> hxk

So changing both importModel and exportModel are required, according to:

the import/export functions should neither modify/reformat ids nor correct errors

In addition, it would be good to have a checkIDvalidity function (or other name?) that runs a few regexp on model.mets, model.rxns and model.comps, to ensure that they adhere to the SBML specification SId ::= ( letter | ’_’ ) idChar*.

mihai-sysbio commented 3 years ago

In addition, it would be good to have a checkIDvalidity function (or other name?) that runs a few regexp on model.mets, model.rxns and model.comps, to ensure that they adhere to the SBML specification SId ::= ( letter | ’_’ ) idChar*.

Yes!

My 2 cents: Cobra's design choice to modify the identifiers is not optimal and is affecting everyone, not just us. Rather than worrying about maintaining Raven's compatibility with Cobra, Cobra's design choice should be reported and corrected at source.

edkerk commented 3 years ago

My 2 cents: Cobra's design choice to modify the identifiers is not optimal and is affecting everyone, not just us. Rather than worrying about maintaining Raven's compatibility with Cobra, Cobra's design choice should be reported and corrected at source.

I agree, so let's be consistent then and deprecate the identifier modifications by importModel as well.

haowang-bioinfo commented 3 years ago

Agree to deprecate the identifier modifications both in importModel and exportModel

In addition, it would be good to have a checkIDvalidity function (or other name?) that runs a few regexp on model.mets, model.rxns and model.comps, to ensure that they adhere to the SBML specification SId ::= ( letter | ’_’ ) idChar*.

Besides model.mets, model.rxns, and model.comps, what else fields should also be checked? Should the check be made by a new function (e.g. checkIDValidity) or included into the existing checkModelStruct.m function? @edkerk @simas232

simas232 commented 3 years ago

We should also check model.genes as I saw some rare examples a couple of years ago of genes having "G_" in the beginning.

This functionality seems to be a little bit out of scope from checkModelStruct function, but I suggest we do not create any new functions anyway.

edkerk commented 3 years ago

To my knowledge that would indeed cover all cases, model.mets, model.rxns, model.comps and model.genes (I think COBRA Toolbox still adds the G_ prefix?).

I like the idea of incorporating it in checkModelStruct, just make sure that exportModel then always runs checkModelStruct, as this is not the default currently:

https://github.com/SysBioChalmers/RAVEN/blob/53fe2b700f86dce9b46be70f097cc7f898b509e9/io/exportModel.m#L11-L12 https://github.com/SysBioChalmers/RAVEN/blob/53fe2b700f86dce9b46be70f097cc7f898b509e9/io/exportModel.m#L64-L66

haowang-bioinfo commented 3 years ago

The idea of having a few regular expressions check whether the IDs are valid appears to be a pragmatic solution for now.

A section for validating id format with regular expressions is added to checkModelStruct by commit 66f5a0ee7c3d40fb9e26c09c3b0f72c2612b672c.

Next is to enable checkModelStruct by default in exportModel, probably also in importModel? @edkerk @simas232

edkerk commented 3 years ago

Not sure if it is necessary for importModel, as invalid SBML would have resulted in an error when running importModel.

Will try out the new commit soon.

edkerk commented 2 years ago

A more recent concern I'm having is that many models use BiGG IDs that can start with digits. ~~Considering that this PR would follow the same philosphy of cobrapy (do not change identifiers),~~ how are metabolite IDs represented in cobrapy if the model uses BiGG IDs as metabolite identifiers? Or actually, how are BiGG IDs represented in those SBML files? They must have some prefix.

edit: cobrapy also uses prefixes with I/O of SBML files (although not with yml or JSON!). The prefix is really a COBRA-style SBML "feature".

Now, if one would load a YML whose identifiers are BiGG IDs (e.g. 6pgc), this model cannot be exported to SBML as 6pgc is not a valid identifier and exportModel does not add a prefix. I/O of YML is something we want to support, as part of a YML-based curation pipeline as advocated by human-GEM.

So, what if SBML files are still I/O in agreement with the COBRA-style prefixes? Meanwhile, I/O to yml and export to .txt or .xlsx do not add/remove prefixes. Now, the model in MATLAB has the same identifiers as in the yml, txt and xlsx files, while the SBML file is in COBRA-style. Now, the SBML file is handled identically by COBRA and cobrapy.

mihai-sysbio commented 2 years ago

There are many different options to handle the exceptions introduced by SBML. For me, uniformity and absence of surprises are important. This is why I really like that:

I/O to yml and export to .txt or .xlsx do not add/remove prefixes

the model in MATLAB has the same identifiers as in the yml, txt and xlsx files

We already know from MetaNetX that prefixed IDs in SBML when they are not needed is bad. Moving forward we would at least have to make this conditional.

I'm tempted to go as far as completely dropping support for prefixing model components. For the cases when users need to have prefixes like in the aforementioned BiGG models, they could a) do this themselves or b) do the SBML export with cobratoolbox.

edkerk commented 2 years ago

I'm tempted to go as far as completely dropping support for prefixing model components. For the cases when users need to have prefixes like in the aforementioned BiGG models, they could a) do this themselves or b) do the SBML export with cobratoolbox.

We can then have default I/O behaviour to leave identifiers untouched, but include a COBRAstyle flag to replicate COBRA's behaviour of prefix addition/removal.

Regardless, if we change the default behaviour of importModel and exportModel, this might still be best handled as major release due to breaking backwards compatibility.

mihai-sysbio commented 2 years ago

Instead of COBRAstyle as a parameter, how about the following scenario: exporting an incompatible model crashes exportModel with a message asking the user to run a function called makeIdsSBMLFriendly, which renames all the broken ids in the model. This would leave the import/export untouched, keeping the functions modular. Imho an even better approach would be to trigger the same warning at the import stage as well.

edkerk commented 2 years ago

Instead of COBRAstyle as a parameter, how about the following scenario: exporting an incompatible model crashes exportModel with a message asking the user to run a function called makeIdsSBMLFriendly, which renames all the broken ids in the model.

But how will it rename those identifiers to make them valid SBML? Will it add the prefixes? In that case, wouldn't it make more sense to then add these prefixes all metabolites/reactions/genes, and basically follow COBRA-style? Would be weird to end up with some hybrid version.

What if checkModelStruct (that runs in the beginning of exportModel) not only throws a error when it finds invalid identifiers, but let it also suggest to either fix the identifiers manually, or alternatively save the model in COBRA format run addIdentifierPrefix, which will add M_, R_ and G_ to metabolites, reactions and genes? I'm not a fan of that either, because then you'll end up with a model in MATLAB with those prefixes attached, but we only want those prefixes to appear in the SBML. Adding a COBRAstyle flag to importModel/exportModel seems like a sensible way to do this? The prefix removal and addition should then not be done as it was previously, but just at either the beginning (exportModel) or end (importModel) of the function.

haowang-bioinfo commented 2 years ago

What if checkModelStruct (that runs in the beginning of exportModel) not only throws a error when it finds invalid identifiers, but let it also suggest to either fix the identifiers manually, or alternatively save the model in COBRA format run addIdentifierPrefix, which will add M, R and G_ to metabolites, reactions and genes? I'm not a fan of that either, because then you'll end up with a model in MATLAB with those prefixes attached, but we only want those prefixes to appear in the SBML. Adding a COBRAstyle flag to importModel/exportModel seems like a sensible way to do this? The prefix removal and addition should then not be done as it was previously, but just at either the beginning (exportModel) or end (importModel) of the function.

I think this is a pragmatic solution to the long-hanging #360.

The occurrence of IDs starting with number (e.g. BiGG met ids) is rather common, and has to be adapted to fit SBML scheme in order to improve the usability of exportModel. Look forward to the implementation as you proposed @edkerk

SysBioChalmers / RAVEN

refactor: remove prefix 'M_' and 'R_' in exprotModel/importModel functions #353

Description of the issue: