Open edkerk opened 6 years ago
Hi, A few things from my point of view:
metComps
as a double Vector might be faster but is more prone to model modifications leading to out of sync fields (I would personally prefer a cell array of strings representing the ID of the compartment instead of the position in the comps vector).rxnMiriams
and similar: I'm missing a qualifier field in the struct. Also, is this a struct array or a single struct with 2 fields. (If the latter, wouldn't a struct array make more sense?)rev
We decided to drop this field mainly because it was used as an indicator for flux reversibility, which is goverened by the flux bounds. I think we could bring it back as an annotation type of information on whether a reaction could be reversible physiologically (e.g. according to data derived from eQuilibrator or similar). But I would really want to avoid this field to indicate flux reversibility, as that information is otherwise redundant and already present in lb
.rxnComps
This field (same as geneComps
) looks odd to me. Reactions can have multiple compartments (e.g. transport), and those are derived from the participating metabolites, and for genes, while yes, their gene product could be localised to a specific compartment, this should be represented by different GPR rules for reactions in different compartments (or am I missing something here)?Oh and one more thing: We recently introduced fields for additional Variables/Constraints to prepare for concepts like Enzyme Capacity constraints (as provided by GECKO). Those fields hold information which is separate from Metabolites/Reactions but interacts with those. While the core structures are now available in the Toolbox, we are waiting for an update in the SBML-FBC package (which should come around some time in the near future) that allows a proper definition of those constraints instead of using artificial metabolites/reactions for them.
metComps
as cell array of strings would work just as well, it would reduce the speed of some function but this is probably negligble, while it is more intuitive and will reduce the risk of mistakes.rxnMiriams
indeed don't have a qualifier field, it has now only been used for annotating reactions and metabolites to databases, so that most of these qualifiers would be of the is
type. Including qualifiers would be no problem for the code. Regarding the format, I'm a bit confused about the exact Matlab terminology, but this is what it looks like:rev
can probably be depricated, but should double-check this in each function that refers to it, although I'm pretty sure that using the lb
value to indicate reversibility will work just as well. As workaround, we could generate the rev
field in the beginning of those functions that depend on it. It could indeed be used as annotation field, but then the current Matlab code should probably still not use it. Might be better to somehow indicate this in the rxnMiriams
?rxnComps
and geneComps
I also don't see the point to have them, happy to leave them out, 99% sure it won't have any effects.Regarding rev
, I would advice to still allow it as an input on RAVEN functions that require it, and only recreate it from lb
/ub
in case it is not present, for backwards compatibility.
For instance, in GECKO we first recreate rev
before calling convertToIrrev.m
, so we can create a backwards rxn for reversible rxns + exchange rxns, the latter even if they have a lb=0
(useful later for simulations). The reasoning here is that not because a rxn is irreversible under a certain simulated condition (reflected in lb
/ub
), means that it will be irreversible under all conditions (reflected in rev
).
@BenjaSanchez in that case, you also don't use the model-provided rev
field, you only recreate it for one specific purpose. So if convertToIrrev
would first recreate rev
by the same criteria as you use now (from lb
= 0 etc.) and the rest of convertToIrrev
is kept as-is, then this function could then still be used largely unchanged for this purpose?
I suppose the question is whether you perform other changes on the model (add reactions etc.) while you want to retain the rev
field. Because that would then mean that addRxns
and similar functions all support the rev
field.
@edkerk yes, however I don't know if e.g. leaving exchange rxns with rev=true
is always desired for any user of convertToIrrev
. In that way, I think it's more sensible to respect their own rev
field if it exists and it makes sense for them. But this is I think only for the case of convertToIrrev
, I do not have strong feelings about rev
in any other way (and I agree that it's redundant with lb
). So alternatively, rev
could just be an elective input for convertToIrrev
for users that require some rxns with lb=0
to still be created in both directions.
@edkerk wrt to xyzMiriams
:
If I get this right, what you have is a cell array of structs where each struct has two fields storing the values. Was there a reason to make this a cell array instead of a struct array?
From a COBRA point of view the biggest issue would be, that we would break backward compatability if people used the individual (defined) annotation fields outside of the toolbox.
@BenjaSanchez I agree that people might want to define rev, but historically, it was used as a model reversibility indicator and, unfortunately is present in old model mat files in exactly this fashion.
This is one of the reasons, why we actually remove it during IO from mat files, and why I would suggest to rename it to something that more clearly indicates, that this is a type of annotation instead of a model property.
If we keep the name rev, people loading old models via load could end up with things they actually don't want.
For convertToIrrev
I think it would anyways be good to add an option that allows either specific (rev vector), or all reactions (single true/false) to be created in both directions irrespective of lower bounds.
@tpfau, xyzMiriams
fields are mainly beneficial for semi-automatic model reconstruction from KEGG (and now also MetaCyc). However, such format may be changed if necessary. My quick idea would be to organise such data as subSystems
for simplicity, but this would need to be thoroughly checked anyway.
These xyzMiriams
fields can be considered as all the annotation which lies in annotation
section in SBML structure. There are several exceptions though, e.g. EC-numbers and InChI strings have their own fields in RAVEN structure, so such information is not included in xyzMiriams
.
I believe that in COBRA-RAVEN consensus model structure it would be beneficial to move the least important annotation to xyzMiriams
. This would allow to drop the hardcoded field names in COBRA structure and would also simplify model I/O and modification functions to add or delete rxns, genes, mets. We just would need to decide which information is important enough to be stored separately in designated fields.
It could also be that the list of COBRA annotation fields is not hardcoded (sorry, I didn't check that). In such case, we can use COBRA annotation fields specification in consensus structure. Then the main problem would be to apply these changes to RAVEN.
@simas232 The 'whats important or not' question is one reason why we don't have a annotation struct. Its simply up to the user what she/he thinks is important and we don't take the decision. For anything we don't know, we will write a kind of descriptive field name, but things we do know will get parsed into specific fields (a list which we are happy to extend, but which needs no changes in the code). An additional argument (not mine) for having the individual fields in our model structure was the aim of a "flat" data structure, i.e. no substructures, and admittedly I would not expect a cell array of (similar) structs but rather a struct array for something like xyzMiriams
.
One important thing to note is that our model modification functions will, in general, automatically detect coupled fields (i.e. undefined fields with the same length as the modified base field), and update those fields accordingly. This allows the use of fields which are not explicitly defined to be kept in sync with the model structure (as long as they don't contain positional information, which we can only handle for known fields like rules). This is also one of the reasons why I would strongly prefer metComps
to contain the IDs instead of the numbers, as a removeCompartment
function would otherwise have to redo the metComps
field to get the numbers back in sync, while it can simply just remove the elements associated with the compartment otherwise i.e. with metComps as ids we would have:
function model = removeCompartment(model,compartment);
rxnsToRemove = findRxnsFromMets(model,model.mets(strcmp(model.metComps,compartment)));
model = removeFieldEntriesForType(model,strcmp(model.comps,compartment));
model = removeRxns(model,rxnsToRemove);
model = removeUnusedGenes(model);
While with metComps being a double array we would need:
function model = removeCompartment(model,compartment);
compPos = find(strcmp(compartment,model.comps));
model = removeFieldEntriesForType(model,strcmp(model.comps,compartment));
rxnsToRemove = findRxnsFromMets(model,model.mets(model.metComps == compPos));
model = removeRxns(model,rxnsToRemove);
model = removeUnusedGenes(model);
% code to adjust the model.metComps field.
posToChange = model.metComps > compPos;
model.metComps(posToChange) = model.metComps(posToChange) - 1
Admittedly this is not too bad but it is something that might be necessary again and again in functions that manipulate compartments, e.g. when merging two models or similar things.
@tpfau, thank you for a comprehensive explanation. I reckon it is okay to abolish numeric metComps
field. A numeric metComps
field is mainly needed by predictLocalization
function, but since there is always only one compartment in the model before running this function, metComps
can be temporarily created and deleted as soon as predictLocalization
is complete. Regarding rxnComps
, well, even though they are arbitrary for e.g. antiport reactions, there is a designated field in SBML structure for it, introduced in Level 3, so why not import such data if it is available in XML file. I don't recall if geneComps
has its field in SBML structure, though.
I did some I/O testing regarding the annotation with the latest COBRA version and was impressed with the results. The annotation fields are not hardcoded at all and whenever you remove anything in the model, all the relevant fields are updated accordingly by removeFieldEntriesForType
function.
Similarly like it is done in COBRA, RAVEN does not check whether the database identifier is defined in identifiers.org, so the only difference between both model structures is the the format in which annotation data is stored. I therefore do support that we abolish xyzMiriams
and use COBRA standard instead.
@edkerk, @Hao-Chalmers, @BenjaSanchez, @IVANDOMENZAIN, @JonathanRob, @danieljcook, do you foresee any potential problems if we discard xyzMiriams
field and use COBRA standard instead? I think it is fine with the model reconstruction module from KEGG.
Regarding
rxnComps
, well, even though they are arbitrary for e.g. antiport reactions, there is a designated field in SBML structure for it, introduced in Level 3, so why not import such data if it is available in XML file. I don't recall ifgeneComps
has its field in SBML structure, though.
I wasn't aware of that field, as I haven't seen it used yet. But well, it doesn't hurt too much (its not very big), to have such a field.
I did some I/O testing regarding the annotation with the latest COBRA version and was impressed with the results. The annotation fields are not hardcoded at all and whenever you remove anything in the model, all the relevant fields are updated accordingly by
removeFieldEntriesForType
function.
Thanks, that was quite some work :)
Similarly like it is done in COBRA, RAVEN does not check whether the database identifier is defined in identifiers.org, so the only difference between both model structures is the the format in which annotation data is stored. I therefore do support that we abolish xyzMiriams and use COBRA standard instead.
We actually do check IDs for defined annotation fields. i.e. were miriam has a spec, we try to use valid IDs. Only for fields which we have not explicitly encoded, this step is skipped.
@simas232 regarding metComps
, it's not about completely abolishing it, it is about turning it into a cell array of strings, instead of numeric vector. So future model.metComps
would be the same as what would now be obtained via model.metComps(model.comps)
.
@tpfau would you then advocate for only using metComps
as compartment identifier, and abandon storing this as part of the mets
field?
Moving towards a cell array of strings for metComps
, I suppose we would retain comps
and compNames
to annotate what the metComps
IDs refer to?
Regarding xyzMiriams
, to my knowledge this is not used by any algorithm in RAVEN, it is just used to store the annotation. So the COBRA standard seems fine, I was not aware that it was so flexible. removeFieldEntriesForType
sounds great! What about the case is multiple annotations can be linked to the same reaction or metabolite? MetaNetX IDs are for instance not unique.
What about the case is multiple annotations can be linked to the same reaction or metabolite? MetaNetX IDs are for instance not unique.
The multiples values could be separated by semicolon, which is used in some fields (e.g. eccodes
, rxnReferences
) and seems suit this scenario.
@tpfau would you then advocate for only using
metComps
as compartment identifier, and abandon storing this as part of themets
field?
Yes, and no.
Since we need distinct identifiers for the individual metabolites the compartment can still be used for distinction. However what I would advocate, and what would require quite a lot of rework on several toolbox functions, is to not use the information in the mets field. I.e. that is purely an ID and not used as an additional piece of information. In my opinion, there should be no knowledge difference between a mets field that reads {'Met1','Met2','Met3'}
and a mets field that reads {'glc_d[e]','fru[c]','pyr[m]'}. All the information on the compound should be stored in fields different to its ID (e.g. database IDs, chemical formulas etc pp).
Moving towards a cell array of strings for
metComps
, I suppose we would retaincomps
andcompNames
to annotate what themetComps
IDs refer to?
Yes, that would be the idea.
Regarding
xyzMiriams
, to my knowledge this is not used by any algorithm in RAVEN, it is just used to store the annotation. So the COBRA standard seems fine, I was not aware that it was so flexible.removeFieldEntriesForType
sounds great! What about the case is multiple annotations can be linked to the same reaction or metabolite? MetaNetX IDs are for instance not unique.
Currently (we could change this), we store muliple IDs by separating them with '; '
, and I hope it will not come up as an issue.
Otherwise we could try to change annotations to second order cell arrays of strings.
@edkerk
wrt removeFieldEntriesForType
, there are a bunch of functions for dynamic model modifications:
removeFieldEntriesForType
- to remove individual positions
updateFieldOrderForType
- Which allows to update the order of elements in a field
extendModelFieldsForType
- Extends all model fields associated with the given type with default values
getModelFieldsForType
- OBtain all fields that are associated with a given base field (rxns, mets, genes etc pp)
Thats perfectly fine with me. One question though: If no version is provided, whats the default we put there? Or do we just not add it. And: How should that version be exported to e.g. SBML. The latter is particularily important, as we cannot necessarily guarantee that IO is not lossless (e.g. if the SBML uses packages we don't support). Currently if there is a model "is" association in SBML, the COBRA toolbox will annotate it as "isDerivedFrom", and only if someone actively sets a 'is' in the model, we assume they know what they do and export as 'is'
If no version is provided, whats the default we put there? Or do we just not add it.
My idea is to make it optional, so just leave it out if no version is provided in the model.
And: How should that version be exported to e.g. SBML.
No clue about this, because SBML spec is out of our hands. In the beginning, we probably just implement this into Matlab model structure when dealing with repo-type GEMs, and push this forward into SBML spec if it turns out a good idea.
There are many benefits in having RAVEN functioning as submodule of the COBRA toolbox. While RAVEN 2.0 is largely compatible with COBRA through use of the
ravenCobraWrapper
function and using unique function names, a complete alignment of model structure between the two toolboxes would be highly preferred.To initiate this process, here are the definitions of the RAVEN fields.
The main discrepancies are:
metComps
field, instead of detailing this in metabolite IDsmetMiriams
and similar fields, instead ofmetKEGGID
and similar fieldsHow should these fields be unified? There is interest from COBRA to move towards the use of the
metComps
field, but what about other discrepancies? What other discrepancies exist?