feat: align RAVEN and COBRA model fields

edkerk commented 6 years ago

There are many benefits in having RAVEN functioning as submodule of the COBRA toolbox. While RAVEN 2.0 is largely compatible with COBRA through use of the ravenCobraWrapper function and using unique function names, a complete alignment of model structure between the two toolboxes would be highly preferred.

To initiate this process, here are the definitions of the RAVEN fields.

The main discrepancies are:

RAVEN documents compartments in the metComps field, instead of detailing this in metabolite IDs
RAVEN documents annotations in metMiriams and similar fields, instead of metKEGGID and similar fields

How should these fields be unified? There is interest from COBRA to move towards the use of the metComps field, but what about other discrepancies? What other discrepancies exist?

tpfau commented 6 years ago

Hi, A few things from my point of view:

metComps as a double Vector might be faster but is more prone to model modifications leading to out of sync fields (I would personally prefer a cell array of strings representing the ID of the compartment instead of the position in the comps vector).
rxnMiriams and similar: I'm missing a qualifier field in the struct. Also, is this a struct array or a single struct with 2 fields. (If the latter, wouldn't a struct array make more sense?)
rev We decided to drop this field mainly because it was used as an indicator for flux reversibility, which is goverened by the flux bounds. I think we could bring it back as an annotation type of information on whether a reaction could be reversible physiologically (e.g. according to data derived from eQuilibrator or similar). But I would really want to avoid this field to indicate flux reversibility, as that information is otherwise redundant and already present in lb.
rxnComps This field (same as geneComps) looks odd to me. Reactions can have multiple compartments (e.g. transport), and those are derived from the participating metabolites, and for genes, while yes, their gene product could be localised to a specific compartment, this should be represented by different GPR rules for reactions in different compartments (or am I missing something here)?

tpfau commented 6 years ago

Oh and one more thing: We recently introduced fields for additional Variables/Constraints to prepare for concepts like Enzyme Capacity constraints (as provided by GECKO). Those fields hold information which is separate from Metabolites/Reactions but interacts with those. While the core structures are now available in the Toolbox, we are waiting for an update in the SBML-FBC package (which should come around some time in the near future) that allows a proper definition of those constraints instead of using artificial metabolites/reactions for them.

edkerk commented 6 years ago

metComps as cell array of strings would work just as well, it would reduce the speed of some function but this is probably negligble, while it is more intuitive and will reduce the risk of mistakes.
rxnMiriams indeed don't have a qualifier field, it has now only been used for annotating reactions and metabolites to databases, so that most of these qualifiers would be of the is type. Including qualifiers would be no problem for the code. Regarding the format, I'm a bit confused about the exact Matlab terminology, but this is what it looks like:

rev can probably be depricated, but should double-check this in each function that refers to it, although I'm pretty sure that using the lb value to indicate reversibility will work just as well. As workaround, we could generate the rev field in the beginning of those functions that depend on it. It could indeed be used as annotation field, but then the current Matlab code should probably still not use it. Might be better to somehow indicate this in the rxnMiriams?
regarding rxnComps and geneComps I also don't see the point to have them, happy to leave them out, 99% sure it won't have any effects.

BenjaSanchez commented 6 years ago

Regarding rev, I would advice to still allow it as an input on RAVEN functions that require it, and only recreate it from lb/ub in case it is not present, for backwards compatibility.

For instance, in GECKO we first recreate rev before calling convertToIrrev.m, so we can create a backwards rxn for reversible rxns + exchange rxns, the latter even if they have a lb=0 (useful later for simulations). The reasoning here is that not because a rxn is irreversible under a certain simulated condition (reflected in lb/ub), means that it will be irreversible under all conditions (reflected in rev).

edkerk commented 6 years ago

@BenjaSanchez in that case, you also don't use the model-provided rev field, you only recreate it for one specific purpose. So if convertToIrrev would first recreate rev by the same criteria as you use now (from lb = 0 etc.) and the rest of convertToIrrev is kept as-is, then this function could then still be used largely unchanged for this purpose?

I suppose the question is whether you perform other changes on the model (add reactions etc.) while you want to retain the rev field. Because that would then mean that addRxns and similar functions all support the rev field.

BenjaSanchez commented 6 years ago

@edkerk yes, however I don't know if e.g. leaving exchange rxns with rev=true is always desired for any user of convertToIrrev. In that way, I think it's more sensible to respect their own rev field if it exists and it makes sense for them. But this is I think only for the case of convertToIrrev, I do not have strong feelings about rev in any other way (and I agree that it's redundant with lb). So alternatively, rev could just be an elective input for convertToIrrev for users that require some rxns with lb=0 to still be created in both directions.

tpfau commented 6 years ago

@edkerk wrt to xyzMiriams: If I get this right, what you have is a cell array of structs where each struct has two fields storing the values. Was there a reason to make this a cell array instead of a struct array?
From a COBRA point of view the biggest issue would be, that we would break backward compatability if people used the individual (defined) annotation fields outside of the toolbox.

@BenjaSanchez I agree that people might want to define rev, but historically, it was used as a model reversibility indicator and, unfortunately is present in old model mat files in exactly this fashion. This is one of the reasons, why we actually remove it during IO from mat files, and why I would suggest to rename it to something that more clearly indicates, that this is a type of annotation instead of a model property. If we keep the name rev, people loading old models via load could end up with things they actually don't want. For convertToIrrev I think it would anyways be good to add an option that allows either specific (rev vector), or all reactions (single true/false) to be created in both directions irrespective of lower bounds.

simas232 commented 6 years ago

@tpfau, xyzMiriams fields are mainly beneficial for semi-automatic model reconstruction from KEGG (and now also MetaCyc). However, such format may be changed if necessary. My quick idea would be to organise such data as subSystems for simplicity, but this would need to be thoroughly checked anyway. These xyzMiriams fields can be considered as all the annotation which lies in annotation section in SBML structure. There are several exceptions though, e.g. EC-numbers and InChI strings have their own fields in RAVEN structure, so such information is not included in xyzMiriams. I believe that in COBRA-RAVEN consensus model structure it would be beneficial to move the least important annotation to xyzMiriams. This would allow to drop the hardcoded field names in COBRA structure and would also simplify model I/O and modification functions to add or delete rxns, genes, mets. We just would need to decide which information is important enough to be stored separately in designated fields. It could also be that the list of COBRA annotation fields is not hardcoded (sorry, I didn't check that). In such case, we can use COBRA annotation fields specification in consensus structure. Then the main problem would be to apply these changes to RAVEN.

tpfau commented 6 years ago

@simas232 The 'whats important or not' question is one reason why we don't have a annotation struct. Its simply up to the user what she/he thinks is important and we don't take the decision. For anything we don't know, we will write a kind of descriptive field name, but things we do know will get parsed into specific fields (a list which we are happy to extend, but which needs no changes in the code). An additional argument (not mine) for having the individual fields in our model structure was the aim of a "flat" data structure, i.e. no substructures, and admittedly I would not expect a cell array of (similar) structs but rather a struct array for something like xyzMiriams. One important thing to note is that our model modification functions will, in general, automatically detect coupled fields (i.e. undefined fields with the same length as the modified base field), and update those fields accordingly. This allows the use of fields which are not explicitly defined to be kept in sync with the model structure (as long as they don't contain positional information, which we can only handle for known fields like rules). This is also one of the reasons why I would strongly prefer metComps to contain the IDs instead of the numbers, as a removeCompartment function would otherwise have to redo the metComps field to get the numbers back in sync, while it can simply just remove the elements associated with the compartment otherwise i.e. with metComps as ids we would have:

function model = removeCompartment(model,compartment);
rxnsToRemove = findRxnsFromMets(model,model.mets(strcmp(model.metComps,compartment)));
model = removeFieldEntriesForType(model,strcmp(model.comps,compartment));
model = removeRxns(model,rxnsToRemove);
model = removeUnusedGenes(model);

While with metComps being a double array we would need:

function model = removeCompartment(model,compartment);
compPos = find(strcmp(compartment,model.comps));
model = removeFieldEntriesForType(model,strcmp(model.comps,compartment));
rxnsToRemove = findRxnsFromMets(model,model.mets(model.metComps == compPos));
model = removeRxns(model,rxnsToRemove);
model = removeUnusedGenes(model);
% code to adjust the model.metComps field.
posToChange = model.metComps > compPos;
model.metComps(posToChange) = model.metComps(posToChange) - 1

Admittedly this is not too bad but it is something that might be necessary again and again in functions that manipulate compartments, e.g. when merging two models or similar things.

simas232 commented 6 years ago

@tpfau, thank you for a comprehensive explanation. I reckon it is okay to abolish numeric metComps field. A numeric metComps field is mainly needed by predictLocalization function, but since there is always only one compartment in the model before running this function, metComps can be temporarily created and deleted as soon as predictLocalization is complete. Regarding rxnComps, well, even though they are arbitrary for e.g. antiport reactions, there is a designated field in SBML structure for it, introduced in Level 3, so why not import such data if it is available in XML file. I don't recall if geneComps has its field in SBML structure, though. I did some I/O testing regarding the annotation with the latest COBRA version and was impressed with the results. The annotation fields are not hardcoded at all and whenever you remove anything in the model, all the relevant fields are updated accordingly by removeFieldEntriesForTypefunction. Similarly like it is done in COBRA, RAVEN does not check whether the database identifier is defined in identifiers.org, so the only difference between both model structures is the the format in which annotation data is stored. I therefore do support that we abolish xyzMiriams and use COBRA standard instead.

@edkerk, @Hao-Chalmers, @BenjaSanchez, @IVANDOMENZAIN, @JonathanRob, @danieljcook, do you foresee any potential problems if we discard xyzMiriams field and use COBRA standard instead? I think it is fine with the model reconstruction module from KEGG.

tpfau commented 6 years ago

Regarding rxnComps, well, even though they are arbitrary for e.g. antiport reactions, there is a designated field in SBML structure for it, introduced in Level 3, so why not import such data if it is available in XML file. I don't recall if geneComps has its field in SBML structure, though.

I wasn't aware of that field, as I haven't seen it used yet. But well, it doesn't hurt too much (its not very big), to have such a field.

I did some I/O testing regarding the annotation with the latest COBRA version and was impressed with the results. The annotation fields are not hardcoded at all and whenever you remove anything in the model, all the relevant fields are updated accordingly by removeFieldEntriesForTypefunction.

Thanks, that was quite some work :)

Similarly like it is done in COBRA, RAVEN does not check whether the database identifier is defined in identifiers.org, so the only difference between both model structures is the the format in which annotation data is stored. I therefore do support that we abolish xyzMiriams and use COBRA standard instead.

We actually do check IDs for defined annotation fields. i.e. were miriam has a spec, we try to use valid IDs. Only for fields which we have not explicitly encoded, this step is skipped.

edkerk commented 6 years ago

@simas232 regarding metComps, it's not about completely abolishing it, it is about turning it into a cell array of strings, instead of numeric vector. So future model.metComps would be the same as what would now be obtained via model.metComps(model.comps).

@tpfau would you then advocate for only using metComps as compartment identifier, and abandon storing this as part of the mets field?

Moving towards a cell array of strings for metComps, I suppose we would retain comps and compNames to annotate what the metComps IDs refer to?

Regarding xyzMiriams, to my knowledge this is not used by any algorithm in RAVEN, it is just used to store the annotation. So the COBRA standard seems fine, I was not aware that it was so flexible. removeFieldEntriesForType sounds great! What about the case is multiple annotations can be linked to the same reaction or metabolite? MetaNetX IDs are for instance not unique.

haowang-bioinfo commented 6 years ago

What about the case is multiple annotations can be linked to the same reaction or metabolite? MetaNetX IDs are for instance not unique.

The multiples values could be separated by semicolon, which is used in some fields (e.g. eccodes, rxnReferences) and seems suit this scenario.

tpfau commented 6 years ago

@tpfau would you then advocate for only using metComps as compartment identifier, and abandon storing this as part of the mets field?

Yes, and no. Since we need distinct identifiers for the individual metabolites the compartment can still be used for distinction. However what I would advocate, and what would require quite a lot of rework on several toolbox functions, is to not use the information in the mets field. I.e. that is purely an ID and not used as an additional piece of information. In my opinion, there should be no knowledge difference between a mets field that reads {'Met1','Met2','Met3'} and a mets field that reads {'glc_d[e]','fru[c]','pyr[m]'}. All the information on the compound should be stored in fields different to its ID (e.g. database IDs, chemical formulas etc pp).

Moving towards a cell array of strings for metComps, I suppose we would retain comps and compNames to annotate what the metComps IDs refer to?

Yes, that would be the idea.

Regarding xyzMiriams, to my knowledge this is not used by any algorithm in RAVEN, it is just used to store the annotation. So the COBRA standard seems fine, I was not aware that it was so flexible. removeFieldEntriesForType sounds great! What about the case is multiple annotations can be linked to the same reaction or metabolite? MetaNetX IDs are for instance not unique.

Currently (we could change this), we store muliple IDs by separating them with '; ', and I hope it will not come up as an issue. Otherwise we could try to change annotations to second order cell arrays of strings.

tpfau commented 6 years ago

@edkerk wrt removeFieldEntriesForType, there are a bunch of functions for dynamic model modifications: removeFieldEntriesForType - to remove individual positions updateFieldOrderForType - Which allows to update the order of elements in a field extendModelFieldsForType - Extends all model fields associated with the given type with default values getModelFieldsForType - OBtain all fields that are associated with a given base field (rxns, mets, genes etc pp)

haowang-bioinfo commented 5 years ago

I'd like to recommend an optional filed version to this upcoming unified model structure. It is new, but should be super helpful for storing the model version (e.g. 8.3.3) to the ones that have been deposited and curated as public repositories (e.g. yeast-GEM, Sco4).

tpfau commented 5 years ago

Thats perfectly fine with me. One question though: If no version is provided, whats the default we put there? Or do we just not add it. And: How should that version be exported to e.g. SBML. The latter is particularily important, as we cannot necessarily guarantee that IO is not lossless (e.g. if the SBML uses packages we don't support). Currently if there is a model "is" association in SBML, the COBRA toolbox will annotate it as "isDerivedFrom", and only if someone actively sets a 'is' in the model, we assume they know what they do and export as 'is'

haowang-bioinfo commented 5 years ago

If no version is provided, whats the default we put there? Or do we just not add it.

My idea is to make it optional, so just leave it out if no version is provided in the model.

And: How should that version be exported to e.g. SBML.

No clue about this, because SBML spec is out of our hands. In the beginning, we probably just implement this into Matlab model structure when dealing with repo-type GEMs, and push this forward into SBML spec if it turns out a good idea.

SysBioChalmers / RAVEN

feat: align RAVEN and COBRA model fields #184