SysBioChalmers / yeast-GEM

The consensus GEM for Saccharomyces cerevisiae
http://sysbiochalmers.github.io/yeast-GEM/
Creative Commons Attribution 4.0 International
94 stars 43 forks source link

fix: compatibility with memote #106

Closed BenjaSanchez closed 6 years ago

BenjaSanchez commented 6 years ago

Description of the issue:

We would like to include memote as provider of QC, as we are about to include many new genes/reactions/mets, and it would be best to see how the model changes as we include them. In order to do that we need to:

Please comment here if anyone finds more problems using memote @hongzhonglu @feiranl @edkerk @ChristianLieven @Midnighter

I hereby confirm that I have:

Midnighter commented 6 years ago

Super excited to hear that and to soon see memote in action with an important model. Some comments:

No SBO terms detected for metabolites/reactions/genes. This is probably leading to some of the other problems, so should be fixed first.

Reading of SBO terms will be fixed when https://github.com/opencobra/cobrapy/pull/685 gets merged. I'm trying to wrap up the open PRs on cobrapy but I can't give you an ETA. It will also require memote to know the full SB ontology which it currently doesn't. So maybe those features can land at the same time.

Apart from that we still need to make the memote report a bit more resilient so that errors when they do occur do not break the formatting.

Only 1 unique metabolite

Very curious, we'll have to look into that.

Model identifier is "COBRAmodel", should be "yeastGEM"

Funny, probably also related to the SBML parser.

BenjaSanchez commented 6 years ago

@Midnighter solving issue #19 actually solved many of the erroring tests. Commenting on some remaining ones:

only 1 unique metabolite:

The only unique metabolite is ["s"]. Might this be because of the format we have for annotating metabolites, e.g. s_1234[c], s_2345[e], etc? How is a unique metabolite determined by memote?

model identifier = "COBRAmodel":

When looking at the 3rd line of the SBML file:

<model metaid="yeastGEM.xml" id="COBRAModel" name="Model Exported from COBRA Toolbox" fbc:strict="true">

Not sure about these distinctions, is there a reason to parse id instead of metaid? @tpfau can writeCbModel.m modify the model id?

ChristianLieven commented 6 years ago

The only unique metabolite is ["s"]. Might this be because of the format we have for annotating metabolites, e.g. s_1234[c], s_2345[e], etc? How is a unique metabolite determined by memote?

Good one! Yes that's the issue precisely! The code in memote is:

def find_unique_metabolites(model):
    """Return set of metabolite IDs without duplicates from compartments."""
    # TODO: BiGG specific (met_c).
    return set(met.id.split("_", 1)[0] for met in model.metabolites)

We'll try to find a solution here that handles this appropriately!

BenjaSanchez commented 6 years ago

@ChristianLieven maybe comparing same metabolite.name but different metabolite.compartment? Would still fail for us as we store names as e.g. glucose [cytoplasm], but we should change that anyways :)

Midnighter commented 6 years ago

Can you try again with memote 0.7.4, please? We just merged code that finds duplicate metabolites based on annotation.

ChristianLieven commented 6 years ago

Not sure about these distinctions, is there a reason to parse id instead of metaid? @tpfau can writeCbModel.m modify the model id?

I'm not an expert on SBML so I may be quoting the wrong place here:

3.3.2 Motivations for defining id and name on SBase
18 SBML Level 3 Version 2 Core puts the id and name attributes directly on the SBase abstract class. This
19 is a departure from previous Level/Version combinations of SBML, in which id and name were defined on
20 individual SBML component classes. The following were the motivations for this change.
21 • Object classes in SBML originally were given id attributes only if they had meaning for mathematical
22 expressions. (For instance, the concentrations of species, the sizes of compartments, etc.) As SBML
23 evolved, more objects were given id attributes but these were not always associated with a mathematical
24 meaning. When SBML Level 3 packages were introduced, not only did further uses of id become
25 apparent: it also became possible for Level 3 packages to add nuances of meanings that Level 3 Core did
26 not. Thus, while Level 3 Core may not define a use for an id attribute on a given object, a Level 3 package
27 might. Examples exist in the SBML Level 3 Graphical Layout and Hierarchical Model Composition
28 packages.
29 • The metaid attribute on SBase may seem to fulfill a similar role as an identifier for every object in
30 an SBML document. However, for several reasons it is unsuitable for use as a substitute for id. First,
31 owing to the fact that its data type is the XML type ID, it has a less restricted syntax (for instance,
32 allowing Unicode characters). Second, its scope cannot be ammended as needed for elements such as
33 the UnitDefinition or LocalParameter, because XML ID is defined to have document-wide uniqueness
34 properties. Finally, having a model-wide scope instead of a document-wide scope has been found desirable
35 for the SBML Level 3 Hierarchical Model Composition package.
36 • Because SBML Level 3 Version 2 Core objects may reference objects in SBML Level 3 packages directly,
37 and because package objects themselves may reference Core objects as well as objects from other
38 packages, the simplest means of achieving this is to put the id attribute on SBase directly.
39 • The name attribute is logically paired with id, to provide the option of a user-readable moniker for any
40 object with an id attribute. With id on SBase, it makes sense to put name on SBase too.
41 This design change (moving id and name) does not have an impact on the final written XML form of SBML
42 for constructs that had id and name attributes in SBML Level 3 Version 1 (or earlier versions of SBML).

And:

3.2.3 The metaid attribute
22 The metaid attribute is present for supporting metadata annotations using RDF (Resource Description
23 Format; Lassila and Swick, 1999). It has a data type of XML ID (the XML identifier type; see Section 3.1.6 on
24 p. 11), which means each metaid value must be globally unique within an SBML file. The metaid value serves
25 to identify a model component for purposes such as referencing that component from metadata placed within
26 annotation elements (see Section 3.2.6 on the following page). Such metadata can use RDF description
27 elements, in which an RDF attribute called “rdf:about” points to the metaid identifier of an object defined
28 in the SBML model. This topic is discussed in greater detail in Section 6 on p. 99.

From http://sbml.org/Special/specifications/sbml-level-3/version-2/core/release-1/sbml-level-3-version-2-core.pdf

As far as I understand, the metaID attribute is meant for RDF type annotation.

ChristianLieven commented 6 years ago

@ChristianLieven maybe comparing same metabolite.name but different metabolite.compartment? Would still fail for us as we store names as e.g. glucose [cytoplasm], but we should change that anyways :)

Name is a good option! Checking the ID is unrealiable, although unfortunately the name too is optional and hence may be an issue in other models. I can't really think of a good solution right now, will have to think about it some more.

Can you try again with memote 0.7.4, please? We just merged code that finds duplicate metabolites based on annotation.

Unique metabolites and duplicate metabolites are two different tests. The first is a mere statistical measure that reduces the amount of metabolites to a unique core for instance counting gluc_e, gluc_p and gluc_c as one entity, whereas the latter will try to identify instances where gluc_c and Xx_GLUC0SE_xX coexist in the cytosol.

tpfau commented 6 years ago

Not sure about these distinctions, is there a reason to parse id instead of metaid? @tpfau can writeCbModel.m modify the model id?

Without having read the full discussion: The COBRA Toolbox will modify IDs in 2 ways:

  1. If there is ANY ID in the model field which does not start with a valid start-symbol of an SBMLID (e.g. a 1TRANS reaction) than all ids from that field will get prefixed (R_ for reactions, M_ for metabolites etc). No prefix is applied if there are no invalid IDs in the field. So it is possible, that reactions get prefixed, while metabolites don't get prefixed. This is necessary as otherwise we have no way to be sure that a preceding R_ is indeed just a prefix, or if it is potentially a specific reaction name that we should not modify.
  2. If an ID is no valid SBML ID, it will be converted by replacing all non legal characters by ASCIINUMBER. (I'm currently planning and working on getting the actual model ID stored in an annotation field, as I really don't like the current restrictions/modifications)...

@ChristianLieven

Name is a good option! Checking the ID is unrealiable, although unfortunately the name too is optional and hence may be an issue in other models. I can't really think of a good solution right now, will have to think about it some more.

Does memote require BiGG IDs to be used? If not, I would suggest to look for unique metabolites by comparing annotations. If two bioql:is annotations are identical they should be the same. Of course not all models have annotations, but just using one type of id (BiGG in your instance) to do this is odd (imo).

ChristianLieven commented 6 years ago

Does memote require BiGG IDs to be used?

It should not, or at least I thought I had removed all requirements as to its usage. This one particular test has slipped past and will be fixed ASAP. Again, thanks @BenjaSanchez for spotting this! And, @tpfau indeed we'll compare the annotations now as this is already how we determine duplicate reactions.

Midnighter commented 6 years ago

If an ID is no valid SBML ID, it will be converted by replacing all non legal characters by ASCIINUMBER. (I'm currently planning and working on getting the actual model ID stored in an annotation field, as I really don't like the current restrictions/modifications)...

@tpfau at the moment @matthiaskoenig is writing a new SBML parser for cobrapy and he wants to do the same thing. Store SBML model IDs, make them Python compliant, restore the SBML IDs upon writing the model.

BenjaSanchez commented 6 years ago

@Midnighter ftr I tried it with memote 0.7.4 and the problem persists. On the other hand no duplicate metabolites are detected, so that at least works well :)

@tpfau before I was referring to the model id and metaid: If one updates the COBRA field description, it is transferred to the model metaid in the SBML file. But is it possible to modify to model id?

tpfau commented 6 years ago

@BenjaSanchez: Yes, the modelName field will be converted to a model ID during write. However, I just noticed, that this is not recovered during read. I'll fix that soon.

tpfau commented 6 years ago

Having said that: This is currently undocumented, and I think this will have to change (modelName should go to the name field, not the ID field...). I'm currently putting in a pr to change this to actually define modelID and modelName fields which will then contain ID and name respectively.

ChristianLieven commented 6 years ago

How is this going Ben? Do you need anything specific still from our side?

BenjaSanchez commented 6 years ago

@ChristianLieven there's an open PR #132 to include SBO terms, which solves several of the problems stated here (even though it creates some new ones), could you take a look and give us feedback?

BenjaSanchez commented 6 years ago

update: all tests failing in memote have been moved to new more specific issues: #135, #136, #137 and #138. This issue will hence be closed.