SysBioChalmers / RAVEN

The RAVEN Toolbox for genome scale model reconstruction, curation and analysis.
http://sysbiochalmers.github.io/RAVEN/

feat: addition of `metadata` section to the yaml file specification in RAVEN #311

Closed haowang-bioinfo closed 3 years ago

haowang-bioinfo commented 4 years ago

Description of the issue:

Expected changes:

BenjaSanchez commented 4 years ago

For additional context, below is the current metaData field in Human-GEM:

- metaData:
    short_name: "Human-GEM"
    full_name: "Generic genome-scale metabolic model of Homo sapiens"
    version: "1.4.0"
    date: "2020-06-12"
    authors: "Jonathan Robinson, Hao Wang, Pierre-Etienne Cholley, Pinar Kocabas"
    email: "jonrob@chalmers.se"
    organization: "Chalmers University of Technology"
    taxonomy: "9606"
    github: "https://github.com/SysBioChalmers/Human-GEM"
    description: "Genome-scale metabolic models are valuable tools to study metabolism and provide a scaffold for the integrative analysis of omics data. This is the latest version of Human-GEM, which is a genome-scale metabolic model of a generic human cell. The objective of Human-GEM is to serve as a community model for enabling integrative and mechanistic studies of human metabolism."

The new fields + modifications sound good to me. Additionally, it would be ideal if the field names in the yaml file matched the RAVEN spec names, for clarity. Below are the cases that don't match, based on what is already in RAVEN + the new names @Hao-Chalmers proposed:

| Field | Name in RAVEN | Name in HumanGEM.yml |
| --- | --- | --- |
| Model id | id | short_name |
| Model name | description | full_name |
| Authors | annotation.authorList | authors |
| URL where the model lives | annotation.sourceUrl | github |
| Additional comments | annotation.note | description |

IMO the RAVEN names for id and URL would be preferable, as the former is the main choice in the COBRA community (Matlab and Python), and the latter is more generic, as not all RAVEN models are stored on GitHub. Could those 2 fields change in HumanGEM.yml to id and source_url? @JonathanRob @mihai-sysbio

On the other hand, the .yml standard seems more adequate for model name, authors and comments (actually it's super confusing that the RAVEN field description is the model name and the field note contains a description). Would it make sense to change those 3 fields in RAVEN to fullName, annotation.authors and annotation.description?
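To make the mismatch concrete, the renaming in the table above can be sketched as a simple lookup. This is a minimal Python illustration, not RAVEN code: the function name `yml_to_raven` and the nested-dict layout are assumptions made for the example, and only the field names come from the table.

```python
# Hypothetical helper, not part of RAVEN: renames HumanGEM.yml metaData keys
# to the RAVEN field names listed in the table above.
YML_TO_RAVEN = {
    "short_name": "id",
    "full_name": "description",
    "authors": "annotation.authorList",
    "github": "annotation.sourceUrl",
    "description": "annotation.note",
}


def yml_to_raven(meta):
    """Return a nested dict keyed by RAVEN field names.

    Keys without an entry in YML_TO_RAVEN (e.g. version, date) are kept as-is.
    """
    model = {}
    for key, value in meta.items():
        target = YML_TO_RAVEN.get(key, key)
        node = model
        # Dotted targets such as "annotation.note" become nested dicts.
        *parents, leaf = target.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return model


meta = {
    "short_name": "Human-GEM",
    "full_name": "Generic genome-scale metabolic model of Homo sapiens",
    "version": "1.4.0",
    "github": "https://github.com/SysBioChalmers/Human-GEM",
}
print(yml_to_raven(meta))
```

Note how `description` means two different things on each side: the yml `description` lands in `annotation.note`, while the yml `full_name` lands in the RAVEN `description`.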

edkerk commented 4 years ago

Are there corresponding (or comparable) COBRA fields for fullName, annotation.authors and annotation.description?

mihai-sysbio commented 4 years ago

Here are the latest yml fields, as listed on COBRApy's devel branch. Imho, it doesn't look like a direct mapping of the RAVEN fields. Cobratoolbox has some rules for modelVersion, modelName and modelID.

The short-name is something meant to be as human-friendly as possible. For example, this field is what is shown in the navigation bar on Metabolic Atlas. I found this opencobra thread illustrative of the implications of the BiGG model id spec. Also, I would like to point out the distinct fields for short-name and version. To me, it is of little importance what the keyword for the value of short-name is. However, I am an advocate for its role: human-friendliness. Therefore, I would lean towards keeping this field closer to the standard-GEM naming rather than the BiGG id spec. Needless to say, in the case of versioned models, it is expected of this short-name to be the same as the repository name.

I support changing github to something else. A potential drawback of source_url is that, as a new person, I could find it confusing whether it is meant to be the link to the repository, or directly to the file itself on a model hosting platform. But maybe that's just me - and I can't come up with a better suggestion than source_url.

haowang-bioinfo commented 4 years ago

@BenjaSanchez the Expected changes of this issue have been updated as you recommended.

haowang-bioinfo commented 4 years ago

@edkerk according to the latest model spec in COBRA, the following four fields could be associated between RAVEN and COBRA.

| Field | Name in RAVEN | Name in COBRA |
| --- | --- | --- |
| Model id | id | modelID |
| Model name | name | modelName |
| Model version | version | modelVersion |
| Additional comments | annotation.note | description |

mihai-sysbio commented 4 years ago

@Hao-Chalmers would the Expected changes also include something about the shortName field?

haowang-bioinfo commented 4 years ago

@mihai-sysbio I don't think an additional shortName field is needed, since it is equivalent to the existing id field. Or are you suggesting renaming the field from id to shortName?

mihai-sysbio commented 4 years ago

I see. To me, an ID does not have to be human friendly, unlike shortName. I think it would be clearer if some examples were provided, maybe even both "good" and "bad". For example, a "bad" id would be h_sap13417__1_3_0, standing for a Homo sapiens model with 13417 reactions and corresponding to version 1.3.0.

haowang-bioinfo commented 4 years ago

@mihai-sysbio good point about providing examples, both of which can be added to the spec in the Wiki once a consensus is reached.

edkerk commented 3 years ago

So should HumanGEM's writeHumanYaml be integrated in RAVEN's writeYaml, thereby capturing this metadata?

haowang-bioinfo commented 3 years ago

> So should HumanGEM's writeHumanYaml be integrated in RAVEN's writeYaml, thereby capturing this metadata?

@edkerk full support

edkerk commented 3 years ago

It is not sufficient to just define fields in the RAVEN model structure, and support export to YML file format. SBML is still the de facto standard for model distribution, so these fields should also be properly stored there.

Related to this there are some unresolved issues:

  1. If we introduce version, where is this stored in the SBML file? As far as I can find, this is not covered by the SBML specification. I see two options:
    i. The version number can be appended to the model id, e.g. yeastGEM_v8_4_2. The benefit is that this is also loaded when using cobrapy or COBRA toolbox. However, would we then split the model id from SBML into two parts: (1) model.id and (2) model.version? In that case the model would have a different model id in RAVEN than in cobrapy, COBRA etc. To avoid problems, I would prefer not to run regexprep on any identifier.
    ii. Include the version number in the SBML as a model annotation, in a similar way as taxonomy, authors, organization etc. are included. See the example below. However, I don't know what tags to use, something related to <rdf>? Does standard-GEM have a role to play in this?
      Example from iYali, model annotation given from line 4.

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" xmlns:fbc="http://www.sbml.org/sbml/level3/version1/fbc/version2" xmlns:groups="http://www.sbml.org/sbml/level3/version1/groups/version1" level="3" version="1" fbc:required="false" groups:required="false">
  <model metaid="iYali" id="iYali" name="iYali" fbc:strict="true">
    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:vCard4="http://www.w3.org/2006/vcard/ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
        <rdf:Description rdf:about="#iYali">
          <dcterms:creator>
            <rdf:Bag>
              <rdf:li rdf:parseType="Resource">
                <vCard:N rdf:parseType="Resource">
                  <vCard:Family>Kerkhoven</vCard:Family>
                  <vCard:Given>Eduard</vCard:Given>
                </vCard:N>
                <vCard:EMAIL>eduardk@chalmers.se</vCard:EMAIL>
                <vCard:ORG rdf:parseType="Resource">
                  <vCard:Orgname>Chalmers University of Technology</vCard:Orgname>
                </vCard:ORG>
              </rdf:li>
            </rdf:Bag>
          </dcterms:creator>
          <dcterms:created rdf:parseType="Resource">
            <dcterms:W3CDTF>2021-04-05T10:27:05Z</dcterms:W3CDTF>
          </dcterms:created>
          <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2021-04-05T10:27:05Z</dcterms:W3CDTF>
          </dcterms:modified>
          <bqbiol:is>
            <rdf:Bag>
              <rdf:li rdf:resource="https://identifiers.org/taxonomy/4952"/>
            </rdf:Bag>
          </bqbiol:is>
        </rdf:Description>
      </rdf:RDF>
    </annotation>

  2. If id is used instead of shortName [and I would argue we should, as id is similar to modelID, model.id and <model id=""> as used in COBRA, cobrapy and SBML], then why use fullName and not just name? The latter is also more in line with other software and the SBML specification.
  3. In humanGEM.yml, date is also specified, should this be part of the RAVEN model structure? And what does this date reflect, when a new version was released? RAVEN generated SBML already includes the date that the file was created, but that's probably not what is meant here. Instead, the date should be set when the new version number is set, and absent if no version number is present?
  4. Where should sourceUrl be stored in the SBML? Also in annotation, as the second suggestion for version?
  5. Note that description is not problematic to store in the SBML, it is actually stored under <notes>. With that in mind, why change note to description? cobrapy has model.notes, and it is closer to the SBML specification.
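
The first option for storing the version (appending it to the model id) can be illustrated with a short sketch. It assumes the `<id>_v<major>_<minor>_<patch>` suffix convention seen in the yeastGEM_v8_4_2 example; the function name `split_versioned_id` is hypothetical, and Python is used purely for illustration since RAVEN itself is MATLAB.

```python
import re

# Assumed convention: "<id>_v<major>_<minor>_<patch>", as in "yeastGEM_v8_4_2".
_VERSIONED_ID = re.compile(r"^(?P<id>.+)_v(?P<version>\d+(?:_\d+)*)$")


def split_versioned_id(sbml_id):
    """Split an SBML model id into (id, version); version is None if absent."""
    match = _VERSIONED_ID.match(sbml_id)
    if match is None:
        return sbml_id, None
    # Underscores in the version suffix are turned back into dots.
    return match.group("id"), match.group("version").replace("_", ".")


print(split_versioned_id("yeastGEM_v8_4_2"))  # ('yeastGEM', '8.4.2')
print(split_versioned_id("iYali"))            # ('iYali', None)
```

The split shows the drawback raised in that option: afterwards the RAVEN id (`yeastGEM`) no longer equals the id that cobrapy or the COBRA Toolbox would read from the same file.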
haowang-bioinfo commented 3 years ago

@edkerk good arguments indeed.

@mihai-sysbio what do you think, if standard-GEM can help in adopting some fields?

edkerk commented 3 years ago

On second thought, perhaps it is better to move the discussion about incorporation in SBML into a separate issue, as the current issue is just about the MATLAB structure and the yaml file. The points that remain relevant are:

  1. Have a model.name field instead of model.fullName.
  2. Have a model.annotation.note field instead of model.annotation.description.

mihai-sysbio commented 3 years ago

@Hao-Chalmers it would make a lot of sense to standardize (and validate) that the yml file has these fields. However, as @edkerk pointed out, maintaining compatibility with existing formats is tricky (1.ii), especially if the newly added fields are to be parsed by other tools as well.

To me, the easiest path forward is what @edkerk suggested above:

> current issue is just about the MATLAB structure and the yaml file

I would like to emphasize the different use cases for model.short_name and model.full_name. Here is how Metabolic Atlas uses these fields:

    "short_name": "Yeast-GEM",
    "full_name": "Consensus genome-scale metabolic model of Saccharomyces cerevisiae",
    "description": "Consensus genome-scale metabolic model of Saccharomyces cerevisiae. It is the continuation of the legacy project yeastnet",
    "version": "8.4.2",

Luckily, this GEM has a nice model.id, but it's just a coincidence that it is readable. The model.id could well have been yeastGEM_v8_4_2. Since it is an identifier, it will not be parsed into anything readable or worth presenting on a website.
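
The validation mentioned at the start of this comment could look roughly like the sketch below. The set of required keys is an assumption based on the Metabolic Atlas fields quoted above, not an agreed specification, and `missing_metadata_fields` is a hypothetical name.

```python
# Assumed required keys, taken from the Metabolic Atlas example above;
# the actual specification was still under discussion in this thread.
REQUIRED_FIELDS = ("short_name", "full_name", "description", "version")


def missing_metadata_fields(meta, required=REQUIRED_FIELDS):
    """Return the required keys that are absent or empty in a metadata dict."""
    return [key for key in required if not meta.get(key)]


meta = {
    "short_name": "Yeast-GEM",
    "full_name": "Consensus genome-scale metabolic model of Saccharomyces cerevisiae",
    "version": "8.4.2",
}
print(missing_metadata_fields(meta))  # ['description']
```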

haowang-bioinfo commented 3 years ago

@edkerk @mihai-sysbio I adjusted the Expected changes of this issue according to your input.

edkerk commented 3 years ago

writeYaml (5418e8814d0406259a7c2d526af1debcc6600a37) and the model fields definition (Wiki) are changed according to the discussion here, with the following exception:

Renaming model.description to model.name additionally required a small refactoring of 23 files (fe7d417d64a4d1734e0863901a0a0e39439bdb15). As this breaks backwards compatibility with models that have already been loaded in MATLAB, I suggest these changes result in release 2.5.0 instead of 2.4.4.