feat: addition of `metadata` section to the yaml file specification in RAVEN

haowang-bioinfo commented 4 years ago

Description of the issue:

This issue proposes adding a metadata section to the yaml file specification in RAVEN.
- Previously, a metadata section was introduced in the tailored yaml file in Human-GEM to meet the requirements of MetabolicAtlas, as detailed in issue #71. After continued development, this section works well in providing relevant information for GEM-type repositories (e.g. Human-GEM), the GEM archive MetabolicAtlas, and the research community.

Expected changes:

Adjust the RAVEN model spec with the following changes:
- Add new field version
- ~Change field from description to fullName~
- Modify subfields of annotation field
  - adding subfield sourceUrl
  - combining givenName and familyName into authors
  - ~changing subfield from note to description~
Adapt the writeYaml function to export metadata from the fields id, ~fullName~name, version and annotation
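
As a rough illustration of the export described above, here is a minimal Python sketch (not RAVEN's MATLAB writeYaml). The model is assumed to be a plain dict, `metadata_lines` is a hypothetical helper name, and the flat layout of the annotation subfields is an assumption modelled on the Human-GEM yaml file.

```python
import json


def metadata_lines(model):
    """Return the lines of a YAML metaData section for a model dict."""
    # Top-level fields named in the expected changes: id, name, version.
    entries = [(key, model.get(key)) for key in ("id", "name", "version")]
    # Assumption: annotation subfields (sourceUrl, authors, ...) are written flat.
    entries += list(model.get("annotation", {}).items())
    lines = ["- metaData:"]
    for key, value in entries:
        if value is None or value == "":
            continue  # skip fields that are missing or empty
        # json.dumps gives a double-quoted string, which YAML also accepts.
        lines.append(f"    {key}: {json.dumps(str(value), ensure_ascii=False)}")
    return lines


example_model = {
    "id": "Human-GEM",
    "name": "Generic genome-scale metabolic model of Homo sapiens",
    "version": "1.4.0",
    "annotation": {
        "sourceUrl": "https://github.com/SysBioChalmers/Human-GEM",
        "authors": "Jonathan Robinson, Hao Wang",
    },
}
print("\n".join(metadata_lines(example_model)))
```

Empty or missing fields are skipped here so that a model without a version still produces a valid section; whether writeYaml should behave that way is a design choice left open in this issue.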

I hereby confirm that I have:

[X] Followed the guidelines to install RAVEN.
[X] Checked that a similar issue does not already exist

BenjaSanchez commented 4 years ago

For additional context, below is the current metaData field in Human-GEM:

- metaData:
    short_name: "Human-GEM"
    full_name: "Generic genome-scale metabolic model of Homo sapiens"
    version: "1.4.0"
    date: "2020-06-12"
    authors: "Jonathan Robinson, Hao Wang, Pierre-Etienne Cholley, Pinar Kocabas"
    email: "jonrob@chalmers.se"
    organization: "Chalmers University of Technology"
    taxonomy: "9606"
    github: "https://github.com/SysBioChalmers/Human-GEM"
    description: "Genome-scale metabolic models are valuable tools to study metabolism and provide a scaffold for the integrative analysis of omics data. This is the latest version of Human-GEM, which is a genome-scale metabolic model of a generic human cell. The objective of Human-GEM is to serve as a community model for enabling integrative and mechanistic studies of human metabolism."

The new fields + modifications sound good to me. Additionally, it would be ideal if the field names in the yaml file matched the RAVEN spec names, for clarity. Below are the cases that don't match, based on what is already in RAVEN + the new names @Hao-Chalmers proposed:

| Field | Name in RAVEN | Name in `HumanGEM.yml` |
| --- | --- | --- |
| Model id | `id` | `short_name` |
| Model name | `description` | `full_name` |
| Authors | `annotation.authorList` | `authors` |
| URL where the model lives | `annotation.sourceUrl` | `github` |
| Additional comments | `annotation.note` | `description` |
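
The mismatches in the table can be expressed as a lookup, sketched here in Python under some assumptions: the metaData section is a flat dict, `to_raven` is a hypothetical helper, and keys not listed in the table (such as `version`) are assumed to keep their name at the top level.

```python
# Mapping taken from the table above: HumanGEM.yml key -> RAVEN field.
YML_TO_RAVEN = {
    "short_name": "id",
    "full_name": "description",
    "authors": "annotation.authorList",
    "github": "annotation.sourceUrl",
    "description": "annotation.note",
}


def to_raven(meta):
    """Convert a flat metaData dict into a nested RAVEN-style dict."""
    model = {}
    for key, value in meta.items():
        # Assumption: keys without a listed mismatch keep their name.
        target = YML_TO_RAVEN.get(key, key)
        parent, _, child = target.rpartition(".")
        if parent:
            # Dotted targets become subfields, e.g. annotation.note.
            model.setdefault(parent, {})[child] = value
        else:
            model[child] = value
    return model
```

The lookup makes the naming clash visible: the yaml key `description` ends up in `annotation.note`, while the RAVEN field `description` is filled from `full_name`.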

IMO the RAVEN names for id and URL would be preferable, as the former is the main choice in the COBRA community (Matlab and Python), and the latter is more generic, as not all RAVEN models are stored in Github. Could those 2 fields change in HumanGEM.yml to id and source_url? @JonathanRob @mihai-sysbio

On the other hand, the .yml standard seems more adequate for model name, authors and comments (actually it's super confusing that the RAVEN field description is the model name and the field note contains a description). Would it make sense to change those 3 fields in RAVEN to fullName, annotation.authors and annotation.description?

edkerk commented 4 years ago

Are there corresponding (or comparable) COBRA fields for fullName, annotation.authors and annotation.description?

mihai-sysbio commented 4 years ago

Here are the latest yml fields listed on COBRApy's devel branch. Imho, it doesn't look like a direct mapping of the RAVEN fields. Cobratoolbox has some rules for modelVersion, modelName and modelID.

The short-name is something meant to be as human-friendly as possible. For example, this field is what is shown in the navigation bar on Metabolic Atlas. I found this opencobra thread illustrative of the implications of the BiGG model id spec. Also, I would like to point out the distinct fields for short-name and version. To me, it is of little importance what the keyword for the value of short-name is. However, I am an advocate for its role: human-friendliness. Therefore, I would lean towards keeping this field closer to the standard-GEM naming rather than the BiGG id spec. Needless to say, in the case of versioned models, it is expected of this short-name to be the same as the repository name.

I support changing github to something else. A potential drawback of source_url is that, as a new person, I could find it confusing whether it is meant to be the link to the repository, or directly to the file itself on a model hosting platform. But maybe that's just me - and I can't come up with a better suggestion than source_url.

haowang-bioinfo commented 4 years ago

@BenjaSanchez the Expected changes of this issue have been updated as you recommended.

haowang-bioinfo commented 4 years ago

@edkerk according to the latest model spec in COBRA, the following four fields could be associated between RAVEN and COBRA.

| Field | Name in RAVEN | Name in COBRA |
| --- | --- | --- |
| Model id | `id` | `modelID` |
| Model name | `name` | `modelName` |
| Model version | `version` | `modelVersion` |
| Additional comments | `annotation.note` | `description` |

mihai-sysbio commented 4 years ago

@Hao-Chalmers would the Expected changes also include something about the shortName field?

haowang-bioinfo commented 4 years ago

@mihai-sysbio I don't think an additional shortName field is needed, since it is equivalent to the existing id field. Or are you suggesting renaming the field from id to shortName?

mihai-sysbio commented 4 years ago

I see. To me, an ID does not have to be human friendly, unlike shortName. I think it would be clearer if some examples were provided, maybe even both "good" and "bad". For example, a "bad" id would be h_sap13417__1_3_0, standing for Homo Sapiens model with 13417 reactions and corresponding to version 1.3.0.

haowang-bioinfo commented 4 years ago

@mihai-sysbio good point about providing examples; both can be added to the spec in the Wiki once a consensus is reached.

edkerk commented 3 years ago

So should HumanGEM's writeHumanYaml be integrated in RAVEN's writeYaml, thereby capturing this metadata?

haowang-bioinfo commented 3 years ago

> So should HumanGEM's writeHumanYaml be integrated in RAVEN's writeYaml, thereby capturing this metadata?

@edkerk full support

edkerk commented 3 years ago

It is not sufficient to just define fields in the RAVEN model structure, and support export to YML file format. SBML is still the de facto standard for model distribution, so these fields should also be properly stored there.

Related to this there are some unresolved issues:

If we introduce version, where is this stored in the SBML file? As far as I can find, this is not covered by the SBML specification. I see two options:
1. The version number can be appended to the model id, e.g. yeastGEM_v8_4_2. A benefit is that this is also loaded when using cobrapy or COBRA toolbox. However, would we then split the model id from SBML into two parts: (1) model.id and (2) model.version? In that case the model would have different model ids in RAVEN compared to cobrapy, COBRA etc. To avoid problems, I would prefer not to run regexprep on any identifier.
2. Include version number in the SBML as model annotation, in a similar way as taxonomy, authors, organization etc. are included. See example below. However, I don't know what tags to use, something related to <rdf>? Does standard-GEM have a role to play in this?
  
  Example from iYali, model annotation given from line 4.

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" xmlns:fbc="http://www.sbml.org/sbml/level3/version1/fbc/version2" xmlns:groups="http://www.sbml.org/sbml/level3/version1/groups/version1" level="3" version="1" fbc:required="false" groups:required="false">
  <model metaid="iYali" id="iYali" name="iYali" fbc:strict="true">
    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:vCard4="http://www.w3.org/2006/vcard/ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
        <rdf:Description rdf:about="#iYali">
          <dcterms:creator>
            <rdf:Bag>
              <rdf:li rdf:parseType="Resource">
                <vCard:N rdf:parseType="Resource">
                  <vCard:Family>Kerkhoven</vCard:Family>
                  <vCard:Given>Eduard</vCard:Given>
                </vCard:N>
                <vCard:EMAIL>eduardk@chalmers.se</vCard:EMAIL>
                <vCard:ORG rdf:parseType="Resource">
                  <vCard:Orgname>Chalmers University of Technology</vCard:Orgname>
                </vCard:ORG>
              </rdf:li>
            </rdf:Bag>
          </dcterms:creator>
          <dcterms:created rdf:parseType="Resource">
            <dcterms:W3CDTF>2021-04-05T10:27:05Z</dcterms:W3CDTF>
          </dcterms:created>
          <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2021-04-05T10:27:05Z</dcterms:W3CDTF>
          </dcterms:modified>
          <bqbiol:is>
            <rdf:Bag>
              <rdf:li rdf:resource="https://identifiers.org/taxonomy/4952"/>
            </rdf:Bag>
          </bqbiol:is>
        </rdf:Description>
      </rdf:RDF>
    </annotation>
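
To make the first option concrete, here is a hedged Python sketch of what splitting a versioned SBML id would involve. The `<id>_v<major>_<minor>_<patch>` convention and the `split_versioned_id` helper are assumptions for illustration only; nothing in RAVEN or the SBML specification defines them.

```python
import re

# Hypothetical convention: "<id>_v<number>_<number>_...".
_VERSIONED_ID = re.compile(r"^(?P<id>.+)_v(?P<version>\d+(?:_\d+)*)$")


def split_versioned_id(sbml_id):
    """Return (id, version); version is None when no version suffix is found."""
    match = _VERSIONED_ID.match(sbml_id)
    if match is None:
        return sbml_id, None
    # Underscores in the suffix are turned back into dots.
    return match.group("id"), match.group("version").replace("_", ".")


print(split_versioned_id("yeastGEM_v8_4_2"))
print(split_versioned_id("iYali"))
```

Any id that does not follow the assumed suffix convention, such as h_sap13417__1_3_0, comes back unchanged with no version, which illustrates why pattern matching on identifiers is fragile.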

If id is used instead of shortName [and I would argue we should, as id is similar to modelID, model.id and <model id=""> as used in COBRA, cobrapy and SBML], then why use fullName and not just name? The latter is also more in line with other software and the SBML specification.
In humanGEM.yml, date is also specified. Should this be part of the RAVEN model structure? And what does this date reflect: when a new version was released? RAVEN-generated SBML already includes the date that the file was created, but that's probably not what is meant here. Instead, should the date be set when the new version number is set, and be absent if no version number is present?
Where should sourceUrl be stored in the SBML? Also in annotation, as the second suggestion for version?
Note that description is not problematic to store in the SBML, it is actually stored under <notes>. With that in mind, why change note to description? cobrapy has model.notes, and it is closer to the SBML specification.

haowang-bioinfo commented 3 years ago

@edkerk good arguments indeed.

@mihai-sysbio what do you think, if standard-GEM can help in adopting some fields?

edkerk commented 3 years ago

On second thought, perhaps it is better to move the discussion about incorporation in SBML into a separate issue, as the current issue is just about the MATLAB structure and the yaml file. The points that remain relevant are:

Have a model.name field instead of model.fullName.
Have a model.annotation.note field instead of model.annotation.description.

mihai-sysbio commented 3 years ago

@Hao-Chalmers it would make a lot of sense to standardize (and validate) that the yml file has these fields. However, as @edkerk pointed out, maintaining compatibility with existing formats is tricky (1.ii), especially if the newly added fields are to be parsed by other tools as well.

To me, the easiest path forward is what @edkerk suggested above:

> current issue is just about the MATLAB structure and the yaml file

I would like to emphasize the different use cases for model.short_name and model.full_name. Here is how Metabolic Atlas uses these fields:

    "short_name": "Yeast-GEM",
    "full_name": "Consensus genome-scale metabolic model of Saccharomyces cerevisiae",
    "description": "Consensus genome-scale metabolic model of Saccharomyces cerevisiae. It is the continuation of the legacy project yeastnet",
    "version": "8.4.2",

Luckily, this GEM has a nice model.id, but it's just a coincidence that it is readable. The model.id could well have been yeastGEM_v8_4_2. Since it is an identifier, it will not be parsed into anything readable or worth presenting on a website.

haowang-bioinfo commented 3 years ago

@edkerk @mihai-sysbio I adjusted the Expected changes of this issue according to your input.

edkerk commented 3 years ago

writeYaml (5418e8814d0406259a7c2d526af1debcc6600a37) and the model fields definition (Wiki) are changed according to the discussion here, with the following exception:

givenName and familyName remain as (non-mandatory) fields, while authors is an additional (non-mandatory) field. This is to ensure backwards compatibility, as givenName and familyName are actually coded in the SBML, and authors is not, while their meaning is not identical (givenName and familyName would match organization and email, while for authors this is ambiguous).
Other subfields of model.annotation (defaultLB, defaultUB) are also included as metaData in the yaml file.
By default, writeYaml no longer sorts the identifiers (it used to do this, while writeHumanYaml doesn't; it is probably best to keep the identifier order by default).
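
To illustrate why givenName and familyName are recoverable from SBML while a free-text authors string is not, here is a small Python sketch that reads the creator entries from a trimmed copy of the iYali annotation shown earlier in this thread. The `read_creators` helper and the output key names are assumptions for illustration, not RAVEN code.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "vCard": "http://www.w3.org/2001/vcard-rdf/3.0#",
}

# Trimmed copy of the iYali model annotation quoted earlier in the thread.
ANNOTATION = """
<annotation>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dcterms="http://purl.org/dc/terms/"
           xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#">
    <rdf:Description rdf:about="#iYali">
      <dcterms:creator>
        <rdf:Bag>
          <rdf:li rdf:parseType="Resource">
            <vCard:N rdf:parseType="Resource">
              <vCard:Family>Kerkhoven</vCard:Family>
              <vCard:Given>Eduard</vCard:Given>
            </vCard:N>
            <vCard:EMAIL>eduardk@chalmers.se</vCard:EMAIL>
          </rdf:li>
        </rdf:Bag>
      </dcterms:creator>
    </rdf:Description>
  </rdf:RDF>
</annotation>
""".strip()


def read_creators(annotation_xml):
    """Return one dict per creator listed in the model annotation."""
    root = ET.fromstring(annotation_xml)
    creators = []
    for li in root.findall(".//dcterms:creator/rdf:Bag/rdf:li", NS):
        creators.append({
            "givenName": li.findtext("vCard:N/vCard:Given", default="", namespaces=NS),
            "familyName": li.findtext("vCard:N/vCard:Family", default="", namespaces=NS),
            "email": li.findtext("vCard:EMAIL", default="", namespaces=NS),
        })
    return creators


print(read_creators(ANNOTATION))
```

Each creator entry keeps the name and email together, which is the pairing that a single comma-separated authors string cannot express.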

Renaming model.description to model.name additionally required small refactoring of 23 files (fe7d417d64a4d1734e0863901a0a0e39439bdb15). As this breaks backwards compatibility with models that would already have been loaded in MATLAB, I suggest these changes result in release 2.5.0 instead of 2.4.4.
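
For models saved before the rename, a one-off migration along the following lines could bridge the gap. This is only a sketch in Python with a dict standing in for the MATLAB model structure; `migrate_model_fields` is a hypothetical helper and not part of RAVEN.

```python
def migrate_model_fields(model):
    """Return a copy in which the legacy 'description' field is renamed to 'name'."""
    migrated = dict(model)  # shallow copy, the input is left untouched
    # Only rename when the new field is not already present.
    if "description" in migrated and "name" not in migrated:
        migrated["name"] = migrated.pop("description")
    return migrated


old_model = {"id": "iYali", "description": "iYali"}
print(migrate_model_fields(old_model))
```

A model that already has a name field is returned unchanged, so running the migration twice is harmless.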

SysBioChalmers / RAVEN

feat: addition of `metadata` section to the yaml file specification in RAVEN #311
