Notes in geneProductAssociation causes an error for BiGG but not SBML

kcorreia commented 5 years ago

I included links to protein complexes that carry out reactions. For example: http://identifiers.org/complexportal/CPX-1664

See below for the status from SBML/BiGG validation for files with and without notes for geneProductAssociation:

Snippet that causes problems:

<reaction metaid="R_PFK" id="R_PFK" name="Phosphofructokinase" reversible="false" fast="false" fbc:lowerFluxBound="cobra_default_zero" fbc:upperFluxBound="cobra_default_inf">
  <notes>
    <body xmlns="http://www.w3.org/1999/xhtml">
      <p>SUBSYSTEM: 01.1|central metabolism|glycolysis</p>
    </body>
  </notes>
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:vCard4="http://www.w3.org/2006/vcard/ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
      <rdf:Description rdf:about="#R_PFK">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="http://identifiers.org/ec-code/2.7.1.11"/>
            <rdf:li rdf:resource="http://identifiers.org/kegg.reaction/R00756"/>
            <rdf:li rdf:resource="http://identifiers.org/kegg.reaction/R04779"/>
            <rdf:li rdf:resource="http://identifiers.org/metacyc.reaction/6PFRUCTPHOS-RXN"/>
            <rdf:li rdf:resource="http://identifiers.org/rhea/140.51"/>
          </rdf:Bag>
        </bqbiol:is>
        <bqbiol:isDescribedBy>
          <rdf:Bag>
            <rdf:li rdf:resource="http://identifiers.org/pubmed/10091602"/>
            <rdf:li rdf:resource="http://identifiers.org/pubmed/11015725"/>
            <rdf:li rdf:resource="http://identifiers.org/pubmed/1387501"/>
            <rdf:li rdf:resource="http://identifiers.org/pubmed/15870456"/>
            <rdf:li rdf:resource="http://identifiers.org/pubmed/17522059"/>
            <rdf:li rdf:resource="http://identifiers.org/pubmed/6223622"/>
            <rdf:li rdf:resource="http://identifiers.org/pubmed/9392075"/>
          </rdf:Bag>
        </bqbiol:isDescribedBy>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
  <listOfReactants>
    <speciesReference species="M_atp_c" stoichiometry="1" constant="true"/>
    <speciesReference species="M_f6p_c" stoichiometry="1" constant="true"/>
  </listOfReactants>
  <listOfProducts>
    <speciesReference species="M_adp_c" stoichiometry="1" constant="true"/>
    <speciesReference species="M_fdp_c" stoichiometry="1" constant="true"/>
    <speciesReference species="M_h_c" stoichiometry="1" constant="true"/>
  </listOfProducts>
  <fbc:geneProductAssociation>
    <notes>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>http://identifiers.org/complexportal/CPX-554</p>
        <p>http://identifiers.org/complexportal/CPX-555</p>
      </body>
    </notes>
    <fbc:or>
      <fbc:geneProductRef fbc:geneProduct="FOG00277"/>
      <fbc:geneProductRef fbc:geneProduct="FOG00278"/>
      <fbc:geneProductRef fbc:geneProduct="FOG00279"/>
      <fbc:and>
        <fbc:geneProductRef fbc:geneProduct="FOG00278"/>
        <fbc:geneProductRef fbc:geneProduct="FOG00279"/>
      </fbc:and>
      <fbc:and>
        <fbc:geneProductRef fbc:geneProduct="FOG00278"/>
        <fbc:geneProductRef fbc:geneProduct="FOG00279"/>
        <fbc:geneProductRef fbc:geneProduct="FOG00281"/>
      </fbc:and>
    </fbc:or>
  </fbc:geneProductAssociation>
</reaction>

XML file with notes in geneProductAssociation:

[Two screenshots of the validation results, taken 2019-04-20; images not available]

XML file without notes in geneProductAssociation:

[Two screenshots of the validation results, taken 2019-04-20; images not available]
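
As a stopgap for checking a file against the BiGG validator, the notes inside geneProductAssociation can be stripped from a copy of the model. The following is only a minimal sketch using the Python standard library. The helper name strip_gpa_notes and the tiny example document are hypothetical, and the fbc version 2 namespace is an assumption. Note that this discards the Complex Portal links, so keep the original file.

```python
import xml.etree.ElementTree as ET

CORE_NS = "http://www.sbml.org/sbml/level3/version1/core"
# Assumption: the model uses fbc version 2.
FBC_NS = "http://www.sbml.org/sbml/level3/version1/fbc/version2"

# Keep readable prefixes when the tree is serialized again.
ET.register_namespace("", CORE_NS)
ET.register_namespace("fbc", FBC_NS)


def strip_gpa_notes(root):
    """Remove <notes> children of every fbc:geneProductAssociation.

    Hypothetical helper; returns the number of <notes> elements removed.
    """
    removed = 0
    for gpa in list(root.iter(f"{{{FBC_NS}}}geneProductAssociation")):
        for notes in gpa.findall(f"{{{CORE_NS}}}notes"):
            gpa.remove(notes)
            removed += 1
    return removed


# Tiny stand-in document modelled on the R_PFK snippet above.
EXAMPLE = f"""<sbml xmlns="{CORE_NS}" xmlns:fbc="{FBC_NS}" level="3" version="1" fbc:required="false">
  <model>
    <listOfReactions>
      <reaction id="R_PFK" reversible="false" fast="false">
        <fbc:geneProductAssociation>
          <notes>
            <body xmlns="http://www.w3.org/1999/xhtml">
              <p>http://identifiers.org/complexportal/CPX-554</p>
            </body>
          </notes>
          <fbc:geneProductRef fbc:geneProduct="FOG00277"/>
        </fbc:geneProductAssociation>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""

root = ET.fromstring(EXAMPLE)
removed_count = strip_gpa_notes(root)
cleaned_xml = ET.tostring(root, encoding="unicode")
```

For a real file, ET.parse(path) followed by tree.write(new_path) on the cleaned tree would be the equivalent; other namespaces used in the file (rdf, groups, ...) would need to be registered the same way to keep their prefixes.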

draeger commented 5 years ago

This does indeed look like a bug. However, subsystem information can be included in a much better and more standardized way. The following approach may work around the specific problem until the BiGG validator is fixed.

The groups package for SBML is useful to define collections of arbitrary model components. Most models in BiGG use it to define subsystems as groups of reactions. Here is an example from the e_coli_core model:

<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" fbc:required="false" groups:required="false" level="3" version="1" ... xmlns:fbc="http://www.sbml.org/sbml/level3/version1/fbc/version2" xmlns:groups="http://www.sbml.org/sbml/level3/version1/groups/version1">
  ...
  <model ...>
    <groups:listOfGroups xmlns:groups="http://www.sbml.org/sbml/level3/version1/groups/version1">
      <groups:group groups:id="g1" groups:kind="partonomy" groups:name="Pyruvate Metabolism" sboTerm="SBO:0000633">
        <groups:listOfMembers>
          <groups:member groups:idRef="R_ACALD" />
          <groups:member groups:idRef="R_ACKr" />
          ...
        </groups:listOfMembers>
      </groups:group>
      ...
    </groups:listOfGroups>
    <listOfReactions>
      <reaction id="R_ACALD" ... name="Acetaldehyde dehydrogenase (acetylating)"
      reversible="true" sboTerm="SBO:0000375">
        <annotation>
          <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
            <rdf:Description rdf:about="#R_ACALD">
              <bqbiol:is>
                <rdf:Bag>
                  <rdf:li rdf:resource="http://identifiers.org/bigg.reaction/ACALD" />
                  <rdf:li rdf:resource="http://identifiers.org/biocyc/META:ACETALD-DEHYDROG-RXN" />
                  ...
                </rdf:Bag>
              </bqbiol:is>
            </rdf:Description>
          </rdf:RDF>
        </annotation>
        ...
      </reaction>
      ...
    </listOfReactions>
  </model>
</sbml>

So, as you can see, there is a group with the ID g1 and the name Pyruvate Metabolism whose members refer to several reactions by ID. Instead of writing an unstructured note into the reaction element, it is sufficient to define such groups, which link to the reactions via their IDs.
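
To illustrate how such group information can be read by a program, here is a minimal sketch that lists the members of each group using only the Python standard library. The function name group_members is hypothetical, and the embedded document is a cut-down stand-in for the e_coli_core excerpt above, not the real model.

```python
import xml.etree.ElementTree as ET

GROUPS_NS = "http://www.sbml.org/sbml/level3/version1/groups/version1"


def group_members(root):
    """Map each group's name to the list of IDs its members reference."""
    q = f"{{{GROUPS_NS}}}"
    result = {}
    for group in root.iter(f"{q}group"):
        name = group.get(f"{q}name")
        result[name] = [m.get(f"{q}idRef") for m in group.iter(f"{q}member")]
    return result


# Cut-down stand-in for the e_coli_core excerpt above.
EXAMPLE = f"""<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core"
      xmlns:groups="{GROUPS_NS}" level="3" version="1" groups:required="false">
  <model>
    <groups:listOfGroups>
      <groups:group groups:id="g1" groups:kind="partonomy" groups:name="Pyruvate Metabolism">
        <groups:listOfMembers>
          <groups:member groups:idRef="R_ACALD"/>
          <groups:member groups:idRef="R_ACKr"/>
        </groups:listOfMembers>
      </groups:group>
    </groups:listOfGroups>
  </model>
</sbml>"""

members = group_members(ET.fromstring(EXAMPLE))
print(members)  # {'Pyruvate Metabolism': ['R_ACALD', 'R_ACKr']}
```

Because the membership is carried by structured attributes, no text parsing is involved.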

Please note that the notes element in SBML is intended for human-readable text that explains choices or other important aspects to users. It is not intended to store computer code or to be parsed algorithmically. In particular, nothing in the notes should be required for a model to compile or simulate. In contrast, annotation elements and the content of groups are meant for computer processing and are therefore the preferred way of storing such information.

I hope this helps.

matthiaskoenig commented 5 years ago

Just to comment on this: you can load legacy models with SUBSYSTEM notes in cobrapy and export them with groups information. This could perform the conversion you are interested in, i.e.:

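from cobra.io import read_sbml_model, write_sbml_model

# Load the legacy model whose reactions carry SUBSYSTEM notes
model = read_sbml_model("my_model_with_subsystems.xml")
# Write it back out; the export is expected to carry groups information
write_sbml_model(model, "my_model_with_groups.xml")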

Code not tested. Disclaimer: some of the notes information could get lost; please open an issue at https://github.com/opencobra/cobrapy/issues if you run into any problems.
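
Since some notes information could get lost, it may help to record which reactions carry a legacy SUBSYSTEM note before converting. The following is a minimal sketch using only the Python standard library, independent of cobrapy. The function name legacy_subsystems is hypothetical, and the "SUBSYSTEM:" prefix is assumed from the R_PFK snippet earlier in this thread.

```python
import xml.etree.ElementTree as ET

CORE_NS = "http://www.sbml.org/sbml/level3/version1/core"
XHTML_NS = "http://www.w3.org/1999/xhtml"
PREFIX = "SUBSYSTEM:"  # assumed from the R_PFK snippet in this thread


def legacy_subsystems(root):
    """Map reaction IDs to the text of their 'SUBSYSTEM:' note, if any."""
    found = {}
    for reaction in root.iter(f"{{{CORE_NS}}}reaction"):
        # Only the <notes> directly under the reaction is inspected.
        notes = reaction.find(f"{{{CORE_NS}}}notes")
        if notes is None:
            continue
        for p in notes.iter(f"{{{XHTML_NS}}}p"):
            text = (p.text or "").strip()
            if text.startswith(PREFIX):
                found[reaction.get("id")] = text[len(PREFIX):].strip()
    return found


# Tiny stand-in document modelled on the R_PFK snippet.
EXAMPLE = f"""<sbml xmlns="{CORE_NS}" level="3" version="1">
  <model>
    <listOfReactions>
      <reaction id="R_PFK" reversible="false" fast="false">
        <notes>
          <body xmlns="{XHTML_NS}">
            <p>SUBSYSTEM: 01.1|central metabolism|glycolysis</p>
          </body>
        </notes>
      </reaction>
      <reaction id="R_OTHER" reversible="true" fast="false"/>
    </listOfReactions>
  </model>
</sbml>"""

subsystems = legacy_subsystems(ET.fromstring(EXAMPLE))
print(subsystems)  # {'R_PFK': '01.1|central metabolism|glycolysis'}
```

Comparing this mapping with the groups in the exported file gives a quick check that no subsystem assignment went missing.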

SBRG / bigg_models

Notes in geneProductAssociation causes an error for BiGG but not SBML #332