SBML model does not load with reaction or metabolite annotations

HettieC commented 2 years ago

using COBREXA
download("http://bigg.ucsd.edu/static/models/e_coli_core.xml", "e_coli_core.xml")
model = load_model(StandardModel,"e_coli_core.xml");
model.reactions["R_ACONTb"].annotations

Expected result: list of the below reaction annotations, which are present in the JSON model. The SBML model passes the memote.io metabolite annotation test so presumably the SBML version does contain the same annotations as the JSON version of the model.

bigg.reaction: ["ACONTb"]
metanetx.reaction: ["MNXR95387"]
rhea: 22145, ..., 22147
sbo: ["SBO:0000176"]
seed.reaction: ["rxn01388"]
kegg.reaction: ["R01900"]
ec-code: ["4.2.1.3"]

Actual behavior: an empty dictionary is produced when model.reactions["R_ACONTb"].annotations is called:

Dict{String, Vector{String}}()Dict{String, Vector{String}}()

Note: the same issue occurs with metabolites.

exaexa commented 2 years ago

Hi!

Short answer: SBML annotations are too complicated.

Long answer:

The model annotations in SBML can contain basically any valid XML (the standard doesn't really care although in L3V1 and L3V2 you should use MIRIAM, as below). The annotations are present if you load the thing without converting to StandardModel:

julia> m=load_model("test/downloaded/e_coli_core.xml");

julia> println(m.sbml.reactions["R_ACONTb"].annotation)
<sbml:annotation xmlns:sbml="http://www.sbml.org/sbml/level3/version1/core">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="#R_ACONTb">
      <bqbiol:is xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
        <rdf:Bag>
          <rdf:li rdf:resource="http://identifiers.org/bigg.reaction/ACONTb"/>
          <rdf:li rdf:resource="http://identifiers.org/ec-code/4.2.1.3"/>
          <rdf:li rdf:resource="http://identifiers.org/kegg.reaction/R01900"/>
          <rdf:li rdf:resource="http://identifiers.org/metanetx.reaction/MNXR95387"/>
          <rdf:li rdf:resource="http://identifiers.org/rhea/22145"/>
          <rdf:li rdf:resource="http://identifiers.org/rhea/22144"/>
          <rdf:li rdf:resource="http://identifiers.org/rhea/22146"/>
          <rdf:li rdf:resource="http://identifiers.org/rhea/22147"/>
          <rdf:li rdf:resource="http://identifiers.org/seed.reaction/rxn01388"/>
        </rdf:Bag>
      </bqbiol:is>
    </rdf:Description>
  </rdf:RDF>
</sbml:annotation>

Parsing of the identifiers into the "dictionary" format of StandardModel would be more or less impossible -- the identifiers are the interoperable IRIs, and do not possess a dictionary representation. You can access them (see e.g. http://identifiers.org/kegg.reaction/R01900 ) and download the IDs yourself, or just regex-match the IRI prefixes and IDs from the IRIs (which is a totally YOLO solution but quite reliable since most of the identifier database services are relatively sane :D).

The richness of annotations is needed for the annotations to stay reproducible and interoperable (and this is basically the only widely accepted way to implement FAIR data), while still describing complicated stuff from the nature. (Example: In the history there have already been more databases called RHEA (and there will be other!), and if you are a computer, it's not at all obvious which one the "rhea" in the JSON annotation actually refers to. Example 2 (with a chemical twist): Is the metabolite 13dpg_c this https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:57604 or this https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:20189 ? )

The richness of the annotations also brings various additional dangers, for example we can't just blindly regex match the URLs in the annotation text because the description could also say something completely else about the referred entities (see here http://biomodels.net/biology-qualifiers/ -- typically is may be something completely different than isPartOf and hasPart, which would be able to solve the example 2 from above).

@stelmo any chance StandardModel can support IRIs and bqbiol: qualifiers?

Recommendations:

Since SBML simply doesn't support the "easy" non-interoperable "identifiers", you'll probably want to use JSON for now.
A week ago I've been actually trying to find a good way to parse some structured information out of SBML annotations. If something materializes (I guess we could do a MIRIAM parser quite soon), we'll add it to SBML.jl and eventually to COBREXA.jl.

HettieC commented 2 years ago

OK thanks, will stick to JSON!

exaexa commented 2 years ago

I'll try moving this to SBML.jl because that's basically where this should be solved.

exaexa commented 2 years ago

So actually--

the "annotations" from JSON models seem to be more like SBOTerms and CVTerms in SBML. In fact, CobraPy saves these precisely as SBO and CV terms: https://github.com/opencobra/cobrapy/blob/devel/src/cobra/io/sbml.py#L1881

This therefore revives ye olde issue #27 :D

exaexa commented 2 years ago

@HettieC in #235 we're parsing out actual URIs of the linked identifiers (from identifiers.org), you will be able to rely on these when 1.3 is released (hopefully soon). One side issue that's left is that you'll need to interpret the URIs yourself to get the cobrapy-like form that you can see in JSONs, but that should be super easy.

Example output with #235:

julia> using SBML

julia> m = readSBML("work/COBREXA.jl/test/downloaded/e_coli_core.xml")
SBML.Model with 95 reactions, 72 species, and 5 parameters.

julia> m.reactions["R_ACONTb"].cv_terms[1]
SBML.CVTerm(
    :is,                 # <-- important part to check
    nothing,        # <-- there might be a different kind of qualifier here (likely also :is)
    [
        "http://identifiers.org/bigg.reaction/ACONTb",
        "http://identifiers.org/ec-code/4.2.1.3",
        "http://identifiers.org/kegg.reaction/R01900",
        "http://identifiers.org/metanetx.reaction/MNXR95387",
        "http://identifiers.org/rhea/22145",
        "http://identifiers.org/rhea/22144",
        "http://identifiers.org/rhea/22146",
        "http://identifiers.org/rhea/22147",
        "http://identifiers.org/seed.reaction/rxn01388",
    ],
    SBML.CVTerm[],
)

Code to convert stuff to JSON-like annotations:

julia> [
           =>(match(r"https?:\/\/identifiers\.org\/([^/]+)\/([^/]+)", uri).captures...) for
           c = m.reactions["R_ACONTb"].cv_terms if c.biological_qualifier == :is for
           uri in c.resource_uris
       ]

9-element Vector{Pair{SubString{String}, SubString{String}}}:
     "bigg.reaction" => "ACONTb"
           "ec-code" => "4.2.1.3"
     "kegg.reaction" => "R01900"
 "metanetx.reaction" => "MNXR95387"
              "rhea" => "22145"
              "rhea" => "22144"
              "rhea" => "22146"
              "rhea" => "22147"
     "seed.reaction" => "rxn01388"

(there's likely straightforward way there to make it a dictionary-of-sets as with json, but I'm not sure now)

HettieC commented 2 years ago

Cool, thank you

LCSB-BioCore / SBML.jl

SBML model does not load with reaction or metabolite annotations #208