Open draeger opened 8 years ago
How exactly will this work if SBO terms are to be unique per reaction?
At the moment there is no field where we can store the confidence scores. We either need parameters or something new. Parameters aren't appropriate, because we cannot refer to a reaction from them. Local parameters aren't suitable either because we would then need to create a kinetic law whose math element must not be empty. I am currently thinking about what to do with confidence scores.
Ah I see. That's still a TBD.
I think one approach would be to create an evidence type. So you could link to a paper, and classify the type of evidence it is. I'm not yet sure if that will work with every case though.
There is an SBML package for distributions, need to check if this can be helpful: http://sourceforge.net/p/sbml/code/HEAD/tree/trunk/specifications/sbml-level-3/version-1/distrib/sbml-level-3-distrib-package-proposal.pdf?format=raw
What parts in particular would be relevant? This seems to be about sampling from distributions, and I can't see how that's related.
Yes. I wanted to check if it also includes confidence scores, but I haven't seen them either. In conclusion, we probably need some additional field where we can put this.
I think this calls for a new "citations" or "evidence" package
Good idea! I'll collect all other missing fields and see what else is needed. I'll raise this point in the next SBML team meeting (tomorrow).
This was further discussed in thread opencobra/schema/issues/4, where @matthiaskoenig had the idea to use more specific terms from the Evidence Ontology (evidenceontology.org). We should check if we can make use of this here.
In my opinion an SBO term is the wrong way to do this. The evidence ontology ECO is absolutely sufficient to encode all the evidence today.
ECO is part of the MIRIAM registry collections: https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000055
It is used in multiple projects and allows encoding the evidence for projects like UniProt; see "Standardized description of scientific evidence using the Evidence Ontology (ECO)" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4105709/
In addition it is very easy to use and supported today by SBML and other standard formats like CellML. One just has to write the annotation and that is it. No need for any additional package.
To annotate the evidence for an SBML element just write the annotation for the evidence. For instance to say that a certain reaction/protein is based on "high throughput evidence used in automatic assertion (ECO:0006057)" just do:
<rdf:Description rdf:about="./BIOMD0000000176.xml#_525635">
  <bqbiol:isDescribedBy rdf:resource="http://identifiers.org/eco/ECO:0006057" />
</rdf:Description>
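As a rough sketch of how such an annotation could be generated programmatically, the snippet below builds the same RDF shape with Python's standard library. The function name `eco_annotation` is hypothetical, and the output mirrors the simplified form shown in this thread; SBML's standard annotation layout additionally wraps the resources in `rdf:Bag`/`rdf:li`, which is left out here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL_NS = "http://biomodels.net/biology-qualifiers/"


def eco_annotation(about, eco_ids):
    """Return an RDF string with one bqbiol:isDescribedBy per ECO term.

    Hypothetical helper: mirrors the simplified snippet in this thread,
    not the full SBML annotation layout (no rdf:Bag / rdf:li wrapper).
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("bqbiol", BQBIOL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about}
    )
    # Several evidence terms can be attached to the same element.
    for eco_id in eco_ids:
        ET.SubElement(
            description,
            f"{{{BQBIOL_NS}}}isDescribedBy",
            {f"{{{RDF_NS}}}resource": f"http://identifiers.org/eco/{eco_id}"},
        )
    return ET.tostring(rdf, encoding="unicode")


print(eco_annotation("#_525635", ["ECO:0006057"]))
```

Passing more than one ECO identifier yields one `bqbiol:isDescribedBy` element per term, which is how several pieces of evidence could be recorded for a single reaction.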
Please no new mechanisms if there are established working mechanisms to encode all the information, and in a much better way than evidence codes. By using composite annotations even the original datasets and publications for the evidence can be easily stored in the annotation for the SBML element.
Best Matthias
Here is an overview of the scores as these are usually defined in COBRA, where 0 is best and 4 is lowest confidence.
For the export from COBRA/BiGG models to SBML we only need to find the closest ECO terms for these five levels (0 to 4). For the other direction we also need to define a rule for matching terms between the two.
Here are some suggestions; please feel free to correct them. If these are not exact enough, additional terms should be added to ECO.
0 = Biochemical: Enzyme has been tested biochemically.
ECO:0000002: direct assay evidence
A type of experimental evidence resulting from the direct measurement of
some aspect of a biological feature.
Or a subclass of it to be more specific like e.g.,
ECO:0000005: enzyme assay evidence
http://evidenceontology.org/browse/#ECO_0000002
1 = Genetic: Gene overexpression and purification, gene deletions.
ECO:0000073: experimental genomic evidence
A type of experimental evidence that is based on the
characterization of an attribute of the genome underlying a gene product.
http://evidenceontology.org/browse/#ECO_0000073
2 = Sequence: There is significant sequence similarity to another gene with known function.
ECO:0000044: sequence similarity evidence
A type of similarity based on biomolecular sequence.
http://evidenceontology.org/browse/#ECO_0000044
3 = Physiological: There is physiological data to support inclusion in the model.
ECO:0005551: biological system reconstruction evidence by experimental evidence
A type of biological system reconstruction evidence that uses
experimental evidence as support.
4 = Modeling: Reaction is included to improve simulation results.
Personally I think this is problematic, because it states "there is no evidence". I think this level should simply carry no evidence code, which clearly indicates that the reaction was added without any evidence. I.e., if the score is 4 (modeling), then there is no evidence code. It just states "we added this so we get the results we want". Alternatively, something like:
ECO:0000001: inference from background scientific knowledge
A type of curator inference where conclusions are drawn
based on the background scientific knowledge of the curator.
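For the export direction, the suggested mapping could be captured in a small lookup table. This is only a sketch of the proposal above: the names `COBRA_SCORE_TO_ECO` and `eco_resource_for_score` are made up for illustration, and score 4 is mapped to no term, following the preference stated above (ECO:0000001 would be the alternative).

```python
# COBRA convention: 0 = best, 4 = lowest confidence.
# Terms are the ones suggested in this thread, not an agreed standard.
COBRA_SCORE_TO_ECO = {
    0: "ECO:0000002",  # direct assay evidence (ECO:0000005 is more specific)
    1: "ECO:0000073",  # experimental genomic evidence
    2: "ECO:0000044",  # sequence similarity evidence
    3: "ECO:0005551",  # biological system reconstruction evidence by experimental evidence
    4: None,           # modeling: deliberately no evidence code
}


def eco_resource_for_score(score):
    """Return the identifiers.org URI for a COBRA score, or None for score 4."""
    if score not in COBRA_SCORE_TO_ECO:
        raise ValueError(f"unknown COBRA confidence score: {score!r}")
    eco_id = COBRA_SCORE_TO_ECO[score]
    if eco_id is None:
        return None
    return f"http://identifiers.org/eco/{eco_id}"


print(eco_resource_for_score(0))
```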
And I forgot, about the rules: you just use the ontology tree to match the terms, i.e., everything below the respective term is matched to that term. If an evidence code in the SBML is not a descendant of one of the suggested ECO terms, then no match can be made.
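The matching rule could look roughly like the following: walk up from the annotated term until one of the suggested anchor terms is reached. The parent table here is a hand-written toy excerpt for illustration only, and it assumes a single parent per term, whereas terms in the real ontology can have several. A real implementation would load the is_a edges from an ECO release file.

```python
# Anchor terms suggested in this thread: ECO id -> COBRA score.
ANCHOR_SCORES = {
    "ECO:0000002": 0,  # direct assay evidence
    "ECO:0000073": 1,  # experimental genomic evidence
    "ECO:0000044": 2,  # sequence similarity evidence
    "ECO:0005551": 3,  # biological system reconstruction evidence by experimental evidence
}

# Hypothetical toy excerpt of is_a edges (child -> parent), NOT the full ontology.
TOY_PARENTS = {
    "ECO:0000005": "ECO:0000002",  # enzyme assay evidence -> direct assay evidence
    "ECO:0000002": "ECO:0000006",  # direct assay evidence -> experimental evidence
    "ECO:0000044": "ECO:0000041",  # sequence similarity evidence -> similarity evidence
}


def score_for_eco(eco_id, parents=TOY_PARENTS, anchors=ANCHOR_SCORES):
    """Return the COBRA score of the nearest anchor at or above eco_id, else None."""
    seen = set()
    current = eco_id
    while current is not None and current not in seen:
        if current in anchors:
            return anchors[current]
        seen.add(current)  # guard against accidental cycles in the table
        current = parents.get(current)
    return None  # not below any suggested term: no match


print(score_for_eco("ECO:0000005"))
```

A term that sits above the anchors, such as ECO:0000006, gets no score under this rule, which matches the "no match can be made" case described above.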
This looks like a very good start! Thanks @matthiaskoenig. We should also direct @tpfau to this suggestion.
Just to add to this: the big advantage of annotating via ECO is that it lets you store the confidence and, especially, the basis of that confidence, because one can add multiple evidence annotations. This is crucial for metabolic network reconstructions and one of the big shortcomings of the current confidence scores.
One wants to store for a reaction all the evidence which is there, not only the minimal common denominator. Example given: One has a reaction R1
Suddenly you have the collection of evidence and confidence for the reaction and not only a "0". Confidence scores are not something anybody should use in a reconstruction in 2018.
Supporting this is something @midnighter and @cdiener may also want to consider when improving the cobrapy parsers. Once this finds its way into Cobrapy.Model objects I'm very happy to start writing tests for this in memote.
Important to me is that one can directly link the ECO terms with links to the literature (DOI, PubmedID, etc). But if I understand @matthiaskoenig correctly, composite annotations would allow us to do this!
In general I think using ECO here is a very good idea. However, there are methods which rely on the 0-4 scheme used by BiGG, and we should offer some way to translate at least the top-level ECO terms to the 0-4 levels:
ECO:0000006 experimental evidence
ECO:0000041 similarity evidence
ECO:0000088 biological system reconstruction evidence
ECO:0000177 genomic context evidence
ECO:0000204 author statement
ECO:0000212 combinatorial evidence
ECO:0000311 imported information
ECO:0000352 evidence used in manual assertion
ECO:0000361 inferential evidence
ECO:0000501 evidence used in automatic assertion
ECO:0006055 high throughput evidence
Please note that COBRA's definition of the confidence scores (0 = best, 4 = lowest confidence) is the inverse of the definition in Ines Thiele and Bernhard Ø. Palsson's "A protocol for generating a high-quality genome-scale metabolic reconstruction", where 4 is the best and 0 the lowest confidence score (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3125167/table/T2/?report=objectonly).
Hence, using ECO numbers instead of scores from 0 to 4 might help to avoid confusion.
Request a new SBO term to be used for confidence scores.