SBO term for confidence score

draeger commented 8 years ago

Request a new SBO term to be used for confidence scores.

aebrahim commented 8 years ago

How exactly will this work if SBO terms are to be unique per reaction?

draeger commented 8 years ago

At the moment there is no field where we can store the confidence scores. We either need parameters or something new. Parameters aren't appropriate, because we cannot refer to a reaction from them. Local parameters aren't suitable either because we would then need to create a kinetic law whose math element must not be empty. I am currently thinking about what to do with confidence scores.

aebrahim commented 8 years ago

Ah I see. That's still a TBD.

I think one approach would be to create an evidence type. So you could link to a paper, and classify the type of evidence it is. I'm not yet sure if that will work with every case though.

draeger commented 8 years ago

There is an SBML package for distributions, need to check if this can be helpful: http://sourceforge.net/p/sbml/code/HEAD/tree/trunk/specifications/sbml-level-3/version-1/distrib/sbml-level-3-distrib-package-proposal.pdf?format=raw

aebrahim commented 8 years ago

What parts in particular would be relevant? This seems to be about sampling from distributions, and I can't see how that's related.

draeger commented 8 years ago

Yes. I wanted to check if it also includes confidence scores, but haven't seen it either. Conclusion, we probably need some additional field where we can put this.

aebrahim commented 8 years ago

I think this calls for a new "citations" or "evidence" package

draeger commented 8 years ago

Good idea! I'll collect all other missing fields and see what else is needed. I'll raise this point in the next SBML team meeting (tomorrow).

draeger commented 6 years ago

This was further discussed in thread opencobra/schema/issues/4, where @matthiaskoenig suggested using more specific terms from evidenceontology.org. We should check if we can make use of this here.

matthiaskoenig commented 6 years ago

In my opinion an SBO term is the wrong way to do this. The evidence ontology ECO is absolutely sufficient to encode all the evidence today.

It is part of the MIRIAM registry collections https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000055

It is used in multiple projects and allows encoding the evidence in resources like UniProt; see "Standardized description of scientific evidence using the Evidence Ontology (ECO)" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4105709/

In addition it is very easy to use and supported today by SBML and other standard formats like CellML. One just has to write the annotation and that is it. No need for any additional package.

To annotate the evidence for an SBML element just write the annotation for the evidence. For instance to say that a certain reaction/protein is based on "high throughput evidence used in automatic assertion (ECO:0006057)" just do:

 <rdf:Description rdf:about="./BIOMD0000000176.xml#_525635">
    <bqbiol:isDescribedBy rdf:resource="http://identifiers.org/eco/ECO:0006057" />
 </rdf:Description>
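
For illustration, an annotation of this shape could also be generated programmatically. The following is only a minimal sketch using Python's standard library; the helper name `eco_description` is hypothetical and not part of any SBML tool. It mirrors the flat form shown in this comment, whereas complete SBML annotations usually nest the resources inside `rdf:Bag`/`rdf:li` elements.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL_NS = "http://biomodels.net/biology-qualifiers/"

# Register prefixes so the output uses rdf: and bqbiol: instead of ns0:, ns1:.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("bqbiol", BQBIOL_NS)


def eco_description(about, eco_ids):
    """Hypothetical helper: build an rdf:Description with one
    bqbiol:isDescribedBy child per ECO identifier."""
    desc = ET.Element(f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about})
    for eco_id in eco_ids:
        ET.SubElement(
            desc,
            f"{{{BQBIOL_NS}}}isDescribedBy",
            {f"{{{RDF_NS}}}resource": f"http://identifiers.org/eco/{eco_id}"},
        )
    return desc


# Same element and ECO term as in the example above.
element = eco_description("./BIOMD0000000176.xml#_525635", ["ECO:0006057"])
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Passing several identifiers in `eco_ids` yields one `bqbiol:isDescribedBy` child per term, which is how multiple pieces of evidence could be attached to the same element.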

Please no new mechanisms if there are established working mechanisms to encode all the information, and in a much better way than evidence codes. By using composite annotations even the original datasets and publications for the evidence can be easily stored in the annotation for the SBML element.

Best Matthias

draeger commented 6 years ago

Here is an overview of the scores as these are usually defined in COBRA, where 0 is best and 4 is lowest confidence.

0 = Biochemical: Enzyme has been tested biochemically.
1 = Genetic: Gene overexpression and purification, gene deletions.
2 = Sequence: There is significant sequence similarity to another gene with known function.
3 = Physiological: There is physiological data to support inclusion in the model.
4 = Modeling: Reaction is included to improve simulation results

For the export from COBRA/BiGG models to SBML, we only need to find the closest ECO term for each of these five levels. For the other direction, we also need to define a rule for matching ECO terms back to the levels.

matthiaskoenig commented 6 years ago

Here are some suggestions; please feel free to correct them. If these are not exact enough, additional terms should be added to ECO.

0 = Biochemical: Enzyme has been tested biochemically.

ECO:0000002: direct assay evidence
A type of experimental evidence resulting from the direct measurement of 
some aspect of a biological feature.

Or, to be more specific, a subclass of it, e.g. ECO:0000005: enzyme assay evidence http://evidenceontology.org/browse/#ECO_0000002

1 = Genetic: Gene overexpression and purification, gene deletions.

ECO:0000073: experimental genomic evidence
A type of experimental evidence that is based on the 
characterization of an attribute of the genome underlying a gene product.

http://evidenceontology.org/browse/#ECO_0000073

2 = Sequence: There is significant sequence similarity to another gene with known function.

ECO:0000044: sequence similarity evidence 
A type of similarity based on biomolecular sequence.

http://evidenceontology.org/browse/#ECO_0000044

3 = Physiological: There is physiological data to support inclusion in the model.

ECO:0005551: biological system reconstruction evidence by experimental evidence
A type of biological system reconstruction evidence that uses 
experimental evidence as support.

4 = Modeling: Reaction is included to improve simulation results

Personally, I think this is problematic, because it effectively states "there is no evidence". I think such a reaction should simply not have an evidence code, which clearly indicates that it was added without any evidence. That is, if there is no evidence (i.e., 4 = Modeling), then there is no evidence code; it just states "we added this so we get the results we want". Alternatively, something like:

ECO:0000001: inference from background scientific knowledge
A type of curator inference where conclusions are drawn 
based on the background scientific knowledge of the curator.
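
The suggested export mapping could be written down as a small lookup table. This is only a sketch of the proposal in this comment, using the COBRA numbering quoted above (0 = best); the names `SCORE_TO_ECO` and `eco_for_score` and the `annotate_modeling` switch are hypothetical.

```python
# Suggested ECO term per COBRA confidence score (0 = best, 4 = lowest).
SCORE_TO_ECO = {
    0: "ECO:0000002",  # direct assay evidence
    1: "ECO:0000073",  # experimental genomic evidence
    2: "ECO:0000044",  # sequence similarity evidence
    3: "ECO:0005551",  # biological system reconstruction evidence by experimental evidence
    4: None,           # Modeling: preferred suggestion is no evidence code at all
}

# Alternative suggested for score 4 if an annotation is wanted anyway.
MODELING_FALLBACK = "ECO:0000001"  # inference from background scientific knowledge


def eco_for_score(score, annotate_modeling=False):
    """Return the suggested ECO term for a score, or None for 'no evidence code'."""
    if score not in SCORE_TO_ECO:
        raise ValueError(f"confidence score must be 0-4, got {score!r}")
    if score == 4 and annotate_modeling:
        return MODELING_FALLBACK
    return SCORE_TO_ECO[score]
```

An exporter would then skip the annotation entirely whenever `eco_for_score` returns `None`.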

matthiaskoenig commented 6 years ago

I forgot to mention the rules: you just use the ontology tree to match the terms, i.e. everything below one of the suggested terms is matched to that term. If an evidence code in the SBML is not a descendant of any of the suggested ECO terms, then no match can be made.
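
A minimal sketch of this matching rule, assuming the ontology is available as a child-to-parents map: walk upwards from the annotated term until one of the suggested terms is reached. The `TOY_PARENTS` map below is a hand-written stand-in containing only the one subclass relation mentioned in this thread; a real implementation would load the full ECO hierarchy instead. All names are hypothetical.

```python
from collections import deque

# Suggested anchor terms and their COBRA scores (0 = best).
ANCHOR_SCORES = {
    "ECO:0000002": 0,  # direct assay evidence
    "ECO:0000073": 1,  # experimental genomic evidence
    "ECO:0000044": 2,  # sequence similarity evidence
    "ECO:0005551": 3,  # biological system reconstruction evidence by experimental evidence
}

# Toy child -> parents map for illustration only, NOT the real ECO tree.
TOY_PARENTS = {
    "ECO:0000005": ["ECO:0000002"],  # enzyme assay evidence is below direct assay evidence
}


def score_for_eco(term, parents=TOY_PARENTS, anchors=ANCHOR_SCORES):
    """Return the score of the nearest anchor at or above `term`, else None."""
    queue = deque([term])
    seen = set()
    while queue:
        current = queue.popleft()
        if current in anchors:
            return anchors[current]
        if current in seen:
            continue
        seen.add(current)
        # A term may have several parents, so follow all of them.
        queue.extend(parents.get(current, []))
    return None  # not below any suggested term: no match


print(score_for_eco("ECO:0000005"))
print(score_for_eco("ECO:0000204"))
```

With this toy map, the enzyme assay term resolves through its parent to score 0, while a term that is not below any suggested term (here ECO:0000204, author statement) yields no match.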

draeger commented 6 years ago

This looks like a very good start! Thanks @matthiaskoenig. We should also direct @tpfau to this suggestion.

matthiaskoenig commented 6 years ago

Just to add to this: the big advantage of annotating via ECO is that it allows storing the confidence and, especially, the basis of that confidence, because one can add multiple evidence annotations. This is crucial for metabolic network reconstructions and one of the big shortcomings of the current confidence scores.

One wants to store all the available evidence for a reaction, not only the lowest common denominator. For example, suppose one has a reaction R1:

there is some evidence based on homology to mouse (-> add an annotation for homology evidence)
there is some evidence based on protein data (-> add an annotation for experimental evidence based on protein)
there is some bioinformatics inference for R1 (-> add the inference evidence)
there is some indirect evidence based on mRNA (-> add the "inferred from experimental data" evidence)
there is some evidence only in a certain tissue, based on omics data (-> add a complex annotation of this evidence to the reaction)

Suddenly you have the whole collection of evidence and confidence for the reaction, not only a "0". Confidence scores are not something anybody should use in a reconstruction in 2018.

ChristianLieven commented 6 years ago

Supporting this is something @midnighter and @cdiener may also want to consider when improving the cobrapy parsers. Once this finds its way into Cobrapy.Model objects I'm very happy to start writing tests for this in memote.

Important to me is that one can directly link the ECO terms with links to the literature (DOI, PubmedID, etc). But if I understand @matthiaskoenig correctly, composite annotations would allow us to do this!

tpfau commented 6 years ago

In general, I think using ECO here is a very good idea. However, there are methods which rely on the 0-4 scheme used by BiGG, and we should offer some way to translate at least the ECO top levels:

        ECO:0000006 experimental evidence
        ECO:0000041 similarity evidence
        ECO:0000088 biological system reconstruction evidence
        ECO:0000177 genomic context evidence
        ECO:0000204 author statement
        ECO:0000212 combinatorial evidence
        ECO:0000311 imported information
        ECO:0000352 evidence used in manual assertion
        ECO:0000361 inferential evidence
        ECO:0000501 evidence used in automatic assertion
        ECO:0006055 high throughput evidence

To the 0-4 levels.

Linelili commented 5 years ago

Please note that COBRA's definition of the confidence scores (0= best, 4 = lowest confidence score) is inverse to the definition of Ines Thiele's and Bernhard Ø. Palsson's "A protocol for generating a high-quality genome-scale metabolic reconstruction", where 4 is the best and 0 the lowest confidence score (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3125167/table/T2/?report=objectonly).

Hence, using ECO numbers instead of scores from 0 to 4 might help to avoid confusion.

draeger-lab / ModelPolisher

SBO term for confidence score #5