ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Ontology representation #350

Open helenp opened 9 years ago

helenp commented 9 years ago

Following discussion on the MTT here is a proposal refining the definition and representation of ontologies and annotations in ontology.avdl.

Intent: The GA4GH Ontology schema provides structures for unambiguous references to ontological concepts and/or controlled vocabularies within Avro. The structures provided are not intended for de novo modeling of ontologies, or for representing complete ontologies within Avro. References to e.g. classes from external ontologies or controlled vocabularies should be interpreted only in their original context, i.e. the source ontology.

Usage: Multiple ontology terms can be supplied, e.g. to describe a series of phenotypes for a specific sample. The ontology.avdl is not intended to model relationships between terms, or to provide mappings between ontologies for the same concept. Should an OntologyTerm be unavailable, or a term be unmapped, an 'annotation' can be provided, which can later be mapped to an ontology term using a service designed for this. Using OntologyTerm is preferred to using Annotation, though annotations can be supplied with related ontology terms if desired. A use case could be a very specific free text annotation supplied together with a more general OntologyTerm.

New: Annotation - A free text annotation, not itself an ontology term, describing some attribute. Annotations have associations with OntologyTerms so that terms can be added after the annotations are captured. OntologyTerms are preferred over Annotations in all cases. Annotations can be used in conjunction with OntologyTerms.

Newly defined OntologyTerm - the preferred term for the class in question. For example, for http://purl.obolibrary.org/obo/HP_0011927 the preferred term is 'short digit' and a synonym is 'VERY SHORT DIGIT'; 'short digit' is the term that should be used.

Newly defined OntologyTerm identifier - An identifier for a single ontology term from a single ontology source specified as a CURIE (preferred) or PURL

Newly defined OntologySource - the name of ontology from which the term is obtained. e.g. 'Human Phenotype Ontology'

Newly defined OntologySource identifier - the identifier, a CURIE (preferred) or PURL, for an ontology source, e.g. http://purl.obolibrary.org/obo/hp.obo

Newly defined OntologySource version - the version of the ontology from which the OntologyTerm is obtained. E.g. 2.6.1. There is no standard for ontology versioning and some frequently released ontologies may use a datestamp, or build number.
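To make the proposed fields concrete, here is a minimal Python sketch of the three structures described above. It is not the Avro schema itself: the class layout, the attribute names, the free text in the annotation, and the use of '2.6.1' as an HPO version are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class OntologySource:
    name: str                      # e.g. 'Human Phenotype Ontology'
    identifier: str                # CURIE (preferred) or PURL of the ontology
    version: Optional[str] = None  # version string, datestamp, or build number


@dataclass(frozen=True)
class OntologyTerm:
    term: str                # preferred term, not a synonym
    identifier: str          # CURIE (preferred) or PURL of the single term
    source: OntologySource


@dataclass
class Annotation:
    text: str  # free text, not itself an ontology term
    # Related terms may be attached after the annotation is captured.
    ontology_terms: List[OntologyTerm] = field(default_factory=list)


hpo = OntologySource(
    name="Human Phenotype Ontology",
    identifier="http://purl.obolibrary.org/obo/hp.obo",
    version="2.6.1",  # illustrative value only
)
short_digit = OntologyTerm(
    term="short digit",
    identifier="http://purl.obolibrary.org/obo/HP_0011927",
    source=hpo,
)

# Hypothetical, very specific free text paired with a more general term.
note = Annotation(text="very short fifth finger, left hand")
note.ontology_terms.append(short_digit)
```

The annotation starts with an empty term list and gains a term later, which mirrors the "capture first, map afterwards" use case in the proposal.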

nmcabili commented 9 years ago

seems good to me. +1.

cmungall commented 9 years ago

versioning: in OWL the correct way to do this is to use versionIRI. Not all OWL ontologies do this, but the ones that do not are becoming outliers.

The versionIRI can be any IRI. In the context of the OBO Library, the versionIRI follows a standard pattern (the base PURL of the ontology, followed by an optional 'releases' path segment, followed by the ISO-8601 date, YYYY-MM-DD). Furthermore, any versionIRI that follows this standard pattern has a standard reduced form in obo format.

For example:

http://purl.obolibrary.org/obo/pato/releases/2015-04-09/pato.owl

Resolves to OWL with the following in the header:

        <owl:versionIRI rdf:resource="http://purl.obolibrary.org/obo/pato/releases/2015-04-09/pato.owl"/>

The corresponding obo file contains:

data-version: releases/2015-04-09
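The reduction from a versionIRI to the obo data-version can be sketched as follows. This is a rough illustration, not an official OBO tool: the function name `obo_data_version` and the regular expression are assumptions inferred from the single PATO example above, and the handling of IRIs without a 'releases' segment is a guess at the "optional" part of the pattern.

```python
import re
from typing import Optional

# Assumed shape: <base PURL>/<ontology>/[releases/]<YYYY-MM-DD>/<file>
_OBO_VERSION_IRI = re.compile(
    r"^http://purl\.obolibrary\.org/obo/"
    r"(?P<ontology>[^/]+)/"
    r"(?P<releases>releases/)?"
    r"(?P<date>\d{4}-\d{2}-\d{2})/"
)


def obo_data_version(version_iri: str) -> Optional[str]:
    """Return the reduced obo-format version, or None if the IRI
    does not follow the assumed OBO Library pattern."""
    match = _OBO_VERSION_IRI.match(version_iri)
    if match is None:
        return None
    return (match.group("releases") or "") + match.group("date")


pato = "http://purl.obolibrary.org/obo/pato/releases/2015-04-09/pato.owl"
print(obo_data_version(pato))  # releases/2015-04-09
```

IRIs that do not match return `None`, so a caller can fall back to treating the version as an opaque string.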

Not all ontologies of relevance to GA4GH provide resolvable versionIRIs. However, I have been working on a mechanism that makes this easy for any ontology managed in GitHub, and I have been successful in migrating many ontologies to GH. Is there a list or registry, e.g. in YAML format, that shows all ontologies of interest? I can annotate this list with the version policy for each ontology and, with the help of folks here like @helenp and @mellybelly, encourage movement to a well-defined system. Comments on the OBO version system are also more than welcome.

cmungall commented 9 years ago

There is a potential problem with the current ontologies.avdl.

Currently the OntologyTerm has an identifier which might reasonably be expected to be a primary key (or in programmatic terms can be used as a key in a lookup table; or if we were to use JSON-LD this would be the @id that denotes the RDF-resource).

The record also has a version field. Regardless of the format of this version field, we have a potential major problem because the same identifier may denote different versions of an OntologyTerm within a single GA4GH compliant source. It will be difficult to define coherent services this way.
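The collision can be illustrated with a small Python sketch. The term ID, labels, and version dates are invented for the example, and the `OntologyTerm` tuple is only a stand-in for the Avro record.

```python
from typing import Dict, NamedTuple


class OntologyTerm(NamedTuple):
    id: str
    label: str
    version: str  # version of the source ontology


# Two hypothetical snapshots of the same term from different releases.
old = OntologyTerm("HP:1234", "old label", "2015-01-01")
new = OntologyTerm("HP:1234", "new label", "2015-06-01")

# Treating the identifier as a primary key: the second snapshot
# silently replaces the first.
by_id: Dict[str, OntologyTerm] = {}
for term in (old, new):
    by_id[term.id] = term

print(len(by_id))                # 1
print(by_id["HP:1234"].version)  # 2015-06-01
```

A service keyed on the identifier alone cannot hold both snapshots, which is the incoherence described above.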

Some options:

Make OntologyTerms immutable

i.e. using the same ID would always return the same OntologyTerm object.

This has some nice properties, but the boat has long sailed on this one.

We could implement this by folding the version into the identifier... but this would be highly impractical

Use a compound key

An OT would be uniquely identified by a (ID, ontologyVersion) tuple. But this would be unintuitive, and undesirable for various reasons - the JSON-LD would be fairly impractical if we go that route

Change the modeling

introduce an extra layer, like this:

record OntologyTerm {
  string id;
  string label; /* mutable */
}
/* Pairs a term with the ontology version in use when the reference was made. */
record OntologyReference {
  string ontologyVersion;
  OntologyTerm ontologyTerm;
}

other parts of the schema would use ontologyReferences to make annotations

In other words, it is the act of referencing that is associated with an ontology version. "I used the version of HP:1234 from 2015-01-01".

(it may be the case that OntologyReference will be extended into a generic oban-style annotation object, but I'd like to separate that discussion for now)
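As a rough Python counterpart to the extra layer proposed above, the sketch below keeps one term object per ID and moves the version onto the reference. The class names follow the records in this comment; the label, the dates, and the `terms` lookup table are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OntologyTerm:
    id: str
    label: str  # mutable: may change between ontology releases


@dataclass
class OntologyReference:
    ontology_version: str        # version in use when the reference was made
    ontology_term: OntologyTerm


# The ID is a usable primary key again: one entry per term.
terms: Dict[str, OntologyTerm] = {
    "HP:1234": OntologyTerm(id="HP:1234", label="example label"),
}

# Two annotations made against different releases share the same term.
refs: List[OntologyReference] = [
    OntologyReference("2015-01-01", terms["HP:1234"]),
    OntologyReference("2015-06-01", terms["HP:1234"]),
]

for ref in refs:
    print(f"used {ref.ontology_term.id} from {ref.ontology_version}")
```

The trade-off is one more level of nesting for every place in the schema that points at a term.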

mbaudis commented 9 years ago

@cmungall Regarding version conflicts: While this could not be enforced through the schema, one could just describe the recommended order of precedence (i.e. idWithVersion overrides separate version).

One problem with the reference is that we would be nesting even deeper:

mbaudis commented 9 years ago

The experimental working version of ontologies.avdl for metadata, which is open to quick edits, resides on the metadata branch: https://github.com/ga4gh/schemas/blob/metadata/src/main/resources/avro/ontologies.avdl

Questions:

helenp commented 9 years ago

@cmungall Thanks for the comments. @mbaudis I suggest we change the spec to indicate where version info can be found, and EBI will collate these with the CURIEs as proposed by Chris.

mbaudis commented 9 years ago

@cmungall @helenp So, could you please provide an example (pseudo code is fine)?

selewis commented 9 years ago

Be easier to do this with a picture but I'll try here (keep in mind this is overly simplified and lots of things are being left out):

Patient-object:id1 ---exhibitsphenotype--> Association-object:(includes date/version)---classifiedby--> OntologyTerm:idX
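The same chain can be written as a small Python sketch, under the assumption that each arrow becomes an attribute. The class and attribute names are derived from the diagram, the version date is invented, and much is left out, as in the diagram itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyTerm:
    id: str


@dataclass
class Association:
    ontology_version: str        # the "date/version" carried by the association
    classified_by: OntologyTerm


@dataclass
class Patient:
    id: str
    exhibits_phenotype: List[Association] = field(default_factory=list)


patient = Patient(id="id1")
patient.exhibits_phenotype.append(
    Association(ontology_version="2015-01-01", classified_by=OntologyTerm("idX"))
)

# Walking the chain from patient to term takes two hops.
print(patient.exhibits_phenotype[0].classified_by.id)  # idX
```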

Nesting per se is not a problem. If it is needed, then it is needed.