ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

OntologyTerm.id definition #621

Closed jeromekelleher closed 7 years ago

jeromekelleher commented 8 years ago

The docs for OntologyTerm.id read as follows:

  Ontology source identifier - the identifier, a CURIE (preferred) or
  PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo
  It differs from the standard GA4GH schema's :ref:`id <apidesign_object_ids>`
  in that it is a URI pointing to an information resource outside of the scope
  of the schema or its resource implementation.
  */
  string id;

My interpretation of this is that we are providing a PURL for an ontology source, not the ID for a specific term (which is what the ID is currently interpreted as in the reference server). What is the correct interpretation of this field?

mbaudis commented 8 years ago

Yes, this is not very clear terminology. Maybe @helenp could clean this up ... ?

mellybelly commented 8 years ago

You are very likely going to need versioned IRIs for whole ontologies, individual entity (class or property) URIs, and some mechanism for defining subgraphs. Regarding CURIEs, somewhere else we suggested the inclusion of a registry of preferred CURIE prefixes for specific slots in the schemas.

For example, you can see Monarch's registry example here: https://github.com/monarch-initiative/dipper/blob/master/dipper/curie_map.yaml

(we also have a community registry here, if you would like to add GA4GH prefixes to it: https://github.com/prefixcommons/biocontext)

david4096 commented 8 years ago

@mellybelly thanks for your input! The current ontology term definition is as follows.

message OntologyTerm {
  // Ontology source identifier - the identifier, a CURIE (preferred) or PURL
  // for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo It
  // differs from the standard GA4GH schema's :ref:`id <apidesign_object_ids>`
  // in that it is a URI pointing to an information resource outside of the
  // scope of the schema or its resource implementation.
  string id = 1;

  // Ontology term - the representation the id is pointing to.
  string term = 2;

  // Ontology source name - the name of ontology from which the term is obtained
  // e.g. 'Human Phenotype Ontology'
  string source_name = 3;

  // Ontology source version - the version of the ontology from which the
  // OntologyTerm is obtained; e.g. 2.6.1. There is no standard for ontology
  // versioning and some frequently released ontologies may use a datestamp, or
  // build number.
  string source_version = 4;
}

@mellybelly I believe we have whole ontology description and versioning (via sourceVersion and sourceName) and individual entity (via id). This allows the exact version of the ontology used to provide a given term. We include the term for convenience, and could do the same with description, but I believe we just need to provide guidance on how to use the object provided.

For example, Ensembl hosts an Ontology lookup service that expects IDs to be CURIEs: GO:123456 searches for identifier 123456 in the Gene Ontology. However, since the GA4GH Ontology Term has fields for each relevant element (sourceName, sourceVersion), there is no need to use compact URIs. I suggest that we change id to be source_id and move any information that is a URI to a field named as such. The comments should change to reflect this, and the full URI to the Ontology Source could be provided.

diekhans commented 8 years ago

If we are having to guess at the meaning, the documentation needs to be improved.

david4096 commented 8 years ago

tl;dr my suggestions: replace id with source_id, remove mentions of PURL and CURIE on that field, and if a field is expected to have a URI, say so in the field name (i.e. source_uri).

diekhans commented 8 years ago

We need to rename the `id` field, as that conflicts with the standard use of that field name in GA4GH. I don't know what a better term is, @cmungall?


diekhans commented 8 years ago

ontology_id is probably a good name. I believe the PURL/CURIE requirements should remain.


david4096 commented 8 years ago

Can we gather consensus on renaming id to ontology_id and requiring that the ontology_id be a CURIE? The alternative would be to denormalize the data into the message fields as I have proposed.

cmungall commented 8 years ago

ontology_id suggests an identifier for the ontology itself. Other possibilities may be drawn from [ontology_]{term,class,entity}_id

Is there a technical problem with it being simply id? I assume the reason is there is a reasonable desire for consistent semantics of fields across the schema. The semantics needn't conflict if there is conceptually a super-property of id with strict semantics (e.g global uniqueness) but loose syntactic requirements, and within each sub-schema each instance of id retains the semantics but may impose stricter syntactic requirements.

Agreed this should be a CURIE

david4096 commented 8 years ago

You're right in your presumption that there's no technical reason to rename the id field. It is only meant to clarify that we are describing an external resource. However, this issue can be closed without addressing that.

For OntologyTerm.id, the documentation would be clearer if we removed the mention of PURL.

  // Ontology source identifier - the identifier, a CURIE. It differs from 
  // the standard GA4GH schema's :ref:`id <apidesign_object_ids>`
  // in that it is a URI pointing to an information resource outside of the
  // scope of the schema or its resource implementation.

Then, in addition to source name we could provide source_uri.

  // The PURL to the source for the Ontology Term.
  string source_uri = n;

Then the source name could be left to the user to specify, and the stricter source_uri is a named field that states it is expected to point at an exact resource. @jeromekelleher @cmungall what do you think? Alternatively, we could omit the source_uri and add a comment to the source_name that it expects a PURL.

diekhans commented 8 years ago

@cmungall, good point, term_id is clearer.

The problem with simple `id` is that all other objects in the GA4GH API have an `id` field that is a server-instance local identifier assigned by the server. It was a choice of terminology that was modeled after the single-instance web APIs and is nonintuitive to the federated, bioinformatics way of thinking.

I will leave the source_name vs source_uri (or does the API use `url`?) discussion to the people who know the finer points of ontologies.

There is the potential of a lot of OntologyTerm objects, so I am a little worried about size explosion. Not worried enough to change anything until we have a real problem.


mbaudis commented 8 years ago

@diekhans @cmungall We had a switch from id to ontologyId before, which didn't get propagated somehow. The Ontology Term Development Page uses ontologyId consistently, with all examples being URIs.

I agree with the description by @david4096 about a clarification to be made between id and url. No solution from my side, though; however, IMO never trust a URI to be persistent.

mbaudis commented 8 years ago

... regarding the term_id: no problem with that; this and term_uri would cover the bases. However, it introduces some redundancy.

mbaudis commented 8 years ago

This should be addressed. I am all for replacing OntologyTerm.id with OntologyTerm.term_uri (required). Again, simply id doesn't fit the schema's usage mode. Alternative would be OntologyTerm.term_id, in case it should not be restricted to URIs (no idea if there is reasoning for that).

Please "vote" for id | term_uri | term_id | other_brilliant_name (esp. @mellybelly, @cmungall, @helenp, @mcourtot).

mcourtot commented 8 years ago

ICD-O, which we were looking at with @mbaudis, http://codes.iarc.fr/codegroup/2, doesn't seem to have URIs for its terms (though SNOMED seems to at least partially map to them, e.g. the annotation property ICD-O-3 Code at http://purl.bioontology.org/ontology/SNOMEDCT/76345009).

Do we expect all the ontologies we use to provide URIs for their terms and, where they do not, that URIs would be provided some other way, maybe using identifiers.org via the prefixcommons registry @mellybelly mentions above?

If yes, and we expect all terms to have CURIEs, I'd suggest OntologyTerm.term_curie.

david4096 commented 8 years ago

Thanks @mcourtot! Your comment leads me to believe we cannot expect a URI from every Ontology Term! Bear with me for a second.

As a developer I need a way to reliably and uniquely identify ontology terms within the ontology source. If I reproduce whichever sort of string they use to identify their term, then I've done my job because you'll be able to retrieve it in the Ontology source if you need to. If one Ontology uses CURIEs and another does not, it isn't terribly important what I put in that field as long as it can be used to get at the term in the source.

id in the case of OntologyTerm is meant to uniquely identify the term within the Ontology itself. But as @jeromekelleher notes, we shouldn't use id since this has a reserved meaning for messages in the data model.

I propose that id is changed to term_id and that it serves the same purpose in the data model it has to this point: a string that can be used to uniquely identify the term in the source. This improves the situation regarding the usage of the reserved id and serves the same use case. We could further improve the richness of the model by adding an optional term_uri field, for when an ontology source uses them or they are readily available. You can further increase the granularity of the model by adding the specific term_curie and term_purl fields in addition to term_id.

The documentation would change to reflect that the term_id is: "a string that can be used to uniquely identify the term in an ontology source. Depending on the Ontology source this might be a CURIE, a string of digits, or a complete URL."

// An ontology term describing an attribute. (e.g. the phenotype attribute
// 'polydactyly' from HPO)
message OntologyTerm {
  // A string that can be used to uniquely identify the term in an ontology source. Depending on the
  // Ontology source this might be a CURIE, a string of digits, or a complete URL.
  string term_id = 1;

  // Ontology term - the representation the id is pointing to.
  string term = 2;

  // Term CURIE - A CURIE pointing to the term e.g. SO:0001234
  string term_curie = 3;

  // Term PURL - A PURL pointing to the term e.g. http://purl.bioontology.org/ontology/SNOMEDCT/76345009
  string term_purl = 4;

  // Ontology source name - the name of ontology from which the term is obtained
  // e.g. 'Human Phenotype Ontology'
  string source_name = 5;

  // Ontology source version - the version of the ontology from which the
  // OntologyTerm is obtained; e.g. 2.6.1. There is no standard for ontology
  // versioning and some frequently released ontologies may use a datestamp, or
  // build number.
  string source_version = 6;
}

mbaudis commented 8 years ago

This is an excellent comment @david4096. I was actually holding back from suggesting to populate OntologyTerm with the additional attributes you suggested (I had not thought about the purl, though).

A problem here is that you introduce multiple addresses to the same external reference, which may conflict (sloppy data editing, change of ontology resources ...). Options (not claiming to be exhaustive here..):

But I have no problem to go with the "maximal version", if it is accepted not to cause headaches.

Let's hash this out ASAP.
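
The conflict risk raised in this comment could be caught with a cross-check between the redundant fields. The following is only a sketch under stated assumptions: `PREFIX_MAP` and `find_conflicts` are hypothetical names, the single prefix entry is illustrative rather than taken from any official GA4GH registry, and the field names follow the term_curie / term_uri proposals made in this thread.

```python
from typing import Dict, List, Optional

# Hypothetical prefix map, for illustration only; not an official GA4GH registry.
PREFIX_MAP: Dict[str, str] = {
    "SO": "http://purl.obolibrary.org/obo/SO_",
}


def find_conflicts(term_curie: Optional[str],
                   term_uri: Optional[str],
                   prefix_map: Dict[str, str]) -> List[str]:
    """Return human-readable problems found when cross-checking the two fields."""
    if not term_curie or not term_uri:
        return []  # at most one address was given, so nothing can disagree
    prefix, sep, local_id = term_curie.partition(":")
    if not sep or prefix not in prefix_map:
        return [f"cannot expand {term_curie!r}: no known prefix"]
    expected = prefix_map[prefix] + local_id
    if expected != term_uri:
        return [f"term_uri {term_uri!r} does not match {expected!r}"]
    return []
```

A server or client could run such a check at ingest time; whether a mismatch should be rejected or merely logged is a policy question this sketch does not settle.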

mcourtot commented 8 years ago

Thanks for the feedback @mbaudis and @david4096. I like term_id, but would suggest the definition be slightly amended to: "The unique identifier of the term in an ontology source. This should be a CURIE (e.g., SO:0001234), but may be an alphanumerical string (e.g., 8000/3) if no CURIE is available."

Specifically, I'd like to discourage use of full URLs in this field, as it has the potential to conflict with the way we identify the OntologySource. It also makes the order of preference clear.

I'm wondering if we need a separate term_curie field: at best it will duplicate the info from term_id, in the neutral case it will be empty, but at worst it could conflict with the term_id field (consider for example GO:1234567 and GO_1234567, where one is the CURIE but the other could be considered as 'local id' in an OWL file serialised as RDF/XML).

I like the idea of term_PURL, because I can see it being attractive to just use that field to dereference ontology terms rather than trying to concatenate term_id and the Ontology source information. I'd suggest term_uri to be more generic, as many URIs are not PURL-based. As @mbaudis mentions, however, it has the potential to conflict and to be inflexible when URIs change, and we may need a precedence rule if we decide to go that way. So maybe, if we have a way to reconstruct the URI, we would be better off not including the extra attribute?

With respect to the ontology source identification. At the moment, there is: Ontology source name - the name of ontology from which the term is obtained, e.g. 'Human Phenotype Ontology': I think this may give rise to multiple values for the same resource, e.g., 'Human Phenotype Ontology', 'HP', 'HPO' which is IMO not desirable. A solution to this is using a registry such as the ones @mellybelly mentions above - it also has the advantage that we can add pretty much whichever information we want to this registry, e.g. if the CURIE prefix expansion changes we can update that seamlessly (e.g. we have been using http://purl.bioontology.org/ontology/SNOMEDCT/{sctid} but then decide to update to the official SNOMED URIs as per http://doc.ihtsdo.org/download/doc_UriStandard_Current-en-US_INT_20140527.pdf, and it should be updated to http://snomed.info/sct/{sctid}) We could then have Ontology source name - the name of ontology from which the term is obtained, e.g. 'HP', as taken from the GA4GH resources registry.

Maybe a rule like 'the ontology source name + the information from the registry + the term_id should allow one to rebuild a term_uri'? For example, using 'HP' we would get the line `HP = http://purl.obolibrary.org/obo/HP_` from the registry, and we can add the term_id value (in this case we would want to add only the numerical part of the CURIE due to the : vs _ issue). We could also get `HP = http://purl.obolibrary.org/obo/` and replace the : by an _ in the term_id. @mellybelly, @cmungall: I see you went with the former for Monarch - any specific reason?
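
The rebuild rule proposed here (source name + registry entry + term_id) can be sketched in a few lines. This is only an illustration under stated assumptions: `REGISTRY` and `build_term_uri` are hypothetical names, the registry holds a single hand-written entry in the style of the Monarch curie map mentioned earlier in the thread, and the HP identifier used below is a made-up placeholder.

```python
from typing import Dict

# Hypothetical registry keyed by short source name. Each value is the URI
# prefix to which the local part of a term identifier is appended.
REGISTRY: Dict[str, str] = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
}


def build_term_uri(source_name: str, term_id: str,
                   registry: Dict[str, str]) -> str:
    """Rebuild a term URI from the source name, the registry and the term_id."""
    uri_prefix = registry[source_name]  # raises KeyError for unregistered sources
    prefix, sep, local_id = term_id.partition(":")
    if not sep:
        # No CURIE prefix present: treat the whole term_id as the local part.
        return uri_prefix + term_id
    if prefix != source_name:
        raise ValueError(
            f"term_id prefix {prefix!r} does not match source {source_name!r}")
    # Append only the local part, which sidesteps the ':' vs '_' issue.
    return uri_prefix + local_id


print(build_term_uri("HP", "HP:0001234", REGISTRY))
```

This follows the first of the two options (the registry value already ends in `HP_`). The second option would store `http://purl.obolibrary.org/obo/` and replace the `:` with `_` instead.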

david4096 commented 8 years ago

I've opened a PR that attempts to close this issue by changing the id to term_id, changing the comments as suggested, and adding the option to specify a term_uri if it is available. https://github.com/ga4gh/schemas/issues/694

I've added an issue to continue discussion of the source_name field here.

cmungall commented 7 years ago

Remember, a CURIE that is separated from its context or prefix declarations is invalid.

https://www.w3.org/TR/curie/

There MUST be a prefix binding for the prefix (or the default prefix, if the prefix is absent) in scope

If we were working with JSON-LD then there is a defined way to specify prefixes in any given document. In protobuf, CURIEs are just strings, so we must roll our own equivalent to prefix declarations or contexts, and clearly document them.

For example, we could have it such that each version of the API is associated with exactly one immutable set of prefix declarations. Another option is that these can be specified as a map which is passed as extra arguments and returned as part of the payload. But these have to be defined before CURIEs are CURIEs
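
The second option (prefix declarations returned as part of the payload) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `prefixes` and `terms` keys, the `expand_curie` helper, and the payload shape are hypothetical and not part of the GA4GH schemas. It also simplifies the W3C CURIE syntax by requiring an explicit prefix rather than supporting a default one.

```python
from typing import Dict, List


def expand_curie(curie: str, prefixes: Dict[str, str]) -> str:
    """Expand a CURIE using only the prefix bindings that are in scope."""
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"{curie!r} has no prefix")
    if prefix not in prefixes:
        # Without a binding in scope the string is not a usable CURIE.
        raise ValueError(f"no prefix binding in scope for {prefix!r}")
    return prefixes[prefix] + reference


# Hypothetical response payload that carries its own prefix declarations.
response = {
    "prefixes": {"SO": "http://purl.obolibrary.org/obo/SO_"},
    "terms": [{"term_id": "SO:0001234"}],
}

uris: List[str] = [
    expand_curie(term["term_id"], response["prefixes"])
    for term in response["terms"]
]
print(uris)
```

With the first option (one immutable set of declarations per API version), the `prefixes` map would be a constant shipped with the client rather than a field in each response.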

david4096 commented 7 years ago

See in progress implementation here: https://github.com/ga4gh/server/pull/1523

We need to offer some way to easily find the full URI from the CURIE, I believe.