Closed jeromekelleher closed 7 years ago
Yes, this is not very clear terminology. Maybe @helenp could clean this up ... ?
You are very likely going to need versioned IRIs for whole ontologies, individual entity (class or property) URIs, and some mechanism for defining subgraphs. Re CURIEs, somewhere else we suggested the inclusion of a registry of preferred CURIE prefixes for specific slots in the schemas.
For example, you can see Monarch's registry example here: https://github.com/monarch-initiative/dipper/blob/master/dipper/curie_map.yaml
(we also have a community registry here, if you would like to add GA4GH prefixes to it: https://github.com/prefixcommons/biocontext)
@mellybelly thanks for your input! The current ontology term definition is as follows.
message OntologyTerm {
  // Ontology source identifier - the identifier, a CURIE (preferred) or PURL
  // for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo It
  // differs from the standard GA4GH schema's :ref:`id <apidesign_object_ids>`
  // in that it is a URI pointing to an information resource outside of the
  // scope of the schema or its resource implementation.
  string id = 1;

  // Ontology term - the representation the id is pointing to.
  string term = 2;

  // Ontology source name - the name of ontology from which the term is obtained
  // e.g. 'Human Phenotype Ontology'
  string source_name = 3;

  // Ontology source version - the version of the ontology from which the
  // OntologyTerm is obtained; e.g. 2.6.1. There is no standard for ontology
  // versioning and some frequently released ontologies may use a datestamp, or
  // build number.
  string source_version = 4;
}
@mellybelly I believe we have whole ontology description and versioning (via sourceVersion and sourceName) and individual entity (via id). This allows the exact version of the ontology used to provide a given term. We include the term for convenience, and could do the same with description, but I believe we just need to provide guidance on how to use the object provided.
For example, Ensembl hosts an Ontology lookup service that expects IDs to be CURIEs: GO:123456 searches for identifier 123456 in the Gene Ontology. However, since the GA4GH Ontology Term has fields for each relevant element (sourceName, sourceVersion), there is no need to use compact URIs. I suggest that we change id to be source_id and move any information that is a URI to a field named as such. The comments should change to reflect this fact, and the full URI to the Ontology Source could be provided.
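To make the CURIE convention above concrete, here is a minimal sketch of how a client might split a compact identifier such as GO:123456 into its prefix and local identifier. The helper name `split_curie` is hypothetical and is not part of the GA4GH schemas or of any Ensembl API.

```python
def split_curie(curie):
    """Split 'PREFIX:LOCAL' into (prefix, local_id).

    Hypothetical helper for illustration only.
    """
    prefix, sep, local_id = curie.partition(":")
    # Reject strings without a colon, with an empty half, or that look like
    # a full URL ('http://...' would otherwise parse as prefix 'http').
    if not sep or not prefix or not local_id or local_id.startswith("//"):
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, local_id


prefix, local_id = split_curie("GO:123456")
print(prefix, local_id)
```

The point of the sketch is that a CURIE packs the source and the local identifier into one string, which is the information the separate source fields would otherwise carry.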
If we are having to guess at the meaning, the documentation needs to be improved.
tl;dr my suggestions: replace id with source_id, remove mentions of PURL and CURIE on that field, and if a field is expected to have a URI, say so in the field name (i.e. source_uri).
We need to rename the `id` field, as that conflicts with the standard use of that field name in GA4GH. I don't know what a better term is, @cmungall ?
ontology_id is probably a good name. I believe PURL/ CURIE requirements should remain.
Can we gather consensus on renaming id to ontology_id and requiring that the ontology_id be a CURIE? The alternative would be to denormalize the data into the message fields as I have proposed.
ontology_id suggests an identifier for the ontology itself. Other possibilities may be drawn from [ontology_]{term,class,entity}_id
Is there a technical problem with it being simply id? I assume the reason is there is a reasonable desire for consistent semantics of fields across the schema. The semantics needn't conflict if there is conceptually a super-property of id with strict semantics (e.g. global uniqueness) but loose syntactic requirements, and within each sub-schema each instance of id retains the semantics but may impose stricter syntactic requirements.
Agreed this should be a CURIE
You're right in your presumption that there's no technical reason to rename the id field. It is only meant to clarify that we are describing an external resource. However, this issue can be closed without addressing that.
For OntologyTerm.id, the documentation would be clearer without the mention of PURL.
// Ontology source identifier - the identifier, a CURIE. It differs from
// the standard GA4GH schema's :ref:`id <apidesign_object_ids>`
// in that it is a URI pointing to an information resource outside of the
// scope of the schema or its resource implementation.
Then, in addition to source name we could provide source_uri.
// The PURL to the source for the Ontology Term.
string source_uri = n;
Then the source name could be left to the user to specify, and the stricter source_uri is a named field that states it is expected to point at an exact resource. @jeromekelleher @cmungall what do you think? Alternatively, we could omit the source_uri and add a comment to the source_name that it expects a PURL.
@cmungall, good point, term_id is clearer.
The problem with a simple `id` is that all other objects in the GA4GH API have an `id` field that is a server-instance local identifier assigned by the server. It was a choice of terminology that was modeled after the single-instance web APIs and is nonintuitive to the federated, bioinformatics way of thinking.
I will leave the source_name vs source_uri (or does the API use `url`?) discussion to the people who know the finer points of ontologies.
There is the potential of a lot of OntologyTerm objects, so I am a little worried about size explosion. Not worried enough to change anything until we have a real problem.
@diekhans @cmungall We had a switch from id to ontologyId before, which didn't get propagated somehow. The Ontology Term Development Page uses ontologyId consistently, with all examples being URIs.
I agree with the description by @david4096 about a clarification to be made between id and url. No solution from my side, though; however, IMO never trust a URI to be persistent.
... regarding the term_id: no problem with that; this and term_uri would cover the bases. However, it introduces some redundancy.
This should be addressed. I am all for replacing OntologyTerm.id with OntologyTerm.term_uri (required). Again, simply id doesn't fit the schema's usage mode.
The alternative would be OntologyTerm.term_id, in case it should not be restricted to URIs (no idea if there is reasoning for that).
Please "vote" for id | term_uri | term_id | other_brilliant_name (esp. @mellybelly, @cmungall, @helenp, @mcourtot).
ICD-O, which we were looking at with @mbaudis, http://codes.iarc.fr/codegroup/2, doesn't seem to have URIs for its terms (though SNOMED seems to at least partially map to them, e.g. the annotation property ICD-O-3 Code at http://purl.bioontology.org/ontology/SNOMEDCT/76345009)
Do we expect that all ontologies we will be using provide URIs for their terms, and that when they do not, some would be provided, maybe using identifiers.org via the prefixcommons registry @mellybelly mentions above?
If yes, and we expect all terms to have CURIEs, I'd suggest OntologyTerm.term_curie.
Thanks @mcourtot! Your comment leads me to believe we cannot expect a URI from every Ontology Term! Bear with me for a second.
As a developer I need a way to reliably and uniquely identify ontology terms within the ontology source. If I reproduce whichever sort of string they use to identify their term, then I've done my job because you'll be able to retrieve it in the Ontology source if you need to. If one Ontology uses CURIEs and another does not, it isn't terribly important what I put in that field as long as it can be used to get at the term in the source.
id in the case of OntologyTerm is meant to uniquely identify the term within the Ontology itself. But as @jeromekelleher notes, we shouldn't use id since this has a reserved meaning for messages in the data model.
I propose that id is changed to term_id and that it serves the same purpose in the data model it has to this point: a string that can be used to uniquely identify the term in the source. This improves the situation regarding the usage of the reserved id and serves the same use case. We could further improve the richness of the model by adding an optional term_uri field, for when an ontology source uses them or they are readily available. You can further increase the granularity of the model by adding the specific term_curie and term_purl fields in addition to term_id.
The documentation would change to reflect that the term_id is: "a string that can be used to uniquely identify the term in an ontology source. Depending on the Ontology source this might be a CURIE, a string of digits, or a complete URL."
// An ontology term describing an attribute. (e.g. the phenotype attribute
// 'polydactyly' from HPO)
message OntologyTerm {
  // A string that can be used to uniquely identify the term in an ontology source. Depending on the
  // Ontology source this might be a CURIE, a string of digits, or a complete URL.
  string term_id = 1;

  // Ontology term - the representation the id is pointing to.
  string term = 2;

  // Term CURIE - A CURIE pointing to the term e.g. SO:0001234
  string term_curie = 3;

  // Term PURL - A PURL pointing to the term e.g. http://purl.bioontology.org/ontology/SNOMEDCT/76345009
  string term_purl = 4;

  // Ontology source name - the name of ontology from which the term is obtained
  // e.g. 'Human Phenotype Ontology'
  string source_name = 5;

  // Ontology source version - the version of the ontology from which the
  // OntologyTerm is obtained; e.g. 2.6.1. There is no standard for ontology
  // versioning and some frequently released ontologies may use a datestamp, or
  // build number.
  string source_version = 6;
}
This is an excellent comment @david4096. I was actually holding back from suggesting to populate OntologyTerm with the additional attributes you suggested (I had not thought about the purl, though).
A problem here is that you introduce multiple addresses for the same external reference, which may conflict (sloppy data editing, change of ontology resources ...). Options (not claiming to be exhaustive here):
term_id types, which can be resolved unambiguously (regex...)
But I have no problem going with the "maximal version", if it is accepted not to cause headaches.
Let's hash this out ASAP.
Thanks for the feedback @mbaudis and @david4096. I like term_id, but would suggest the definition be slightly amended to
The unique identifier of the term in an ontology source. This should be a CURIE (e.g., SO:0001234), but may be an alphanumerical string (e.g., 8000/3) if no CURIE is available.
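As a rough illustration of how term_id values of different kinds could be told apart unambiguously by pattern, here is a minimal sketch. The regular expressions and the function name `classify_term_id` are assumptions made for this example; they are deliberately loose and are not a validated CURIE grammar.

```python
import re

# Assumed, deliberately loose patterns for illustration only.
URL_RE = re.compile(r"^https?://\S+$")
CURIE_RE = re.compile(r"^[A-Za-z][\w.-]*:[^\s/]\S*$")


def classify_term_id(term_id):
    """Return 'url', 'curie', or 'local' for a term_id string."""
    if URL_RE.match(term_id):
        return "url"
    if CURIE_RE.match(term_id):
        return "curie"
    # Anything else is treated as a source-local code, e.g. ICD-O '8000/3'.
    return "local"


for value in ("SO:0001234", "8000/3",
              "http://purl.bioontology.org/ontology/SNOMEDCT/76345009"):
    print(value, "->", classify_term_id(value))
```

A check like this could also be used to flag full URLs in term_id, if they are to be discouraged there.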
Specifically, I'd like to discourage use of full URLs in this field, as it has the potential to conflict with the way we identify the OntologySource. It also makes the order of preference clear.
I'm wondering if we need a separate term_curie field: at best it will duplicate the info from term_id, in the neutral case it will be empty, but at worst it could conflict with the term_id field (consider for example GO:1234567 and GO_1234567, where one is the CURIE but the other could be considered a 'local id' in an OWL file serialised as RDF/XML)
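To show what a conflict check between the two fields might look like, here is a small sketch. It assumes that the underscore form (GO_1234567) and the colon form (GO:1234567) should be treated as the same term; both function names are hypothetical.

```python
def normalize_local_id(identifier):
    """Map an OBO-style 'GO_1234567' onto the CURIE form 'GO:1234567'.

    Assumption for illustration: the first underscore plays the role of
    the CURIE colon when no colon is present.
    """
    if ":" not in identifier and "_" in identifier:
        prefix, _, local = identifier.partition("_")
        return f"{prefix}:{local}"
    return identifier


def fields_conflict(term_id, term_curie):
    """True when both fields are set but point at different terms."""
    if not term_curie:
        return False  # the 'neutral case': nothing to compare against
    return normalize_local_id(term_id) != normalize_local_id(term_curie)


print(fields_conflict("GO_1234567", "GO:1234567"))
print(fields_conflict("GO:1234567", "GO:7654321"))
```

Any such check needs a normalization rule agreed in advance, which is part of the cost of keeping both fields.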
I like the idea of term_purl, because I can see it being attractive to just use that field to dereference ontology terms rather than try to concatenate term_id and the ontology source information. I'd suggest term_uri to be more generic, as many URIs are not PURL based. As @mbaudis mentions, however, it has the potential to conflict and to be inflexible when URIs change, and we may need a precedence rule if we decide to go that way. So if we have a way to reconstruct the URI, maybe we would be better off not including the extra attribute?
With respect to the ontology source identification. At the moment, there is:
Ontology source name - the name of ontology from which the term is obtained, e.g. 'Human Phenotype Ontology'
I think this may give rise to multiple values for the same resource, e.g., 'Human Phenotype Ontology', 'HP', 'HPO', which is IMO not desirable. A solution to this is using a registry such as the ones @mellybelly mentions above. It also has the advantage that we can add pretty much whichever information we want to this registry, e.g. if the CURIE prefix expansion changes we can update that seamlessly (e.g. we have been using http://purl.bioontology.org/ontology/SNOMEDCT/{sctid} but then decide to update to the official SNOMED URIs as per http://doc.ihtsdo.org/download/doc_UriStandard_Current-en-US_INT_20140527.pdf, and it should be updated to http://snomed.info/sct/{sctid})
We could then have:
Ontology source name - the name of ontology from which the term is obtained, e.g. 'HP', as taken from the GA4GH resources registry.
Maybe a rule like 'the ontology source name + the information from the registry + the term_id should allow rebuilding a term_uri'? For example, using 'HP' we would get the line `HP = http://purl.obolibrary.org/obo/HP_` from the registry, and we can add the term_id value (in this case we would want to add only the numerical part of the CURIE due to the : vs _ issue). We could also get `HP = http://purl.obolibrary.org/obo/` and replace the : by an _ in the term_id. @mellybelly, @cmungall: I see you went with the former for Monarch - any specific reason?
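The rebuild rule proposed here could be sketched as follows. This assumes the first option, where the registry entry already ends in `HP_` and only the numerical part of the CURIE is appended. The `REGISTRY` dictionary stands in for a real prefix registry, and the term identifier used is purely illustrative.

```python
# Stand-in for a prefix registry; the single entry is an assumption
# following the first option described in the comment above.
REGISTRY = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
}


def rebuild_term_uri(source_name, term_id, registry=REGISTRY):
    """Rebuild a term URI from the source name, the registry and term_id."""
    base = registry[source_name]  # KeyError if the source is not registered
    # Keep only the part after the CURIE prefix, if there is one.
    local_part = term_id.split(":", 1)[-1]
    return base + local_part


# 'HP:0001234' is an illustrative identifier, not a claim about a real term.
print(rebuild_term_uri("HP", "HP:0001234"))
```

With the second option the registry value would end in `/obo/` and the function would instead replace the colon in term_id with an underscore; the result is the same URI either way.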
I've opened a PR that attempts to close this issue by changing the id to term_id, changing the comments as suggested, and adding the option to specify a term_uri if it is available. https://github.com/ga4gh/schemas/issues/694
I've added an issue to continue discussion of the source_name field here.
Remember, a CURIE that is separated from its context or prefix declarations is invalid.
There MUST be a prefix binding for the prefix (or the default prefix, if the prefix is absent) in scope
If we were working with JSON-LD then there is a defined way to specify prefixes in any given document. In protobuf, CURIEs are just strings, so we must roll our own equivalent to prefix declarations or contexts, and clearly document them.
For example, we could have it such that each version of the API is associated with exactly one immutable set of prefix declarations. Another option is that these can be specified as a map which is passed as extra arguments and returned as part of the payload. But these have to be defined before CURIEs are CURIEs.
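A minimal sketch of the first option, one immutable set of prefix declarations per API version, might look like this. The version label, the prefix entries, and the function name `expand_curie` are all assumptions made for illustration; nothing here is defined by the GA4GH schemas.

```python
from types import MappingProxyType

# Hypothetical version label and prefix bindings, for illustration only.
# MappingProxyType gives a read-only view, so bindings cannot be mutated.
PREFIXES_BY_API_VERSION = {
    "example-version": MappingProxyType({
        "SO": "http://purl.obolibrary.org/obo/SO_",
        "HP": "http://purl.obolibrary.org/obo/HP_",
    }),
}


def expand_curie(curie, api_version):
    """Expand a CURIE to a full URI using the bindings for one API version."""
    prefixes = PREFIXES_BY_API_VERSION[api_version]
    prefix, sep, local_id = curie.partition(":")
    # A CURIE without a prefix binding in scope is invalid, so fail loudly.
    if not sep or prefix not in prefixes:
        raise ValueError(f"no prefix binding in scope for {curie!r}")
    return prefixes[prefix] + local_id


print(expand_curie("SO:0001234", "example-version"))
```

The second option would pass the same kind of mapping as a request argument and echo it in the response, at the cost of larger payloads.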
See in progress implementation here: https://github.com/ga4gh/server/pull/1523
We need to offer some way to easily find the full URI from the CURIE, I believe.
The docs for OntologyTerm.id read as follows:
My interpretation of this is that we are providing a PURL for an ontology source, not the ID for a specific term (which is what the ID is currently interpreted as in the reference server). What is the correct interpretation of this field?