Closed cmungall closed 7 years ago
Please be aware that we'll be working on this in a coordinated effort of DWG-MTT and CWG. From some discussions with Melissa Haendel I have the understanding that ontology + id + name + version + URI/CURIE seems to cover most concepts; but we want to do some model implementations. For human disease descriptions, there are also a number of classification systems which will have to be accommodated.
OK, I will discuss this more with Melissa (@mellybelly) later today
The MME is hoping to converge on a compatible representation. Curious if there are any updates?
Not sure if there are updates from other WGs, but I think MME should continue to use CURIEs of the form HP:nnnnnnn. These are at least compatible with what major databases are using, and will work with the semweb stack (assuming default prefix declarations).
Okay, thanks. Will do.
IMHO specific implementations may define their own, more restrictive use of specific formats, e.g. it is fine for MME to restrict to CURIEs. In the general context, we cannot restrict use to specific ontologies.
I agree, we should not restrict to specific ontologies, though we can certainly recommend and test using a given set. Ideally we can stick to CURIEs and standardize prefixes, as we see lots of messes where this has not been done.
CURIEs with standardized prefixes (as @mellybelly suggested) appear to be a viable solution for MME groups right now. Using JSON-LD sounds interesting though. Maybe this could be an optional part of the GA4GH schemas.
This has been dormant since January. I'm closing this in 2 days unless there are objections.
It's still not resolved.
https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/ontologies.avdl
still calls the OBI URI an id. Unless the terminology is resolved and a standard way of identifying ontology classes is specified everyone will choose a different convention and complex ID processing code will be required to interoperate.
@cmungall Agreed. There was a presentation before the Easter break from the team behind the "FAIR data" principles. I was really impressed with their work, and I wonder if they have come up with a way to resolve this issue. I will direct him towards this.
For discussion: split into id and url in https://github.com/ga4gh/metadata-team/blob/master/avro-playground/metadata_redo.avdl
I think there is still the possibility for large confusion here. It mixes the concept of a URL fragment with an ID; these are sometimes but not always the same thing. Also "ID defined by external ontology source" doesn't really mean anything in a lot of cases. OBI do not define IDs anywhere. Their currency is URIs.
There is nothing here to prevent the scenario where people refer to the same OBI class as
Where is the authoritative list of CURIEs?
Define authoritative? Several large and equally reputable groups use different formats.
Precisely. Is there one which will work well? If we can't refer to a set of these, is it practical to use CURIEs? I am not against this; it just seems like the next obvious question.
It's difficult for sure. I know EBI are currently working on the next iteration of their ontology lookup service. As part of this, the data is searchable. For HPO, the term IDs are for example "hpo:http://purl.obolibrary.org/obo/HP_0200117". They also store an "id_annotation" as "HP:0200117", but also a list of "short_form" terms, which includes "HP_0200117".
As far as very useful sources of data go, EBI comes up pretty near the top. They've been looking at this, and it looks like the result is that there's no consensus. Currently, you need to use the format HP:0200117 to find the term by ID, but the current system is old, and a large update is coming.
I'm increasingly thinking we need an Ontology task team / working group (possibly as a sub of the DWG). /cc @ga4gh/global-alliance-contributors
This could be the perfect connector to ELIXIR.
Barend Mons
Hi Ben,
You have echoed a couple of recent conversations from other groups. This is a subject that covers so many different areas of work and is a real restriction on progress. There is a dedicated meeting at Leiden (on 9th June) to discuss it and I wouldn’t be surprised to see a new task team come out of that.
Cheers David
Ben - Happy that you find my group's resources useful and yes we are rewriting OLS. I am not convinced that splitting the ontology effort further is desirable - meta data has a large group of ontologists included already. More ontologists is not always a good thing.
@helenp Sure! I'm not necessarily suggesting more people are involved, but more specifically that it's made clear to everyone that the issues around ontologies are being looked at. Something like defining a list of CURIEs or a method of ontology term identification, that's then pushed out to all the working groups, would be hugely beneficial. I felt that formalising the work on ontologies would also allow the other working groups to know where to go to get answers / spear discussion about topics, and have clear document products as a result.
@D-lloyd I saw this on the agenda. I don't know what time has been allocated within that section, but I think it's key to discuss / hear what's happening with the OLS rewrite. I'm currently using one of the OLS tools in development to extract an ontology file to Solr for searching. A standardised way of importing data would not only be very beneficial to others by making the use of ontologies easier, but would also help inform agreements on which of the multiple term id representations (CURIEs) the Global Alliance should suggest / push for.
I have pointed @simonjupp at this thread (OLS dev) we should be able to do something that helps, Ben come and see us if it helps.
Well, we have an ontologies task team in the CWG already, but I agree that there are technical specifications that need work up that are likely out of scope for the CWG, which is focused more on use cases. We should discuss this in Leiden.
I prefer use of CURIES and it is likely that the GA4GH would want to have a registry of whatever is in use throughout all the schemas. It doesn't have to be about authority (as there are many overlapping "authoritative" sources) but rather what is required by any users/contributors of the GA4GH schemas to share their data. Perhaps we should consider some process by which anyone sharing data via GA4GH can register their CURIES in a shared repo. We'd need some inclusion/exclusion criteria and guidelines for contributors.
There is also work to be done to specify where and when certain ontology sources should or could be used. This is the much harder part ;-).
@nlwashington @kshefchek @cmungall perhaps we can do an example in G2P for how this might work with a diversity of disease and phenotype ontology sources, we are on the way towards that already.
@helenp Already in contact with Simon! Very helpful! Using the java code in the new OLS project to load in ontology data to Solr. Waiting for his return to ask further question on how I can integrate this data! =]
@mellybelly Another repository would most probably add to the confusion. Versioning is a major issue with ontologies! It's already a very messy problem. It looks like the OLS will allow you to look up terms based on multiple formats. Directing people to the new OLS should hopefully be really helpful. I understand it will be updated nightly, but of course one can always pin a version and run their own Solr install.
@Relequestual - I'll answer in more detail later, but answering your original question and the HP example. In a semantic web toolchain the URI is canonical. For any OBO library ontology, there is an authoritative deterministic way to map this to an identifier in OBO format, which is what is used by all bioinformatics dbs not based on a semweb tool chain, and the id would be HP:0200117.
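To illustrate the deterministic mapping mentioned above, here is a minimal Python sketch. It assumes the OBO Foundry PURL layout (`http://purl.obolibrary.org/obo/PREFIX_LOCALID`) with a numeric local ID; the function names are hypothetical and not part of any GA4GH or OBO library.

```python
import re

# Base of OBO library PURLs, as used for HP in this thread.
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"

# Assumption: the ID space is letters/underscores and the local ID is numeric.
_OBO_PURL = re.compile(
    r"^http://purl\.obolibrary\.org/obo/([A-Za-z][A-Za-z_]*)_(\d+)$"
)


def obo_purl_to_id(uri):
    """Map an OBO PURL to its OBO-style identifier (PREFIX:LOCALID)."""
    match = _OBO_PURL.match(uri)
    if match is None:
        raise ValueError("not an OBO library PURL: %r" % uri)
    prefix, local_id = match.groups()
    return "%s:%s" % (prefix, local_id)


def obo_id_to_purl(obo_id):
    """Map an OBO-style identifier back to its PURL."""
    prefix, sep, local_id = obo_id.partition(":")
    if not sep or not prefix or not local_id.isdigit():
        raise ValueError("not an OBO-style identifier: %r" % obo_id)
    return "%s%s_%s" % (OBO_PURL_BASE, prefix, local_id)
```

With this sketch, `obo_purl_to_id("http://purl.obolibrary.org/obo/HP_0200117")` gives `"HP:0200117"`, and `obo_id_to_purl` reverses it.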
More later...
@Relequestual did not mean to imply a new registry at all, quite the contrary. You are preaching to the choir wrt versioning, I am all too familiar with that problem.
I don't think the problem here is where to go to look up terms based on different formats or sources, we have plenty of places to do that (OLS, NCBO, Ontobee, NCI metathesaurus, etc.)
Some issues we need to think about are: a) how to reference an ontology class or property in the avro schemas, taking into account the provenance b) how to specify the domain or range in any given context, and allowable/useful sources - e.g. as per above comments, we don't want to limit people in their choices of ontologies as there are different needs, but at the same time, we want to limit the misuse of incorrect semantic types (e.g. not equate a variant with a gene, or a phenotype with a disease) c) how to represent IDs. You can see an example from Monarch where we have a CURIE map in a YAML file: https://github.com/monarch-initiative/dipper/blob/afd2d11bd6356f37333c5514c8d38071e02f1e58/dipper/curie_map.yaml
@cmungall After thinking about this quite a bit over the past few weeks, I agree with your advocacy of option 4.
I'm not sure this issue was discussed at the GA4GH plenary in Leiden last week. @cmungall Your finding of the OBO Foundry id policy link is most informative. Unless there's any reason not to, it seems like a good definitive method for solving this problem.
Question is, what does the solution look like in terms of updating the schemas?
Ben, we will discuss on the meta data call thursday 18 June.
We had some discussions in Leiden (with @mellybelly, @nlwashington, @diekhans). The current majority opinion is to have "polymorphic" ways to reference ontologies, and also to include a way to reference/use local ontologies (e.g. if no best fit reference is known/found, or for legacy => convert later etc.). Also, for records like "disease" (which has yet to be defined as a record type), we found it most sensible to allow references to multiple ontologies, with one being the primary (doodling here):
OntologyTerm primaryReference;
array<OntologyTerm> alternativeReferences;
Still, there hasn't been implementation work on the exact format of the OntologyTerm object; everybody is welcome, regarding the notes above ...
@helenp If you could write up and post the meeting notes afterwards here, that would be very helpful. I'm still waiting for my approval to go through for the GA4GH DWG mailing list.
Thanks, Paul
@selewis are you attending the call on June 18?
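My proposal is as follows:
1. The field id in the OntologyTerm object (this one: https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/ontologies.avdl#L21) be constrained to contain CURIEs (e.g. HP:0001234)
2. GA4GH endorses a set of CURIE prefixes that are consistent with the standard URIs used for classes in that ontology, e.g. "HP": "http://purl.obolibrary.org/obo/HP_"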
Note that (1) cannot be enforced within Avro AFAICT, but it would be trivial to write some kind of checker as an additional layer.
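As a rough illustration of the "checker as an additional layer" idea, the Python sketch below validates the `id` field of an OntologyTerm that has already been decoded to a dict. The regular expression is an assumed, deliberately loose CURIE shape (prefix, colon, local part without whitespace), not the full W3C CURIE grammar, and `check_ontology_term` is a hypothetical name.

```python
import re
from typing import List

# Assumption: a CURIE here is PREFIX:LOCAL, where PREFIX starts with a letter
# and LOCAL contains no whitespace and does not start with ':' or '/'.
CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:[^\s:/]\S*$")


def check_ontology_term(term: dict) -> List[str]:
    """Return a list of problems found in an OntologyTerm-like dict."""
    problems = []
    term_id = term.get("id")
    if not isinstance(term_id, str):
        problems.append("id is missing or not a string")
    elif term_id.startswith(("http://", "https://")):
        # Full URIs are explicitly rejected under this proposal.
        problems.append("id is a URI, expected a CURIE: %s" % term_id)
    elif not CURIE_PATTERN.match(term_id):
        problems.append("id is not a CURIE: %s" % term_id)
    return problems
```

An empty list means the term passed; for example `check_ontology_term({"id": "HP:0001234"})` returns `[]`, while an id such as `HP_0001234` or a full PURL is reported.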
This proposal can be seen as a subset of proposal #311 to use JSON-LD ubiquitously - however, the proposal in this ticket is in no way dependent on GA4GH endorsing JSON-LD in whole or in part. We can proceed independently.
For (2), the CURIE map could live within the GA4GH github repository (and sync with external sources), or it could point outwards to an externally maintained set of CURIE prefixes (e.g. this obo context: https://raw.githubusercontent.com/cmungall/biocontext/master/registry/obo_context.jsonld). Note there is no requirement for programmatic consumers or producers of GA4GH json, avro, services to be able to process a prefix map or json-ld file. The only imposition to developers in this proposal is that consistent ID formats are used. The prefix map will be primarily a social contract to ensure that the same class is referred to in the same way.
This proposal is neutral w.r.t whether a single ID or multiple IDs are used in an annotation (e.g. the disease scenario, where someone may want to record a NCIt class and a SNOMED class and a DOID class).
Some schemas (e.g. MME: https://github.com/MatchmakerExchange/mme-apis/blob/master/search-api.md#example) may opt not to use the OntologyTerm container and instead use a direct reference to an id field. In this case, I would recommend the same guidelines are followed, as if the id were inside an OntologyTerm container.
This proposal does not explicitly address versioning, but is compatible with a number of different schemes. As a strawman:
record OntologyTerm {
/**
A prefixed identifier (CURIE) such as `OBI:0001271`
*/
string id;
/**
The value of the owl:versionIRI field in the ontology
*/
union {null, string} versionIRI;
}
@pgrosu Don't you have access to the MTT rolling minutes? Could you pls. send me/Stephen your email?
Michael.
Hi Michael,
My email is pgrosu@gmail.com though I didn't think it would have been appropriate for me to join the call until I would be officially on the mailing list.
Regarding the MTT rolling minutes - I think I didn't scroll far enough - but would they be following?
Thank you, Paul
So far we have deferred a lot of the schema to use OntologyTerm or a list thereof. However, for many attributes and/or use cases, this won't be enough:
- [...] geneticSex)
- There will be many instances where e.g. the definition of a disease term won't be captured through a single OntologyTerm instance, or even through a list of those.
Now this can in principle be addressed in IMHO 3 ways:
1. [...] OntologyTerm object; however, they would have to be structurally well designed to be in line with external ontologies
2. [...] OntologyTerm object, thereby allowing ad hoc storage of data without ontology curation
3. [...] OntologyTerm, i.e. making the latter optional
In my opinion, the last variant is the most practical when considering a variety of scenarios (i.e. one can immediately make a local dataset GA4GH schema compatible for data mining purposes). Obviously, solutions making use of the schema should provide appropriate interfaces for making use of appropriate ontologies, and the schema documentation should encourage this. But leaving a reference to external ontologies as the only way to describe relevant attributes doesn't seem a viable option.
Or provide proof of how these and more scenarios would be handled...
Not much time to write, but one of the distinctions we discussed needing to make is regarding the semantics for when multiple terms are chosen. For example, if two disease terms are indicated, it would likely mean that the patient has two diagnoses (or two family members with them, or whatever the context). This is distinct from assigning two terms from different vocabs as alternatives, as @mbaudis indicates above. Then there are the semantics that might already be present between two terms indicated in this way. We also agreed that some uses of ontologyTerm would specify a single entity (e.g. you can only have one geneticSex), whereas others would expect an array (a set of phenotypes).
Everyone largely seemed to agree on the use of CURIEs and a CURIE map.
Also, @mbaudis @diekhans and @nlwashington and I discussed compliance testing that would leverage OWL reasoning as part of the reference implementation to ensure best use.
Use of non-registered CURIES would go through a registration request to check appropriate usage (more than constraining people) or could be a local extension. I think we'd largely want to discourage local extensions, but some good documentation about how to best include and document them could go a long way.
The compliance suite would also check for consistent ID formats and unregistered CURIES, pointing people to the registration page or making alternative suggestions based on existing OWL file equivalencies/xrefs.
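A very reduced sketch of such a compliance check might look like the following Python. The registry contents are only the HP and OBI prefixes mentioned in this thread, `audit_ids` is a hypothetical name, and the only "alternative suggestion" attempted is a case-insensitive prefix match; suggestions derived from OWL equivalencies or xrefs, as described above, are not attempted here.

```python
# Assumed stand-in for a registered CURIE map (prefix -> URI base).
REGISTERED_PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
}


def audit_ids(ids, registry=REGISTERED_PREFIXES):
    """Return (id, message) pairs for ids that fail the checks."""
    by_upper = {prefix.upper(): prefix for prefix in registry}
    report = []
    for term_id in ids:
        prefix, sep, local = term_id.partition(":")
        if "://" in term_id or not sep or not prefix or not local:
            report.append((term_id, "not a CURIE"))
        elif prefix in registry:
            continue  # registered prefix, nothing to report
        elif prefix.upper() in by_upper:
            # Same prefix in a different case: suggest the registered form.
            suggestion = "%s:%s" % (by_upper[prefix.upper()], local)
            report.append(
                (term_id, "unregistered prefix; did you mean %s?" % suggestion)
            )
        else:
            report.append(
                (term_id, "unregistered prefix; request registration "
                          "or document it as a local extension")
            )
    return report
```

Calling `audit_ids(["HP:0001234", "hp:0001234", "FOO:1"])` would leave the first id unreported, suggest `HP:0001234` for the second, and point the third towards registration.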
@helenp OK. I'd be interested to see the minutes from this meeting as I'm not part of the Metadata TT.
Hi Chris,
Since we are creating a data exchange API, we need to be able to handle a lot of legacy data that might not conform to the desired format. This is the idea behind local 'ontologies'. Even if we could, it would be difficult to create validation as part of the schema.
As you suggest, creating validation programs is a great solution. It allows tuning for the data set and creates more comprehensive validation than can be created by declaration alone.
Cheers, Mark
Hi Helen,
Thank you for the minutes, which are very helpful in getting me caught up with the project. I am still carefully going through them. Previously I was referring to waiting to join the DWG list in order to join those calls, though the MTT ones could be quite pertinent for me as well. Having worked on interfacing with ontologies before, I would like to get up to speed on the materials before joining the MTT calls, since there is quite a lot to catch up on.
I was unaware of the MTT minutes, which I think many would find very helpful in order to contribute properly. It might be very helpful if the links to the minutes from all the teams were posted on the GA4GH website and on GitHub (https://github.com/ga4gh/). This would probably be the quickest way for people to synchronize on all the projects.
Thank you, Paul
On Tue, Jun 16, 2015 at 5:20 PM, Helen Parkinson notifications@github.com wrote:
@pgrosu https://github.com/pgrosu I think we can just add you for the next call June 18th if that's of interest to you. MTT minutes here https://docs.google.com/document/d/1QXKjGJCRlHu6AUPNL0-wjOVe-6_55p1DQ2CSGjlxelk/edit
@pgrosu - there's a lot of process documentation in the minutes. The MTT is now ticketing all relevant items and better documenting these so that they are standalone. My preference is to use tickets as they are cleaner.
@helenp Ah, makes sense. Would these tickets be through MetadataTaskTeam labels as follows, or via another method (i.e. a different stored location):
https://github.com/ga4gh/schemas/labels/MetadataTaskTeam
Knowing the method would enable people to quickly get a glance at the status, and not fall behind on the progress.
Thanks, Paul
@pgrosu Labels. We have done some clean up.
@helenp Super, thank you :)
On Tue, Jun 16, 2015 at 12:35 PM, Chris Mungall notifications@github.com wrote:
@selewis https://github.com/selewis are you attending the call on June 18?
Yes, I'll be on the call. But I have a lot of homework to do to catch up with everything that happened while I've been away.
@mbaudis you mentioned a few days ago "Still, there hasn't been implementation work on the exact format of the ontologyTerm object; everybody is welcome, regarding the notes above ...". in creating the FuGE (http://fuge.sourceforge.net/dev/index.php#v1Final) standard a few years back, we thought long and hard about this. the UML model we came up with was this: https://cloud.githubusercontent.com/assets/1576739/8241367/c97cb676-15bd-11e5-98ab-72a7d40b864e.png we were interested in allowing both simple cases (i.e. no properties involved) and the ability to describe more complex terms, such as an automobile that would have properties for such things as engine, tires, etc. a brief usage document is here: http://fuge.sourceforge.net/presentation/fuge_ontology_best_practice.doc. the UML was implemented as XML for the standard but was based on a simplified RDF.
Aside: not sure what the GA4GH protocol is here but it feels like we should be spinning new issues here?
@mdmiller53 - thanks for sharing the doc. I'm not sure it precisely aligns to GA4GH requirements (though we may all have different ideas about what these are). The typical usage would be to represent an ontology class (rather than property or individual, if by individual you mean something like owl individual). There are situations where we may want to denote a property (aka relation) too (for example, in a generic functional annotation model). There may be situations where we want to model composition of ontology terms (see this UML ) but this is probably best discussed as a separate issue from the format of the class references.
Hi, I have been traveling like crazy (could not even attend the meeting in my home town) and apologize for not being in many calls lately, but I suppose that once Beacons go 'ontology' we adhere to FAIR and ELIXIR interop. developments? Regards
Barend Mons
Hi All New to the group, so please excuse if this has already been worked out... ...for any specified phenotype ontology term, how will one distinguish between the different things you might want to communicate with/about that term, e.g.
etc. The FuGE model seems like it might have that covered (via OntologyProperty??), or perhaps that is just about defining the term itself? Has this group yet got deeply into the differential use of ontologies in schemas, exchange and queries, as opposed to the means for specifying the term itself? Cheers Tony
Professor Anthony J Brookes, Department of Genetics, University of Leicester
@antbro Can we keep to the issue of the topic please? =] Do by all means create a new issue!
Current docs state:
This is fairly open ended and we can imagine confusion and inconsistent usage here.
For the ontologies currently referenced in the metadata schema, e.g.
Terms are typically referenced in two ways.
URIs/IRIs
For many biological ontologies these are typically obolibrary purls, which follow:
See: http://www.obofoundry.org/id-policy.shtml
OBO-Style identifiers
Typically follow the form
Options
Option 1 is probably the conceptually simplest. Option 2 is not very future proof as it doesn't allow open-ended expansion to any ontology out there on the semantic web. Option 3 is probably overkill.
I would advocate option 4. To elaborate, we allow the field to contain either a URI or a CURIE (https://en.wikipedia.org/wiki/CURIE see also http://www.w3.org/TR/curie/), without the brackets. We then assume the existence of a number of implicit qname prefixes. E.g.
This could potentially live in a separate JSON-LD context file.
This is also consistent with the translation in the OBO-Format spec: http://oboformat.googlecode.com/svn/trunk/doc/obo-syntax.html#5.9.1
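To make option 4 more concrete, here is a small Python sketch of how a consumer might treat a field that holds either a URI or a CURIE. The prefix map below is an assumption for illustration (only the HP and OBI bases quoted in this thread), and `to_uri` and `same_class` are hypothetical helper names.

```python
# Assumed implicit prefix declarations (prefix -> URI base).
IMPLICIT_PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
}


def to_uri(value, prefixes=IMPLICIT_PREFIXES):
    """Return the full URI for a value that is either a URI or a CURIE."""
    if value.startswith(("http://", "https://")):
        return value  # already a URI, keep as-is
    prefix, sep, local = value.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError("unknown or missing prefix in %r" % value)
    # CURIE expansion is plain concatenation of the base and the local part.
    return prefixes[prefix] + local


def same_class(a, b, prefixes=IMPLICIT_PREFIXES):
    """True if two references (URI or CURIE) denote the same class."""
    return to_uri(a, prefixes) == to_uri(b, prefixes)
```

Under these assumptions, `OBI:0001271` and `http://purl.obolibrary.org/obo/OBI_0001271` compare as the same class, which is the interoperability property the shared prefix declarations are meant to guarantee.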
I would be happy to branch and make a pull request, but I thought it worthwhile polling for opinions. Need this to be future-proof, consistent - but also not over-engineered.