Closed FuqiX closed 2 years ago
@FuqiX I can see several angles but we'd need more background information. Can you provide more details about the requests from NIST or interactions so far?
Regarding the recipes, I have generated a recipe about how to select ontologies. I know you saw it and you pointed me to the following material:
As to 'Why use ontologies', we touched upon this but it could be expanded. Here we should link with @oyadenizbeyan and @iemam, as this relates to CMM.
I see the recipes as sets of steps depicting how to do something. Maybe some kind of introduction/background section is needed, describing some foundational artifacts (e.g., ontologies) and WHY they are FAIR principles enablers.
Hi,
TL;DR We need a recipe/introductory section that explains the importance of ontologies and helps users identify their ontology use cases and find the right recipes for them.
Background info:
The NIST consortium is building CRISPR data standards, which include related ontology recommendations, checklists, and other FAIR practices, similar to the transcriptomics data standards we worked with.
As Emiliano pointed out, they are looking for foundational artefacts, like ontologies and schemas, that are related to the FAIR principles. Maybe we need another introductory recipe/section of the book on the importance of FAIR data which links out to the individual sections of the cookbook.
Currently, our discussion focuses on ontologies. The first need is to provide an introduction to the importance and benefits of using ontologies (that's the recipe I am requesting in this issue). Also, for people who are not familiar with ontologies, it would be nice to provide them with a general picture of this domain.
We have an ontology recipe collection in the cookbook. However, it is not helpful unless we can also provide a description of how the recipes fit together. To discuss this further, Helen suggested we add reciprocal links in OLS and point people to the ontology recipes. The challenge is that our recipes sit at a very granular level of detail and it is hard to navigate through this collection. Some pieces are missing, and the recipes are not aligned with use cases or the FAIRification process.
@mcourtot Please feel free to add more details.
Hi @FuqiX, @proccaserra,
I agree that we could do with a more general introductory section on ontologies and I'd be happy to contribute to this (as long as it can wait until after the vF2F). I'm not sure it necessarily needs to be a full "Ontology 101" though. Plenty of excellent work has been done in that respect and the introduction should link out to existing papers, blogs and guides that cover the subject and tie it all together in the context of F+.
@daniwelter @FuqiX I have the following, which we can probably build upon:
authors: Philippe Rocca-Serra
maintainers: Philippe Rocca-Serra
version: initial draft
license: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
The aim of this recipe is to provide a compact introduction to controlled terminologies and ontologies, why these resources are central to the preservation of knowledge and data mining, and how such resources are developed.
The need for a controlled vocabulary often arises in situations where validation of textual information is necessary for operational requirements.
The main initial driver for data entry harmonization is to increase query recall. In its most basic form, keywords may be used for indexing. However, if relying on user input alone, the chances of typographic errors increase with the number of users. These unavoidable errors accumulate over time and end up hurting the accuracy of search results, which is the reason for offering sets of predefined values: it reduces the noise.
However, this can come at the cost of precision, as the predefined terms may not cover the exact thing users need to describe. Furthermore, term mis-selection by users is not eliminated and introduces another type of error.
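The validation idea above can be sketched in a few lines of Python. This is a minimal illustration, not part of the recipe: the `ALLOWED_ORGANISMS` set and both function names are hypothetical, and the values are examples only.

```python
# Hypothetical controlled vocabulary for an "organism" field (illustrative values only).
ALLOWED_ORGANISMS = {"Homo sapiens", "Mus musculus", "Danio rerio"}


def validate_entry(value, allowed=ALLOWED_ORGANISMS):
    """Return the whitespace-trimmed value if it is a predefined term, else raise ValueError."""
    cleaned = value.strip()
    if cleaned not in allowed:
        raise ValueError(f"{cleaned!r} is not an allowed term")
    return cleaned


def split_valid_invalid(values, allowed=ALLOWED_ORGANISMS):
    """Partition free-text entries into those matching the vocabulary and those that do not."""
    valid, invalid = [], []
    for value in values:
        (valid if value.strip() in allowed else invalid).append(value)
    return valid, invalid


if __name__ == "__main__":
    entries = ["Homo sapiens", "Homo sapeins", " Mus musculus "]
    print(split_valid_invalid(entries))
```

Note that such a check only catches typographic errors. A user who picks a valid but wrong term still passes validation, which is the mis-selection problem mentioned above.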
A controlled terminology is a normative collection of terms, the spelling of which is fixed and for which additional information may be provided, such as a definition, a set of synonyms, an editor, a version, as well as a license determining the conditions of use. The set of information about a specific controlled terminology term is designated as term metadata. In a controlled terminology, terms appear as a flat list, meaning that no relationship between any of the entities the controlled terminology represents is captured in any formal way.
This is the main drawback and limitation of controlled terminologies, which are often developed to support a data model or an application.
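As a rough sketch of what "a flat list of terms with term metadata" can look like in practice, the Python below models a term as a small record. The `Term` class, its field names and the two example terms are assumptions made for illustration; they do not come from any real terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One controlled-terminology term and its metadata (hypothetical field names)."""
    label: str              # fixed spelling of the term
    definition: str = ""
    synonyms: tuple = ()
    editor: str = ""
    version: str = ""
    license: str = ""


# A flat list: no relationship between terms is recorded anywhere.
TERMINOLOGY = [
    Term(label="dolphin", definition="An aquatic mammal.", synonyms=("dolphins",),
         version="0.1", license="CC0 1.0"),
    Term(label="pigeon", definition="A bird.", synonyms=("pigeons",),
         version="0.1", license="CC0 1.0"),
]


def find_term(name, terms=TERMINOLOGY):
    """Look a term up by label or synonym, ignoring case and surrounding whitespace."""
    wanted = name.strip().lower()
    for term in terms:
        if wanted == term.label.lower() or wanted in (s.lower() for s in term.synonyms):
            return term
    return None


if __name__ == "__main__":
    print(find_term("Dolphins"))
```

Nothing in this structure says that a dolphin and a pigeon are both vertebrates, which is exactly the limitation described above.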
An ontology, on the other hand, is a formal representation of domain knowledge where concepts are organized hierarchically. The qualifier formal refers to a set of axioms and rules based on logic (e.g. first order logic) used to structure, organize and check the consistency of the term hierarchy. As one can sense right away, ontologies are often more sophisticated artefacts, supported by more advanced theoretical frameworks and dedicated tools to develop them (e.g. Protégé, TopBraid Composer, the OBO Foundry's INCATools or the ROBOT tool).
In order to improve on simple controlled terminologies, a huge area of research has developed to provide tools and frameworks supporting the representation of relationships between entities. The field is known as formal semantics in knowledge representation circles. One of the most immediately available examples of entity relationships and their potential for improving searches is the is_a relationship, which aims to cover the parent/child relationship that holds between two entities. For instance:
-Vertebrate
--mammal
---dolphin
--bird
---pigeon
In this representation, classes are directly asserted (placed) under a parent class if and only if the rule 'new class is a child of the parent class' holds. 'Orchids', which are not vertebrates, cannot be placed in this hierarchy.
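To show how the is_a relationship can improve searches, here is a minimal Python sketch of query expansion over the small hierarchy above. The `IS_A` dictionary encoding and the function names are assumptions chosen for illustration; real ontologies are not usually stored this way.

```python
# Child -> parent assertions for the example hierarchy (single parent per class).
IS_A = {
    "mammal": "vertebrate",
    "bird": "vertebrate",
    "dolphin": "mammal",
    "pigeon": "bird",
}


def descendants(term, is_a=IS_A):
    """Collect every class found below `term` by following is_a links downwards."""
    found = set()
    frontier = [term]
    while frontier:
        parent = frontier.pop()
        for child, asserted_parent in is_a.items():
            if asserted_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def expand_query(term, is_a=IS_A):
    """Return the search term together with all of its subclasses."""
    return {term} | descendants(term, is_a)


if __name__ == "__main__":
    print(sorted(expand_query("mammal")))
```

A search for "mammal" expanded this way also retrieves records annotated with "dolphin", which a flat keyword list could not do.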
While working on small structured vocabularies, it is still possible to detect potential errors manually, but this approach does not scale to real-life semantic artefacts that support complex biological and biomedical information systems. Languages (RDF, SKOS, OWL) exist to provide the expressivity required to axiomatize relations between entities. In turn, building on these formal rules and associated provers, automatic classifiers known as reasoners can inspect semantic artefacts to detect inconsistencies and to suggest parent classes. This step is known as 'inference': new knowledge is produced by the software agent rather than by direct human assertion. This provides significant support, even if it is far from capturing all the subtleties of actual knowledge.
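The following toy Python sketch gives a feel for what inference and consistency checking mean. It is emphatically not an OWL reasoner: it only computes the transitive closure of asserted is_a links and flags classes that end up under two classes declared disjoint. The data, the deliberately wrong "flying dolphin" class and all names are made up for illustration.

```python
# Asserted parents only; "flying dolphin" is a deliberately inconsistent class.
ASSERTED_IS_A = {
    "dolphin": {"mammal"},
    "pigeon": {"bird"},
    "mammal": {"vertebrate"},
    "bird": {"vertebrate"},
    "flying dolphin": {"dolphin", "bird"},
}

# Pairs of classes assumed to share no members.
DISJOINT = [("mammal", "bird")]


def inferred_ancestors(cls, is_a=ASSERTED_IS_A):
    """Infer every ancestor of `cls` by following asserted is_a links transitively."""
    seen = set()
    stack = list(is_a.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, ()))
    return seen


def inconsistent_classes(is_a=ASSERTED_IS_A, disjoint=DISJOINT):
    """List classes that fall under both members of a disjoint pair."""
    problems = []
    for cls in is_a:
        lineage = inferred_ancestors(cls, is_a) | {cls}
        if any(a in lineage and b in lineage for a, b in disjoint):
            problems.append(cls)
    return problems


if __name__ == "__main__":
    print(inferred_ancestors("dolphin"))
    print(inconsistent_classes())
```

Here "dolphin is a vertebrate" is never asserted directly; it is inferred. Actual reasoners working on OWL axioms go much further than this closure-plus-disjointness check.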
There are 6 important features to consider when selecting a semantic artefact for making FAIR datasets:
1. What license and terms of use does it mandate?
2. What format does it come in?
3. Is it well maintained? i.e. frequent releases, handling of term requests, and clear versioning and deprecation policies.
4. Are there stable persistent resolvable identifiers for all terms?
5. Who uses it and what resources are being annotated with it?
6. Is it well documented? There should be enough metadata for each class in the artefact and enough metadata about the artefact itself.
As outlined in the introduction, the most immediate use for a controlled terminology is to ensure consistency in data entry. But the usefulness of ontologies and controlled vocabularies goes beyond this initial use. The main purpose of biomedical ontologies is to structure knowledge so it can be operated on by software agents.
One needs to also understand that the two processes coexist and operate in parallel. As more experiments are performed, new discoveries are made. This new knowledge needs to be represented in the domain ontology so the new notions can be used to annotate the results of earlier experiments in the context of retrospective analysis.
For example, the Gene Ontology (GO) is a widely used resource to describe Molecular Functions, Biological Processes and Cellular Components. The Gene Ontology Consortium maintains the controlled vocabulary itself but also releases genome-wide Gene Ontology annotations. These are resources which associate genes and genomic features found in those genomes with GO terms. They are very important resources, especially in the context of genome-wide analyses such as transcriptomics profiling.
A particular type of analysis, enrichment analysis, relies on the availability of such annotations to detect departures from the expected probability distribution in an expression profile and to identify which biological processes are most affected in specific conditions.
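As a hedged illustration of the statistics involved, the sketch below computes an over-representation p-value with a one-sided hypergeometric test, one common choice for enrichment analysis (the recipe itself does not prescribe a specific test). The function name and parameter names are assumptions, and no multiple-testing correction is applied.

```python
from math import comb  # available from Python 3.8


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a random sample.

    population: number of genes in the background set
    annotated:  number of background genes annotated with the GO term
    sample:     number of genes in the list of interest (e.g. differentially expressed)
    hits:       number of genes in the list that carry the GO term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Illustrative numbers only: 40 of 20000 genes carry the term, 8 of 200 selected genes do.
    print(hypergeom_enrichment_p(20000, 40, 200, 8))
```

A small p-value suggests the term is over-represented in the gene list. In a real analysis this would be repeated for every GO term and corrected for multiple testing.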
The applications are plentiful. The importance of ontologies for structuring information will only grow with the need to obtain machine-learning-ready datasets and to speed up the readiness of datasets. This is what FAIR is all about.
There is no simple answer to that question, as it depends heavily on the type of tasks data scientists have in mind. If the purpose is simply to improve query recall on a limited set of fields, a curation policy could be devised to mix and match resources to meet the needs at hand, possibly by building an application ontology.
However, in a more integrated framework, it is important to be aware of some of the development choices made by the maintainers of the semantic artefacts.
In the context of basic research and model organism based research, the OBO Foundry is an organization which coordinates the development of interoperable resources. GO, mentioned earlier, is one of them. The establishment of domain-specific reference ontologies sharing the same underlying rules means that some level of compositional development is possible: axioms can be built connecting classes from compatible resources.
This point becomes particularly important when considering the role of reasoners in checking the consistency of the artefacts themselves, but also in analysing instance datasets.
In the context of observational studies, the OMOP model also relies on controlled terminologies such as SNOMED-CT, RxNorm for drugs and LOINC for clinical and laboratory test descriptions.
In the context of clinical data collections, the CDISC models work tightly with CDISC Terminology and the National Cancer Institute's Enterprise Vocabulary Services (EVS), and also recommend the use of SNOMED-CT and terminologies such as LOINC, both of which come with specific licensing terms users need to get familiar with.
The use and implementation of common terminologies will enable a normalization/harmonization of variable labels (data label) and allowed values (data term) when querying a database. Implementing the use of common terminologies in the curation workflow will ensure consistency of the annotation across all studies.
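The normalization of variable labels and allowed values described above can be sketched as two lookup tables applied during curation. Everything in this Python example is hypothetical: the mapping tables, the labels and the terms are illustrative and not taken from any specific standard.

```python
# Hypothetical mapping from labels seen in source files to harmonized variable labels.
VARIABLE_LABELS = {
    "organism": "organism",
    "species": "organism",
    "tissue": "tissue",
    "organism part": "tissue",
}

# Hypothetical mapping from raw values to allowed terms, per harmonized label.
ALLOWED_VALUES = {
    "organism": {"human": "Homo sapiens", "homo sapiens": "Homo sapiens",
                 "mouse": "Mus musculus", "mus musculus": "Mus musculus"},
    "tissue": {"liver": "liver", "hepatic tissue": "liver"},
}


def harmonize_record(record):
    """Map one record to harmonized labels and terms; report fields that could not be mapped."""
    harmonized, unmapped = {}, []
    for raw_label, raw_value in record.items():
        label = VARIABLE_LABELS.get(raw_label.strip().lower())
        if label is None:
            unmapped.append(raw_label)
            continue
        value = ALLOWED_VALUES[label].get(str(raw_value).strip().lower())
        if value is None:
            unmapped.append(raw_label)
            continue
        harmonized[label] = value
    return harmonized, unmapped


if __name__ == "__main__":
    print(harmonize_record({"Species": "Human", "Tissue": "Hepatic tissue", "batch": "7"}))
```

Fields reported as unmapped would go back to a curator, who either corrects the entry or requests a new term from the terminology maintainers.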
A set of widely accepted criteria for selecting terminologies (or other reporting standards) does not exist. However, the initial work by the Clinical and Translational Science Awards' (CTSA) Omics Data Standards Working Group and BioSharing (http://jamia.bmj.com/content/early/2013/10/03/amiajnl-2013-002066.long) has been used as a starting point to define possible criteria for excluding and/or including a terminology resource.
Exclusion criteria:
Inclusion criteria:
These criteria are simply indicative and need to be modulated depending on the contexts described in the introduction, as specific constraints (e.g. regulatory requirements) may take precedence over some of the criteria listed here.
Choosing ontologies and semantic resources is a complex issue that requires careful consideration, taking into account the research context of the data production workflow and any regulatory requirements that may apply. The choices made affect the integrative potential of a dataset, as they influence its level of interoperability.
Clearly declaring the semantic resources used to annotate a dataset also influences findability and reusability, and it is good practice to do so.
What to read next?
@proccaserra That's more than a starting point! Please tell me you didn't write that in the last couple of hours! ;)
That's amazing! We can definitely build upon this recipe. Thank you!
@daniwelter No! this is a collaboration with our NIH colleagues but we can expand on this work and dig a bit deeper, possibly connecting to the query expansion / (bio)solr search we discussed earlier this year ..that would be cool
Hi, we received a use case from the NIST genome editing consortium.
It would be nice to produce a recipe about why to use ontologies, what an ontology is, etc.
@proccaserra @daniwelter @ereynrs cc the FAIRplus ontologists, any suggestions for the recipe?