FAIRplus / the-fair-cookbook

The FAIR cookbook, containing recipes to make your data more FAIR. Find the rendered version on:
https://faircookbook.elixir-europe.org/

New recipe request, what is ontology and why use ontologies #113

Closed FuqiX closed 2 years ago

FuqiX commented 4 years ago

Hi, we received a use case from the NIST genome editing consortium.

I want to convince NIST that it's useful to think about ontologies, and they should comply with standards for EBI deposition later on.

It would be nice to produce a recipe about why to use ontologies, what an ontology is, etc.

@proccaserra @daniwelter @ereynrs cc the FAIRplus ontologists, any suggestions for the recipe?

proccaserra commented 4 years ago

@FuqiX I can see several angles but we'd need more background information. Can you provide more details about the requests from NIST or interactions so far?

Regarding the recipes, I have generated a recipe about how to select ontologies. I know you saw it and you pointed me to the following material:

As to 'Why to use ontologies', we touched upon this, but it could be expanded. Here we should link with @oyadenizbeyan @iemam, as this relates to the CMM.

ereynrs commented 4 years ago

I see the recipes as sets of steps depicting how to do something. Maybe some kind of introduction/background section is needed, describing some foundational artifacts (e.g., ontologies) and WHY they are FAIR principles enablers.

FuqiX commented 4 years ago

Hi,

TL;DR We need a recipe/introductory section about the importance of ontologies, one that helps users identify their ontology use cases and find the correct recipes for those use cases.

Background info:

The NIST consortium is building CRISPR data standards, which include related ontology recommendations, checklists, and other FAIR practices. This is similar to the transcriptomics data standards we worked with.

As Emiliano pointed out, they are looking for foundational artefacts, like ontologies and schemas, that are related to the FAIR principles. Maybe we need another introductory recipe/section of the book on the importance of FAIR data, which links out to the individual sections of the cookbook.

Currently, our discussion focuses on ontologies. The first need is to provide an introduction to the importance and benefits of using ontologies (that's the recipe I am requesting in this issue). Also, for people who are not familiar with ontologies, it would be nice to provide them with a general picture of this domain.

We have an ontology recipe collection in the cookbook. However, it is not helpful unless we can also pull together a description of how the recipes fit together. To discuss this further, Helen suggested we could add reciprocal links in OLS and point people to the ontology recipes. The challenge is that our recipes sit at a very granular level of detail and it is hard to navigate through the collection. Some pieces are missing, and the recipes are not aligned with use cases or the FAIRification process.

@mcourtot Please feel free to add more details.

daniwelter commented 4 years ago

Hi @FuqiX, @proccaserra,

I agree that we could do with a more general introductory section on ontologies and I'd be happy to contribute to this (as long as it can wait until after the vF2F). I'm not sure it necessarily needs to be a full "Ontology 101" though. Plenty of excellent work has been done in that respect and the introduction should link out to existing papers, blogs and guides that cover the subject and tie it all together in the context of F+.

proccaserra commented 4 years ago

@daniwelter @FuqiX I have the following we can probably build upon:

Controlled Terminologies & Ontologies

authors: Philippe Rocca-Serra

maintainers: Philippe Rocca-Serra

version: initial draft

license: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication


Objectives:

The aim of this recipe is to provide a compact introduction to controlled terminologies and ontologies, why these resources are central to the preservation of knowledge and to data mining, and how such resources are developed.

Controlled terminology or ontology: what's the difference?

The need for a controlled vocabulary often arises in situations where validation of textual information is necessary for operational requirements. The main initial driver for data entry harmonization is to increase query recall. In its most basic form, keywords may be used to perform indexation. However, if relying on user input alone, the chances of typographic errors increase with the number of users. These unavoidable events accumulate over time and end up hurting the accuracy of search results, and this is the reason for offering sets of predefined values: it reduces the noise. However, this can come at the cost of precision, as the predefined terms may not cover the exact thing users need to describe. Furthermore, term mis-selection by users is not eliminated and introduces another type of error.
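The idea of validating data entry against a set of predefined values can be sketched in a few lines. This is a minimal, illustrative sketch (the vocabulary and function names are made up, not from any real tool): entries are normalized and rejected when they fall outside the controlled list, so typographic variants never reach the database.

```python
# Minimal sketch: validating free-text data entry against a controlled
# vocabulary to reduce typographic noise. Vocabulary and names are illustrative.

CONTROLLED_TERMS = {"homo sapiens", "mus musculus", "danio rerio"}

def validate_entry(value: str) -> str:
    """Return the normalized term, or raise if it is not in the vocabulary."""
    normalized = value.strip().lower()
    if normalized not in CONTROLLED_TERMS:
        raise ValueError(f"'{value}' is not a controlled term")
    return normalized

print(validate_entry("  Homo Sapiens "))   # homo sapiens
# validate_entry("Homo Sapien")  # typo -> would raise ValueError
```

Note that such a check addresses typographic noise but not the precision and mis-selection issues described above: those require richer term metadata and curation.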

A controlled terminology is a normative collection of terms, the spelling of which is fixed and for which additional information may be provided, such as a definition, a set of synonyms, an editor, a version, as well as a license determining the conditions of use. The set of information about a specific controlled terminology term is designated as term metadata. In a controlled terminology, terms appear as a flat list, meaning that no relationship between any of the entities the controlled terminology represents is captured in any formal way. This is the main drawback and limitation of controlled terminologies, which are often developed to support a data model or an application.
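The term metadata described above can be pictured as a simple record. This is a hypothetical sketch (the `ControlledTerm` class and its field values are illustrative, not part of any standard): one fixed label plus the supporting metadata, with no relationships between terms.

```python
from dataclasses import dataclass

# Hypothetical sketch of term metadata in a controlled terminology:
# a fixed label plus definition, synonyms, editor, version and license.
@dataclass(frozen=True)
class ControlledTerm:
    label: str
    definition: str
    synonyms: tuple = ()
    editor: str = ""
    version: str = ""
    license: str = ""

term = ControlledTerm(
    label="dolphin",
    definition="A toothed whale of the family Delphinidae.",  # illustrative
    synonyms=("Delphinidae",),
    editor="example editor",
    version="0.1",
    license="CC0 1.0",
)
print(term.label)   # dolphin
```

A flat list of such records captures spelling and metadata, but, as noted above, nothing about how the terms relate to one another.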

An ontology, on the other hand, is a formal representation of domain knowledge where concepts are organized hierarchically. The qualifier 'formal' refers to a set of axioms and rules based on logic (e.g. first-order logic) to structure, organize and check the consistency of the term hierarchy. As one can sense right away, ontologies are often more sophisticated artefacts, supported by more advanced theoretical frameworks and dedicated tools to develop them (e.g. Protégé, TopBraid Composer, the OBO Foundry INCAtools, or ROBOT).

How are they built and maintained and why does it matter?

In order to improve over simple controlled terminologies, a huge area of research has developed to provide tools and frameworks supporting the representation of relationships between entities. The field is known as formal semantics in knowledge representation circles. One of the most immediately available examples of entity relationships, and of their potential for improving searches, is the is_a relationship, which covers the parent/child relation that holds between two entities. For instance:

-Vertebrate
--mammal
---dolphin
--bird
---pigeon

In this representation, classes are directly asserted (placed) under a parent class if and only if the new class is a child of the parent class. 'Orchid', for instance, is not a vertebrate and therefore has no place in this hierarchy.
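The value of the is_a relationship for search can be sketched with a toy example (the hierarchy below mirrors the one above; the function name is illustrative): a query for a parent term is expanded to all of its transitive descendants, so records annotated with 'dolphin' are found by a search for 'mammal'.

```python
# Sketch of query expansion over an is_a hierarchy: searching for a class
# should also match records annotated with any of its descendants.
IS_A = {                      # child -> parent (toy hierarchy from the text)
    "mammal": "vertebrate",
    "dolphin": "mammal",
    "bird": "vertebrate",
    "pigeon": "bird",
}

def descendants(term: str) -> set:
    """All classes that are transitively is_a `term`, plus the term itself."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

print(sorted(descendants("mammal")))   # ['dolphin', 'mammal']
```

This simple transitive closure is the mechanism behind the improved recall mentioned earlier: one controlled query term stands in for a whole subtree of more specific annotations.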

While working on small structured vocabularies, it is still possible to detect potential errors by hand, but this approach does not scale to the real-life semantic artefacts which support complex biological and biomedical information systems. Languages (RDF, SKOS, OWL) exist to provide the expressivity required to axiomatize relations between entities. In turn, building on these formal rules and the associated provers, automatic classifiers known as reasoners can inspect semantic artefacts to detect inconsistencies and to suggest parent classes. This step is known as 'inference', where new knowledge is produced by the software agent rather than asserted directly by humans. This provides significant support, even if it falls far short of capturing all the subtleties of actual knowledge.
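One very small analogy of what a reasoner does can be sketched in plain code (this is an illustration only; real reasoners such as HermiT or ELK check far richer OWL axioms): an is_a hierarchy must be acyclic, so a chain of parent links that loops back on itself is an inconsistency a machine can find mechanically.

```python
# Toy analogy of one consistency check a reasoner performs: an is_a
# hierarchy must be acyclic. Real OWL reasoners (HermiT, ELK) check far
# richer axioms; this sketch only walks parent links looking for a loop.
def find_cycle(is_a: dict) -> list:
    """Return a cyclic chain of classes if one exists, else an empty list."""
    for start in is_a:
        seen = [start]
        current = start
        while current in is_a:
            current = is_a[current]
            if current in seen:
                return seen[seen.index(current):] + [current]
            seen.append(current)
    return []

consistent = {"dolphin": "mammal", "mammal": "vertebrate"}
broken = {"mammal": "vertebrate", "vertebrate": "mammal"}

print(find_cycle(consistent))  # []
print(find_cycle(broken))      # ['mammal', 'vertebrate', 'mammal']
```

The point is not the check itself but the scaling argument: a human can spot this loop in a two-term vocabulary, while a reasoner applies such checks (and much stronger logical ones) across tens of thousands of axioms.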

There are six important features to consider when selecting a semantic artefact for making FAIR datasets:

1. What license and terms of use does it mandate?

2. What format does it come in?

3. Is it well maintained? i.e. frequent releases, term request handling, and clear versioning and deprecation policies.

4. Are there stable persistent resolvable identifiers for all terms?

5. Who uses it, and what resources are being annotated with it?

6. Is it well documented? There should be enough metadata for each class in the artefact, and enough metadata about the artefact itself.

Why are they useful?

As outlined in the introduction, the most immediate use of a controlled terminology is to ensure consistency in data entry. But the usefulness of ontologies and controlled vocabularies goes beyond this initial use: the main purpose of biomedical ontologies is to structure knowledge so it can be operated on by software agents.

One also needs to understand that the two processes coexist and operate in parallel: as more experiments are performed, new discoveries are made. This new knowledge needs to be represented in the domain ontology so that the new notions can be used to annotate the results of earlier experiments in the context of retrospective analyses.

For example, the Gene Ontology (GO) is a widely used resource to describe Molecular Functions, Biological Processes and Cellular Components. The Gene Ontology Consortium maintains the controlled vocabulary itself, but also releases Genome-Wide Gene Ontology Annotations: resources which associate the genes and genomic features found in those genomes with GO terms. These are very important resources, especially in the context of genome-wide analyses such as transcriptomics profiling. One particular type of analysis, enrichment analysis, relies on the availability of such annotations to detect departures from the expected probability distribution in an expression profile, and thereby which biological processes are most affected in specific conditions.
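The statistic behind many enrichment tools is a one-sided hypergeometric test, which can be sketched with the standard library alone (the gene counts below are made up for illustration): given a genome of N genes of which K carry a GO term, how surprising is it to see k or more annotated genes among n differentially expressed ones?

```python
from math import comb

# Sketch of the enrichment test described above: a one-sided hypergeometric
# test, the statistic used by many GO enrichment tools. Numbers are made up.
def enrichment_pvalue(N: int, K: int, n: int, k: int) -> float:
    """P(observing >= k annotated genes) when drawing n genes at random
    from a population of N genes of which K carry the GO term."""
    total = comb(N, n)
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / total

# Illustrative numbers: 20,000 genes, 200 annotated with a term; 50
# differentially expressed genes, 8 of which carry the term.
p = enrichment_pvalue(20000, 200, 50, 8)
print(f"p = {p:.2e}")   # a tiny p-value: the term is strongly enriched
```

With only 0.5 annotated genes expected by chance (50 × 200 / 20,000), observing 8 yields a very small p-value, which is how the annotation resources make biological signal in an expression profile detectable.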

The applications are plentiful. The importance of ontologies for structuring information will only grow with the need to obtain machine-learning-ready datasets and to speed up the readiness of datasets. This is what FAIR is all about.

Are all ontologies compatible with each other?

There is no simple answer to that question, as it depends heavily on the type of tasks data scientists have in mind. If the purpose is simply to improve query recall on a limited set of fields, a curation policy could be devised to mix and match resources to meet the needs at hand, possibly by building an application ontology.

However, in a more integrated framework, it is important to be aware of some of the development choices made by the maintainers of the semantic artefacts.

Use Cases and Iterative Approach

The use and implementation of common terminologies enables normalization/harmonization of variable labels (data labels) and allowed values (data terms) when querying a database. Implementing the use of common terminologies in the curation workflow will ensure consistency of the annotation across all studies.

Selection Criteria

A set of widely accepted criteria for selecting terminologies (or other reporting standards) does not exist. However, the initial work by the Clinical and Translational Science Awards' (CTSA) Omics Data Standards Working Group and BioSharing (http://jamia.bmj.com/content/early/2013/10/03/amiajnl-2013-002066.long) has been used as a starting point to define possible criteria for excluding and/or including a terminology resource.

These criteria are simply indicative and need to be modulated depending on the contexts described in the introduction, as specific constraints (e.g. regulatory requirements) may take precedence over some of the criteria listed here.


Conclusions:

Choosing ontologies and semantic resources is a complex issue which requires careful consideration, taking into account the research context of the data production workflow and any regulatory requirements that may apply. The choices made affect the integrative potential of a dataset, as they influence its level of interoperability. Clearly declaring the semantic resources used to annotate a dataset also influences its findability and reusability, and it is good practice to do so.

What to read next?


Bibliography:

  1. RDF. https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/
  2. SKOS. https://www.w3.org/2004/02/skos/
  3. OWL. https://www.w3.org/OWL/
  4. Hermit. http://www.hermit-reasoner.com/
  5. Elk. http://www.cs.ox.ac.uk/isg/tools/ELK/
  6. OBO Foundry. http://obofoundry.org/
  7. CDISC. https://www.cdisc.org/standards
  8. CDISC Controlled Terminology. https://www.cdisc.org/standards/terminology
  9. LOINC. https://loinc.org/
  10. Gene Ontology. http://geneontology.org/
  11. Protégé. https://protege.stanford.edu/
  12. Topbraid composer. https://www.topquadrant.com/products/topbraid-composer/
  13. INCAtools. https://github.com/INCATools
  14. ROBOT. R.C. Jackson, J.P. Balhoff, E. Douglass, N.L. Harris, C.J. Mungall, and J.A. Overton. ROBOT: A tool for automating ontology workflows. BMC Bioinformatics, vol. 20, July 2019.
daniwelter commented 4 years ago

@proccaserra That's more than starting point! Please tell me you didn't write that in the last couple of hours! ;)

FuqiX commented 4 years ago

That's amazing! We can definitely build upon this recipe. Thank you!

proccaserra commented 4 years ago

@daniwelter No! This is a collaboration with our NIH colleagues, but we can expand on this work and dig a bit deeper, possibly connecting to the query expansion / (bio)solr search we discussed earlier this year... that would be cool

proccaserra commented 2 years ago

https://w3id.org/faircookbook/FCB019