emhart / 10-simple-rules-data-storage

A repository for the 10 simple rules data sharing paper to be submitted to PLoS Comp Biology
Creative Commons Zero v1.0 Universal

Metadata describing data should use an ontology #22

Closed lindenb closed 9 years ago

lindenb commented 9 years ago

I'd like to cite the work of the semantic-web people: they have built ontologies describing data. For example:

EDAM : "EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats." http://www.ncbi.nlm.nih.gov/pubmed/23479348

A dataset with an RDF description would be discoverable by robots. In the best of all possible worlds, articles and datasets would use this kind of ontology.

snim2 commented 9 years ago

This is really interesting. May I ask how you would expect this to be used? The Semantic Web people would expect their ontologies to be used by automated tools which might "discover" and use the data in some way -- perhaps by formatting it appropriately, returning it in search queries, combining it with other data, etc.

In terms of reproducible science I'm not sure how useful that is, but certainly a consistent, agreed format for metadata sounds like a very good idea. Perhaps you had an idea for how these sorts of ontologies could be used that I haven't thought of?

lindenb commented 9 years ago

@snim2 I was looking for an example of a project using EDAM, so I asked for one via Twitter:

https://twitter.com/yokofakun/status/570165703959052288

I received a direct mail from 'jison' (EBI). I copy his mail below:

EDAM underpins the ELIXIR Tools & Data Services Registry:

https://elixir-registry.cbs.dtu.dk/#/

ELIXIR is the European infrastructure for biological information: www.elixir-europe.org/

This is the main use-case currently, but there are others, e.g. EDAM is used internally by EMBL-EBI to characterise its tools. It's also being consumed by other ontologies, e.g. eagle-I, SWO, and so crops up in other contexts.

people can register new resources from https://elixir-registry.cbs.dtu.dk/#/signup

So, from this example, people don't have to deal with RDF/EDAM by themselves but can register their resources. Those resources are then handled as an RDF+EDAM triple-store = discoverable by robots.

PBarmby commented 9 years ago

Maybe this is not an issue for readers of PLOS Comp Bio, but I had never heard the term "ontology" before and am still having trouble getting my head around what it means. Some unpacking might be needed.

snim2 commented 9 years ago

Hi @lindenb - that's very interesting, I found a nice paper on Elixir here: http://www.sciencedirect.com/science/article/pii/S0167779912000170 and it seems as if these systems are going after automated tools. Pages 13-15 here: http://www.infosys.com/infosys-labs/publications/Documents/SETLabs-briefings-healthcare-delivery.pdf#page=15 suggest some possible applications (drug discovery and development). I'm not qualified to evaluate those applications, but it would be interesting to see whether anyone has used the data in this way.

lindenb commented 9 years ago

@PBarmby see an ontology as a controlled vocabulary to describe things. Just like GeneOntology (GO): you wouldn't write "Gene X is involved in cardiac chamber morphogenesis" or "Gene X has a role in the morphogenesis of the heart", because it would be hell to retrieve the semantics of the information. Instead, people are using (or should use) GO: "Gene X: GO:0003206" http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0003206#term=info
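A toy sketch of that retrieval problem, in Python. The gene names and annotation records are made up for illustration; only the term ID GO:0003206 comes from the comment above.

```python
# Hypothetical records: the same biological fact written two ways in free text.
free_text = {
    "geneX": "is involved in cardiac chamber morphogenesis",
    "geneY": "has a role in the morphogenesis of the heart",
}

# The same (hypothetical) genes annotated with controlled-vocabulary term IDs.
go_annotations = {
    "geneX": {"GO:0003206"},
    "geneY": {"GO:0003206"},
    "geneZ": {"GO:0008150"},
}


def genes_matching_text(phrase):
    """Naive substring search over the free-text descriptions."""
    return sorted(g for g, text in free_text.items() if phrase in text)


def genes_with_term(term_id):
    """Exact lookup on a shared term identifier."""
    return sorted(g for g, terms in go_annotations.items() if term_id in terms)


# The substring search only finds the gene whose author used that exact wording.
print(genes_matching_text("cardiac chamber morphogenesis"))
# The term lookup finds every gene annotated with the shared ID.
print(genes_with_term("GO:0003206"))
```

The free-text search depends on every author choosing the same wording; the term lookup does not.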

snim2 commented 9 years ago

Just to add to that, the really big deal about ontologies is that they are machine readable. Sometimes they aren't so easy for humans to read!

lindenb commented 9 years ago

@snim2 my impression is that ontologies are not heavily used :-) Describing the data in Elixir is really nice, but I wonder if it's in the scope of your paper. My original idea was about the way to describe the metadata: don't write "This data is a BAM file", but rather provide an RDF(?)/N3 file with rdf:type = "http://www.ebi.ac.uk/ontology-lookup/?termId=format%3A2572"
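A minimal sketch of what such a typed description could look like, written with plain Python strings rather than an RDF library. It emits N-Triples (a simpler line-based RDF syntax, swapped in here for N3), the dataset URI is a hypothetical placeholder, and the format identifier is copied verbatim from the comment above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Format identifier copied verbatim from the comment above.
BAM_FORMAT = "http://www.ebi.ac.uk/ontology-lookup/?termId=format%3A2572"
# Hypothetical dataset location, for illustration only.
DATASET = "http://example.org/data/sample1.bam"


def triple(subject, predicate, obj):
    """Serialise one statement made of three IRIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe(dataset_uri, format_uri):
    """State that the dataset has the given format as its rdf:type."""
    return triple(dataset_uri, RDF_TYPE, format_uri)


def has_type(ntriples_line, format_uri):
    """Check a line by exact predicate/object match, not free-text search."""
    parts = ntriples_line.rstrip(" .").split(" ")
    return parts[1] == f"<{RDF_TYPE}>" and parts[2] == f"<{format_uri}>"


line = describe(DATASET, BAM_FORMAT)
print(line)
print(has_type(line, BAM_FORMAT))
```

A robot reading this line does not need to parse an English sentence: it compares identifiers.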

snim2 commented 9 years ago

@lindenb This strikes me as a really good idea, but I can imagine that some people would find it hard to see the benefit. I'll keep looking around for any case studies or use cases that we could use.

lindenb commented 9 years ago

@PBarmby FYI: an astronomical ontology: http://www.astro.umd.edu/~eshaya/astro-onto/classes/galaxy.html

a galaxy is a sub-class of : extragalacticObject, rotatingBody, source

you could ask a semantic database (e.g. http://jena.apache.org/documentation/tdb/ ): "search all the objects with type=rotatingBody and type=extragalacticObject"
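A rough stand-in for that kind of query, using plain Python dictionaries instead of a real triple store or SPARQL. The sub-class links for galaxy are the ones listed above; the "planet" class and the two named objects are hypothetical additions for illustration.

```python
# Sub-class links. "galaxy" is taken from the comment above;
# "planet" is a hypothetical extra class added for contrast.
SUBCLASS_OF = {
    "galaxy": {"extragalacticObject", "rotatingBody", "source"},
    "planet": {"rotatingBody"},
}

# Hypothetical catalogue entries, each declared with a single class.
OBJECTS = {"M31": "galaxy", "Jupiter": "planet"}


def all_types(cls):
    """Return cls plus every class reachable through sub-class links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def search(*required_types):
    """Find objects whose inferred types include all the required ones."""
    wanted = set(required_types)
    return sorted(
        name for name, cls in OBJECTS.items() if wanted <= all_types(cls)
    )


print(search("rotatingBody", "extragalacticObject"))
print(search("rotatingBody"))
```

Neither object was declared as a rotatingBody directly; the match comes from the sub-class links, which is the part a semantic database does for you.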

emhart commented 9 years ago

I'll just chime in here as someone who used to work with ontologies and semantics. It's undoubtedly very important to share data with an ontology and semantics. However, many fields are still just developing robust metadata standards and getting researchers familiar with them (I'll point the finger at my own field of ecology). I see ontologies and semantics as further in the future for most researchers. Also, ontologies are really more a function of data publication than of storage, so less relevant to this paper, I believe.

(Sent by email in reply to Pierre Lindenbaum's comment of Tue, Feb 24, 2015, the astronomy ontology example above.)

tpoisot commented 9 years ago

But not all data types / fields have ontologies. I agree this is important, but we should also include a call for people to create them.

lindenb commented 9 years ago

@tpoisot I agree, but most data & fields have a super-type (binary data, tabular data, "this-is-about-ecology", linked-to-this-paper); see http://bioportal.bioontology.org/, NCBI MeSH ...

Again, I agree that writing a semantic description is difficult and probably useless for now. I don't use it myself. But I would recommend this as a best practice rather than a plain text file describing your data ("metadata is a love note to the future").

For example, see how people are using the DOAP ("Description Of A Project") ontology to describe their projects: https://github.com/search?utf8=%E2%9C%93&q=doap++language%3AXML+extension%3Ardf&type=Code&ref=advsearch&l=XML
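A minimal sketch of what a DOAP file contains, built with Python's standard `xml.etree.ElementTree`. The choice of the three properties (`name`, `homepage`, `shortdesc`) and the description text are assumptions made for illustration; real DOAP files usually carry more.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOAP_NS = "http://usefulinc.com/ns/doap#"


def doap_project(name, homepage, shortdesc):
    """Build a small RDF/XML document describing one project with DOAP."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("doap", DOAP_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    project = ET.SubElement(root, f"{{{DOAP_NS}}}Project")
    ET.SubElement(project, f"{{{DOAP_NS}}}name").text = name
    # The homepage is a link to a resource, not a text literal.
    ET.SubElement(
        project,
        f"{{{DOAP_NS}}}homepage",
        {f"{{{RDF_NS}}}resource": homepage},
    )
    ET.SubElement(project, f"{{{DOAP_NS}}}shortdesc").text = shortdesc
    return ET.tostring(root, encoding="unicode")


print(
    doap_project(
        "10-simple-rules-data-storage",
        "https://github.com/emhart/10-simple-rules-data-storage",
        "Illustrative description text",  # placeholder, not the real description
    )
)
```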

naupaka commented 9 years ago

For ecology there is the Ecological Metadata Language and the Morpho software to help w/ using it. Not a true ontology, but in the right direction at least.

emhart commented 9 years ago

Seems like this issue is bleeding into #11 but I think we need to walk a fine line between best practices for storing data vs best practices for publishing data. In which case I think we should provide guidance about how to best link stored data to metadata, vs saying something like: "Published data sets should have metadata".

I propose we consider merging this with #11 and describing how a stored dataset should be linked to its metadata: what are some easy ways it can be done, and what are some best-practice ways, e.g. ontologies.

jamesmalone commented 9 years ago

Hi, fwiw I thought I'd offer something :) (I'm James Malone, lead ontologist working at EBI). I would say you don't necessarily need ontologies for this. What I would say is: describe your data using agreed semantics where possible. These should be both machine and human readable of course, as someone has already pointed out. There are several examples of reference ontologies, such as the Gene Ontology, that everyone uses - this means you get explicit and shared meaning, and that's the key thing when you tie your data to it. It doesn't have to be an ontology of course: as Google's work on schema.org has shown, having a common set of predicates and reasonably basic types is also incredibly useful. It's the shared meaning that's the important thing, not whether it's an ontology, imho.

joncison commented 9 years ago

Hi folks

Jon here (lead EDAM dev). This is just to underline the points of emhart and James (jamesmalone) and to give an insight from the sharp end.

Getting a robust metadata standard - that is practical, flexible and really serves the community - is the first and hard step, e.g. the schema underlying the ELIXIR Tools & Data Services Registry went through something like a dozen community workshops and 20 versions before being bumped to 1.0.

Then yes, a controlled vocabulary (semantics) where relevant. It is even harder to get right than the metadata standard (for similar reasons) and it could be anything practical: don't get hung up on what an "ontology" is, even ontologists don't agree. The ELIXIR Registry uses a mixture of EDAM (in which bioinformatics concepts have persistent URIs, although the terms themselves may be mutable and have synonyms) and simple, stable enumerations of strings - including terms from SWO (the Software Ontology). The URIs exist forever - and can be resolved, allowing the concepts (and associated terms) to be understood & computed.
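A small sketch of the "persistent URI, mutable term" idea in Python. The concept URI, label and synonyms are invented for illustration and are not real EDAM entries; keeping an old label as a synonym after a rename is an assumption of this sketch, not a documented EDAM policy.

```python
# Hypothetical concept table keyed by a URI that never changes.
CONCEPTS = {
    "http://example.org/concept_0001": {
        "label": "Sequence alignment",
        "synonyms": {"alignment", "sequence comparison"},
    },
}


def resolve(term):
    """Map a preferred label or synonym (case-insensitive) to its URI."""
    needle = term.lower()
    for uri, record in CONCEPTS.items():
        names = {record["label"].lower()}
        names |= {s.lower() for s in record["synonyms"]}
        if needle in names:
            return uri
    return None


def rename(uri, new_label):
    """Change the preferred label; the old label is kept as a synonym."""
    record = CONCEPTS[uri]
    record["synonyms"].add(record["label"])
    record["label"] = new_label


uri = resolve("alignment")
rename(uri, "Alignment of sequences")
# Data annotated with the URI is unaffected by the rename,
# and both the old and the new wording still resolve to it.
print(resolve("Sequence alignment") == uri)
print(resolve("Alignment of sequences") == uri)
```

Annotating data with the URI rather than the label means a later change of wording does not break existing annotations.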