How to publish Open Data from MELODIES

jonblower commented 9 years ago

This is to start a discussion on how we should publish data from the MELODIES project as Open Data. The ideal situation is to publish five-star linked open data where everything is described as RDF, with links to other datasets and vocabularies.

The current list of open data planned from the MELODIES project is on the [EMDESK Wiki]() (perhaps we should move it to GitHub? -> see #5), although we should consider other datasets too in an attempt to identify generically-useful methods.

We consider three levels of information:

1. Individual observations or measurements.
2. Collections of observations/measurements, i.e. datasets.
3. Collections of datasets, i.e. catalogues.

Our goals are:

1. We are obliged to publish datasets in the GEOSS DataCORE. There are various ways to do this, with instructions here. For example, we can submit metadata documents, or provide an OpenSearch endpoint.
2. We would like to appear in the Linked Open Data Cloud, which means describing our datasets with the VoID vocabulary.
3. We would also like to appear in Google searches, which could be achieved by describing data through schema.org, although I'm not sure how this works in practice.
4. We would like to be able to visualise and interact with data at the level of observations (not just datasets), meaning the data themselves must be available on the web in some useful, web-friendly way.
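As a rough illustration of the schema.org goal, dataset metadata can be expressed as JSON-LD and embedded in a dataset's landing page. This is only a sketch using the Python standard library: the dataset name, URLs, bounding box and dates are hypothetical placeholders, and whether a search engine picks the markup up is not something this example can show.

```python
import json


def schema_org_dataset(name, description, url, download_url, box, period):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # ISO 8601 interval, e.g. "2014-01-01/2014-12-31"
        "temporalCoverage": period,
        "spatialCoverage": {
            "@type": "Place",
            # Two corner points of the bounding box, separated by spaces
            "geo": {"@type": "GeoShape", "box": box},
        },
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": "application/x-netcdf",
        },
    }


def as_script_tag(metadata):
    """Wrap the JSON-LD in the script element used to embed it in HTML."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# All values below are made up for illustration.
metadata = schema_org_dataset(
    name="Example land cover dataset",
    description="Hypothetical MELODIES land cover product.",
    url="http://example.org/melodies/dataset/land-cover",
    download_url="http://example.org/melodies/files/land-cover-2014.nc",
    box="35.0 -10.0 45.0 5.0",
    period="2014-01-01/2014-12-31",
)
print(as_script_tag(metadata))
```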

How can we achieve the above? Questions include:

- Which vocabularies/ontologies to use?
- Where should we host the RDF descriptions of datasets? On our own servers, or can we publish them elsewhere? Don't forget that we want to demonstrate geospatial linked data and not all data-hosting sites are geo-enabled.
- How do we publish the data themselves, given that an RDF dump of a large raster dataset is probably not a good idea? And how do we link data files to the metadata descriptions (and vice versa)?
- How do we expose data to interactive web portals (which means interacting at the level of observations, not just datasets)?

Discussion is welcome!

letmaik commented 9 years ago

The first three goals are all at the dataset level. For that we should at a minimum use the established W3C vocabularies DCAT and VoID for describing datasets (I don't think schema.org will be that useful currently, but I may be wrong). DCAT and VoID overlap in their general metadata (both import other vocabularies such as DC and FOAF). The difference between them is that DCAT is for arbitrary datasets (the actual dataset can be any kind of file), while VoID is specifically meant for LOD datasets where the whole dataset consists of RDF triples. This is made obvious by the fact that in VoID you should provide a SPARQL endpoint for the dataset, and may provide triple statistics (counts, basically the dataset size). VoID can also link to an OpenSearch description document, but this is only for free-text search within the dataset.
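To make the VoID side concrete, here is a minimal sketch of such a description, written as JSON-LD with only the Python standard library. The dataset URI, SPARQL endpoint, OpenSearch document URL, title and triple count are all hypothetical; only the `void:` and `dct:` terms come from the vocabularies themselves.

```python
import json

# Hypothetical base address for MELODIES resources (an assumption).
BASE = "http://example.org/melodies"


def void_description(dataset_uri, sparql_endpoint, opensearch_doc, triples):
    """Return a VoID dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "void": "http://rdfs.org/ns/void#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": dataset_uri,
        "@type": "void:Dataset",
        "dct:title": "Example land cover dataset",
        # Where the RDF triples of the dataset can be queried
        "void:sparqlEndpoint": {"@id": sparql_endpoint},
        # Free-text search within the dataset
        "void:openSearchDescription": {"@id": opensearch_doc},
        # Basic statistics: total number of triples
        "void:triples": triples,
    }


void_doc = void_description(
    dataset_uri=BASE + "/dataset/land-cover",
    sparql_endpoint=BASE + "/sparql",
    opensearch_doc=BASE + "/opensearch.xml",
    triples=1250000,  # made-up figure
)
print(json.dumps(void_doc, indent=2))
```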

Some facts:

- DCAT can refer directly to dataset files with given MIME types (the "download URL"), which means that there is no layer within DCAT for describing the available observations (if we use this separation).
- DCAT can also point to an "access URL", which is "A landing page, feed, SPARQL endpoint or other type of resource that gives access to the distribution of the dataset".
- VoID only has a Dataset level; DCAT has Catalog and Dataset ~~(we probably can ignore Catalog)~~.
- DCAT can describe the temporal and spatial extent of a dataset, which will be useful for geo search engines (and for providing an OpenSearch Geo/Time service).
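The facts above can be sketched as a small DCAT catalogue in JSON-LD, again using only the Python standard library. Every URI, title and date here is a hypothetical placeholder, and using `schema:startDate`/`schema:endDate` inside the `dct:PeriodOfTime` node is one common convention rather than the only option.

```python
import json

# Hypothetical base address for MELODIES resources (an assumption).
BASE = "http://example.org/melodies"

CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
}


def dcat_catalog():
    """Return a Catalog -> Dataset -> Distribution description."""
    file_distribution = {
        "@type": "dcat:Distribution",
        # Direct link to a file with a given media type
        "dcat:downloadURL": {"@id": BASE + "/files/land-cover-2014.nc"},
        "dcat:mediaType": "application/x-netcdf",
    }
    service_distribution = {
        "@type": "dcat:Distribution",
        # Landing page, feed, SPARQL endpoint or similar
        "dcat:accessURL": {"@id": BASE + "/sparql"},
    }
    dataset = {
        "@id": BASE + "/dataset/land-cover",
        "@type": "dcat:Dataset",
        "dct:title": "Example land cover dataset",
        "dct:temporal": {
            "@type": "dct:PeriodOfTime",
            "schema:startDate": "2014-01-01",
            "schema:endDate": "2014-12-31",
        },
        # Placeholder location resource, not a real gazetteer entry
        "dct:spatial": {"@id": BASE + "/place/europe"},
        "dcat:distribution": [file_distribution, service_distribution],
    }
    return {
        "@context": CONTEXT,
        "@id": BASE + "/catalog",
        "@type": "dcat:Catalog",
        "dct:title": "Example MELODIES catalogue",
        "dcat:dataset": [dataset],
    }


catalog = dcat_catalog()
print(json.dumps(catalog, indent=2))
```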

About metadata hosting: in the end the datasets will be hosted on some server anyway, providing things like a GeoSPARQL endpoint and some intelligent way of accessing raster data, for example OPeNDAP or WCS (possibly linked to via RDF somehow). On that server, the dataset will probably have its own URL where the metadata can be stored alongside it. VoID describes three ways of doing just that. I think that's a minimum. The next step would be to point catalogs at this metadata so they can harvest it. However, I don't know of any catalog that has what we want. As far as I can see, only closed ones, for example those from NASA, have rich query capabilities like bounding box and time range searching. We may have some luck adding a bit more temporal and geospatial sauce to CKAN (the software used for catalog portals like datahub.io) in case the available plugins are not enough. That could be one of the MELODIES contributions on the software side.

Things I haven't discussed here are how to model the datasets themselves, how observations are linked to the metadata, and how to integrate raster data. I think this has to be clarified first before thinking about how to expose it in graphical portals.
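To show what "bounding box and time range searching" amounts to at the catalogue level, here is a minimal, self-contained sketch. The `DatasetRecord` type and the sample records are hypothetical, and the intersection test ignores bounding boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalogue entry with spatial and temporal extent."""
    title: str
    bbox: tuple  # (west, south, east, north) in degrees
    start: date
    end: date


def bbox_intersects(a, b):
    """True if two (west, south, east, north) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, bbox, start, end):
    """Return records whose extent overlaps the query box and time range."""
    return [
        r for r in records
        if bbox_intersects(r.bbox, bbox) and r.start <= end and start <= r.end
    ]


# Made-up records for illustration only.
records = [
    DatasetRecord("Iberia land cover", (-10, 35, 5, 45),
                  date(2014, 1, 1), date(2014, 12, 31)),
    DatasetRecord("Arctic sea ice", (-180, 66, 180, 90),
                  date(2010, 1, 1), date(2012, 12, 31)),
]

hits = search(records, bbox=(0, 40, 10, 50),
              start=date(2014, 6, 1), end=date(2014, 6, 30))
print([r.title for r in hits])
```

A real catalogue would push these predicates into a spatial index or database rather than scanning a list, but the query semantics are the same.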

p3dr0 commented 9 years ago

+1 for DCAT ... just please don't ignore the "Catalog" element

However, DCAT's capabilities for geo are quite feeble ... I've been following the discussion on the geo-dcat application profile, which might be a solution, but I'm still not really convinced about it (probably too many INSPIRE antibodies in my bloodstream). Nevertheless, this is probably something to check: http://joinup.ec.europa.eu/mailman/listinfo/dcat_application_profile-geo

jonblower commented 9 years ago

Thanks Pedro - what do you think MELODIES should do for a catalogue? Should we expose our own "demonstrator" catalogue (e.g. with OpenSearch Geo/time interfaces)? Or is there another catalogue we could plug into (e.g. on Terradue's platform) that we could use to demonstrate what we have been doing?

p3dr0 commented 9 years ago

Currently each partner has data repositories and catalogue services as part of the cloud platform baseline services, and these have been exploited in developing and integrating their MELODIES services. What we are missing is a public top-level catalogue that could aggregate and expose particular collections as Open Data.

This study in WP3 will be very useful to frame the metadata model. Among other things, it will help us to check the feasibility of using DCAT to improve our catalogue solution.

jonblower commented 9 years ago

Currently we’re thinking of publishing the MELODIES catalogue as an RDF document using DCAT (and maybe VoID). We think that CKAN instances can harvest this. Would this work for Terradue? How might we include OpenSearch capabilities?
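On the OpenSearch question, one way to picture it: a catalogue advertises a URL template in its OpenSearch description document, and clients fill in parameters such as `{geo:box}`, `{time:start}` and `{time:end}` from the Geo and Time extensions. The sketch below fills such a template with the Python standard library. The endpoint URL and the query parameter names (`q`, `bbox`, `start`, `end`) are hypothetical, and `fill_template` is an illustrative helper, not part of any OpenSearch library.

```python
import re
from urllib.parse import quote

# Hypothetical URL template, as it might appear in an OpenSearch
# description document. A trailing "?" marks a parameter as optional.
TEMPLATE = (
    "http://example.org/melodies/search"
    "?q={searchTerms?}&bbox={geo:box?}"
    "&start={time:start?}&end={time:end?}"
)


def fill_template(template, values):
    """Substitute {name} and {name?} placeholders with encoded values."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            # Keep commas and colons readable in boxes and timestamps
            return quote(str(values[name]), safe=",:")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError("missing required parameter: " + name)

    return re.sub(r"\{([^{}?]+)(\?)?\}", substitute, template)


url = fill_template(TEMPLATE, {
    "geo:box": "-10,35,5,45",  # west,south,east,north
    "time:start": "2014-01-01",
    "time:end": "2014-12-31",
})
print(url)
```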

letmaik commented 9 years ago

I think we should split this issue up as it covers too much. So I think in general we have these topics:

- high-level interoperable dataset description (using DCAT/VoID etc.) -> for ingestion into catalogues, discovery etc.
- exposing the data of the datasets themselves (O&M, custom ontologies, etc.) -> for data users, web portals, etc.
- linking both worlds
- where to host the data and metadata

Working on these separately and doing some discussion cross-referencing is better I think than having a massive thread covering everything.

letmaik commented 9 years ago

I have now opened separate, smaller discussions (#6, #7, #8). If I missed anything, please go ahead and create another issue and link to this one (#3). Please don't add further comments in this mother thread unless absolutely necessary.

jonblower commented 9 years ago

(I created a new issue #9 to discuss where MELODIES data should be published.)

ec-melodies / melodies-all
