frictionlessdata / frictionlessdata.io

The main repository of the Frictionless Data project. Website, issues, and discussions
http://frictionlessdata.io
MIT License

Mapping data packages to DCAT2 #551

Open augusto-herrmann opened 4 years ago

augusto-herrmann commented 4 years ago

The W3C has just published the Data Catalog Vocabulary (DCAT) version 2 as a Recommendation. Should there be a mapping from the Data Package specification to DCAT2?

If so, I think it would be possible to implement with a JSON-LD context.

WDYT, @lwinfree, @rufuspollock, @roll ?

rufuspollock commented 4 years ago

@augusto-herrmann sure - do you want to prep one? We generally would like a "json-ld" context for data packages https://github.com/frictionlessdata/specs/issues/218

augusto-herrmann commented 4 years ago

@rufuspollock I did search the repository for JSON-LD, but somehow I missed those previous threads. Sorry. :flushed:

Anyway, for this issue, what I had in mind is smaller in scope: just mapping the data package metadata to DCAT2 classes and properties. DCAT2, as a W3C recommendation, is very recent – less than a month old. JSON-LD would be just a tool for achieving that goal.

Why is that important? Because it would make it easier for data catalogs (such as CKAN, with the DCAT plugin) to automate importing data packages as datasets.

I am also well aware of the previous failed attempts to reconcile data packages with the W3C's CSVW, and I'm not trying to delve into that at the moment. That, and fully converting the data itself to linked data, is a much harder problem that should be considered out of scope for what I'm proposing here.

I can try to sketch something, but I wanted to first make sure that it made sense, so as not to be a wasted effort.

rufuspollock commented 4 years ago

@augusto-herrmann 👍 on doing the mapping - go for it.

rufuspollock commented 4 years ago

@augusto-herrmann any update here 😄 ?

augusto-herrmann commented 4 years ago

I don't have a lot of time to devote to this right now, but here are the steps I have in mind for undertaking it:

  1. Take a look at the entity/class models of both Data Packages and DCAT2, looking for classes that could be considered equivalent or mappable (e.g. data package x dataset);
  2. create a spreadsheet for establishing relations between, in one column, classes and properties from data packages and, in the other, the corresponding classes and properties of DCAT2. That could already be enough for the mapping proposed in this issue;
  3. (optional next step) create code in any language to convert metadata from the Data Package format to DCAT2, in JSON-LD or any other RDF format;
  4. (optional next step) if the mapping allows for it, create a JSON-LD context such that, when applied to a valid datapackage.json file, it makes it also valid JSON-LD that generates triples in DCAT2 format. That way, a single JSON file could comply with both standards.
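The optional step 3 could start as small as a plain dictionary translation. Here is a minimal sketch; the property mapping, the `datapackage_to_dcat` function, and the hash-fragment URI convention are all illustrative assumptions, not the finished spreadsheet mapping:

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"

def datapackage_to_dcat(dp: dict, base_uri: str) -> dict:
    """Map datapackage.json metadata to a dcat:Dataset expressed as a JSON-LD dict."""
    distributions = []
    for res in dp.get("resources", []):
        dist = {
            "@id": f"{base_uri}#resource-{res['name']}",
            "@type": "dcat:Distribution",
            "dct:title": res.get("title", res["name"]),
            "dcat:downloadURL": res.get("path"),
            "dcat:mediaType": res.get("mediatype"),
        }
        # Drop properties the data package did not provide
        distributions.append({k: v for k, v in dist.items() if v is not None})
    dataset = {
        "@context": {"dcat": DCAT, "dct": DCT},
        "@id": f"{base_uri}#datapackage",
        "@type": "dcat:Dataset",
        "dct:title": dp.get("title", dp.get("name")),
        "dct:description": dp.get("description"),
        "dcat:distribution": distributions,
    }
    return {k: v for k, v in dataset.items() if v is not None}

dp = {
    "name": "example",
    "title": "Example Package",
    "resources": [{"name": "data", "path": "http://example.org/data.csv"}],
}
print(datapackage_to_dcat(dp, "http://example.org/example"))
```

Once the spreadsheet mapping is settled, the hard-coded property pairs above would just be replaced by the agreed table.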

rufuspollock commented 4 years ago

@augusto-herrmann we now have a spreadsheet via frictionlessdata/forum#11 and i've added a sheet for DCAT 2 https://docs.google.com/spreadsheets/d/1XdqGTFni5Jfs8AMbcbfsP7m11h9mOHS0eDtUZtqGVSg/edit#gid=729988073 - would you be up for adding the list of DCAT 2 attributes there?

augusto-herrmann commented 4 years ago

Sure. I'm working on it.

rufuspollock commented 3 years ago

@augusto-herrmann how is this going?

augusto-herrmann commented 3 years ago

I remember doing some of that work last year, but I need to check how far I got (I no longer remember) and resume it when I have some time.

augusto-herrmann commented 3 years ago

Here is how much of it is done.

Classes:

Though I expect that most of the latter ones either don't map to Frictionless or are less relevant.

AyrtonB commented 3 years ago

Hi,

Thanks for all of the existing work on mappings. I'd like to help progress this and map out what's left. @augusto-herrmann, are you continuing to work on this? Initially I plan to focus solely on mapping the Table Schema, as that format fits the majority of the datasets we [in the project described below] will be working with.

I appreciate some of what's covered in this post might be slightly out of scope for this issue; let me know if it would be more useful to create a new one.

Context

I'm working on a project where the metadata of the datasets will be expressed in RDF. We're looking at using the Frictionless Data specifications and wider tooling as a way to reduce the friction for users [of the climate data hub we're working on] to generate the RDF metadata. Specifically, one user story would be using Data Package Creator to generate a datapackage.json; we'd then create a Python library that maps from the datapackage.json to an RDF representation, though the use of a JSON-LD context could enable the datapackage.json to be understood as RDF directly (removing the need for a secondary mapping step).
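To illustrate the "JSON-LD context" route mentioned above: since a context can be attached without touching the package's own structure, a plain datapackage.json could in principle be read as JSON-LD as-is. The context below is a made-up fragment for illustration, not an official Frictionless mapping:

```python
import json

# Hypothetical context mapping a few Data Package properties onto
# DCAT2/DCTERMS terms -- illustrative only, not an agreed vocabulary.
FRICTIONLESS_CONTEXT = {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "title": "dct:title",
    "description": "dct:description",
    "resources": "dcat:distribution",
    "path": {"@id": "dcat:downloadURL", "@type": "@id"},
}

def as_jsonld(datapackage: dict, dataset_uri: str) -> str:
    """Attach the context without changing the package's existing keys."""
    doc = {"@context": FRICTIONLESS_CONTEXT, "@id": dataset_uri, **datapackage}
    return json.dumps(doc, indent=2)

dp = {"name": "example", "title": "Example Package",
      "resources": [{"name": "data", "path": "http://example.org/data.csv"}]}
print(as_jsonld(dp, "http://example.org/example#datapackage"))
```

Any conforming JSON-LD processor could then expand this document into DCAT triples, while Frictionless tools would keep reading it as an ordinary datapackage.json.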

RDF Metadata Generation Steps

Reasons we ideally want to have an RDF metadata representation:

Key components needed to enable this:


Existing FD/RDF Discussions

I thought it might be handy to summarise some of the existing discussions relating to using RDF for Frictionless Data packages.

augusto-herrmann commented 3 years ago

Wow, @AyrtonB, this seems like an amazing project! 😀 Also you've made an awesome summary there of the discussions surrounding Frictionless and RDF. Good job!

A month ago I progressed a little bit more on the mapping, but I was taking this slowly, because I'm always busy with lots of other things. If you've got a dedicated project to take this forward, it makes total sense that you'd take it over from here. The checklist above pretty much marks the point where I left off. So you could continue the mapping by reviewing the parts I've already done and continue what is still left to do.

lwinfree commented 3 years ago

Hi @AyrtonB! This looks awesome! I think you've done a great job summarizing the work that needs to be done and the current situation. I'm on the Frictionless team (with @roll) and would be happy to support you if you have questions or need help. Communicating on github works very well for us, but if you want to have a call to chat let me know :-)

AyrtonB commented 3 years ago

Thanks. That's handy to know @augusto-herrmann, sounds good I'll work from that.

@lwinfree that would be really helpful, thank you. I think for this specific issue the next steps are pretty clear, and I'll try to keep the convo in this thread atm. For the project I'm working on, we're also looking at creating a custom schema for 'data dictionaries' that can act as a central link between different datasets: for each original dataset you only have to specify one foreignKey that maps to the primaryKey in a data dictionary, removing duplication and ensuring only one central dataset needs to be updated when a new dataset is mapped in. It would be really useful to have a call around the implementation of this use-case if possible :)
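The data-dictionary idea above can be sketched with ordinary Table Schema foreign keys; the resource names and fields here are made up for illustration:

```python
# A central "data dictionary" resource holding shared term definitions.
data_dictionary_schema = {
    "fields": [
        {"name": "term", "type": "string"},
        {"name": "definition", "type": "string"},
    ],
    "primaryKey": "term",
}

# Each dataset declares a single foreignKey into the dictionary's
# primaryKey, so definitions live in exactly one place.
dataset_schema = {
    "fields": [
        {"name": "term", "type": "string"},
        {"name": "value", "type": "number"},
    ],
    "foreignKeys": [
        {"fields": "term",
         "reference": {"resource": "data-dictionary", "fields": "term"}},
    ],
}

def links_to_dictionary(schema: dict, dictionary_resource: str) -> bool:
    """Check whether a Table Schema declares a foreign key into the dictionary."""
    return any(fk["reference"].get("resource") == dictionary_resource
               for fk in schema.get("foreignKeys", []))

print(links_to_dictionary(dataset_schema, "data-dictionary"))
```

When a new dataset is mapped in, only its own schema gains a foreignKey entry; the dictionary resource itself is the single dataset that ever needs updating.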

lwinfree commented 3 years ago

Sounds great @AyrtonB!

AyrtonB commented 3 years ago

Update Overview

I've made some progress towards creating a parser for generating an RDF representation of the table schema in datapackage.json. Currently I have a Python script that includes a tableSchema object; when passed the URL for a datapackage, it generates a graph representation of the table schema using rdflib. I've documented the code generation and application in this notebook. An example RDF representation can be found here; the original input was this Open Power Systems Data datapackage.json.

This is very much a pilot but has been useful in terms of working out which terms in different ontologies could be useful, as well as identifying how best to approach the parsing with Python. Next I plan to create an ontology for the frictionless data spec that includes a tableSchema object (building on existing specs), I'll then use the concepts described in it and refactor the Python code.
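For readers without the notebook to hand, the shape of the output is a set of triples per field. This stdlib-only sketch mirrors the idea without rdflib; the `f11d:` terms and the hash-fragment URIs are illustrative stand-ins, not the pilot's actual ontology:

```python
RDF_TYPE = "rdf:type"

def table_schema_triples(schema: dict, base: str) -> list:
    """Emit (subject, predicate, object) tuples for each Table Schema field."""
    triples = []
    for field in schema.get("fields", []):
        subj = f"{base}#field-{field['name']}"  # hypothetical URI convention
        triples.append((subj, RDF_TYPE, "f11d:Field"))
        triples.append((subj, "f11d:name", field["name"]))
        if "type" in field:
            triples.append((subj, "f11d:fieldType", field["type"]))
    return triples

schema = {"fields": [{"name": "timestamp", "type": "datetime"},
                     {"name": "load_mw", "type": "number"}]}
for t in table_schema_triples(schema, "http://example.org/schema"):
    print(t)
```

In the real pilot, rdflib's `Graph` plays the role of the plain list here and handles serialization to Turtle, JSON-LD, etc.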

Progression

Open Questions

Next Steps

augusto-herrmann commented 3 years ago

Currently I have a Python script that includes a tableSchema object, this generates a graph representation of the table schema using rdflib when passed the url for a datapackage. I've documented the code generation and application in this notebook.

That is a cool experiment! Have you already finished the mapping I started on the Google Spreadsheet linked above? I think we should only start developing practical implementations once the mapping of concepts between Tabular Data Packages and DCAT2 is completely finished; otherwise we may end up doing unnecessary repeated work.

  • The only way to do that seems to involve changing the structure of the datapackage.json which would likely not be ideal for the FD specification

Why would creating a JSON-LD @context for Frictionless require changing the structure of datapackage.json? Could you please elaborate on that?

  • How best to handle blank nodes? e.g. for each resource/dataset

Perhaps data packages and resources should not be blank nodes, but identified by URIs instead. There could be a deterministic way to create those URIs, appending hash fragments at the end: something like #datapackage for packages/datasets and #resource-name for resources.

  • Do we need both dataset and distribution? - Currently all attributes are tied to the distribution and dataset is only used as a link to the catalog

DCAT2 makes a distinction between dataset and distribution in the sense that the former is more abstract and the latter is a concrete representation or serialization of the same data. While this makes philosophical sense, from a pragmatic point of view there is little benefit in making that separation here. In Frictionless, the different representations of the same data are just different resources in the same data package.

It is not an exact match, but maybe the dcat:Dataset should be mapped to f11d:Package and dcat:Distribution to f11d:Resource.

  • What previous issues existed with CSVW? - Looks ideal for describing fields but previous comments have alluded to incompatibility

I do not remember exactly, but I think the main problem is that CSVW makes using RDF mandatory, and many in the community would rather make it optional.

  • Should the table-schema data-structure be described in OWL?

I think the initial scope of this issue was just mapping data packages in a general sense, which may not even be tabular. Table schema could be a possible next step after this is done.

  • Does coverage need to be comprehensive or would it work to start from a subset?

I think this mapping should enable conversion of metadata from the basic Data Package -> DCAT2 and DCAT2 -> Data Package with as little information loss as possible. Except, for the moment, for profiles, like table schema, of course.

  • What are the semantic differences between: name, title, description, and longDescription

The differences between name, title and description are all laid out in the specification. I checked but could not find a longDescription attribute, neither in the Data Package specification, nor in DCAT2. Where did you find it?

lwinfree commented 2 years ago

Hi @AyrtonB and @augusto-herrmann! I'm working on cleaning up issues in this repo and wanted to touch base with you two about this. Thanks for all the work you've done so far! Are you still interested in working on this? There is no pressure from me, and no time crunch, I'm just inquiring about the status :-) Hope you are both doing OK!

augusto-herrmann commented 2 years ago

Hi @lwinfree, thanks for bringing this up. Yes, I am still interested in this issue, but don't have much time to dedicate to solving it – last I checked, some effort was required to finish the mapping between the f11d specs and DCAT2 classes and properties in RDF. I'm curious to find out how far @AyrtonB has been able to progress with it, or, if possible, to hear more detail on the questions I asked above.

roll commented 7 months ago

Hi people,

I have implemented an initial version of a DCAT mapper (heavily based on ckanext-dcat) for the new dplib-py library (lightweight data package models). If you are interested in contributing, please take a look at https://github.com/frictionlessdata/dplib-py/blob/main/dplib/plugins/dcat/models/package.py (and resource.py) -- it uses a very simple Pydantic-based model framework, so it's quite easy to do metadata mappings in a fully-typed manner.

PS. Note that, unlike CKAN and others, the DCAT mapper has a two-layer mapping: rdf -> dcat model -> dp model. Actually, I think it would be great if it were possible to extract something like a dcatlib and re-use it in both dplib-py and ckanext-dcat.
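The two-layer idea can be illustrated with a toy sketch: an intermediate DCAT model that is populated from RDF, and a second step that converts it to Data Package metadata. The classes and names below are hypothetical stand-ins, not dplib-py's actual API:

```python
from dataclasses import dataclass, field

# Toy intermediate DCAT model (first layer would populate these from RDF).
@dataclass
class DcatDistribution:
    title: str
    download_url: str

@dataclass
class DcatDataset:
    title: str
    distributions: list = field(default_factory=list)

def dcat_to_datapackage(ds: DcatDataset) -> dict:
    """Second layer: DCAT model -> Data Package metadata dict."""
    return {
        "title": ds.title,
        "resources": [
            {"name": d.title.lower().replace(" ", "-"), "path": d.download_url}
            for d in ds.distributions
        ],
    }

ds = DcatDataset("Example", [DcatDistribution("Raw Data", "http://example.org/raw.csv")])
print(dcat_to_datapackage(ds))
```

Keeping the DCAT model as a separate layer is what would make a standalone dcatlib extractable: the rdf -> dcat step could live there, while each consumer (dplib-py, ckanext-dcat) supplies its own second step.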