frictionlessdata / frictionlessdata.io

The main repository of the Frictionless Data project. Website, issues, and discussions
http://frictionlessdata.io
MIT License

Mapping data packages to DCAT2 #551

Open augusto-herrmann opened 4 years ago

augusto-herrmann commented 4 years ago

The W3C has just published the Data Catalog Vocabulary (DCAT) version 2 as a Recommendation. Should there be a mapping from the Data Package specification to DCAT2?

If so, I think it would be possible to implement with a JSON-LD context.

WDYT, @lwinfree, @rufuspollock, @roll ?

rufuspollock commented 4 years ago

@augusto-herrmann sure - do you want to prep one? We generally would like a "json-ld" context for data packages https://github.com/frictionlessdata/specs/issues/218

augusto-herrmann commented 4 years ago

@rufuspollock I did search the repository for JSON-LD, but somehow I missed those previous threads. Sorry. :flushed:

Anyway, for this issue, what I had in mind is smaller in scope: just mapping the data package metadata to DCAT2 classes and properties. DCAT2, as a W3C recommendation, is very recent – less than a month old. JSON-LD would be just a tool for achieving that goal.

Why is that important? Because it would make it easier for data catalogs (such as CKAN, with the DCAT plugin) to automate importing data packages as datasets.

I am also well aware of the previous failed attempts to reconcile data packages with the W3C's CSVW, and I'm not trying to delve into that at the moment. That, and fully converting the data itself to linked data, is a much harder problem that should be considered out of scope for what I'm proposing here.

I can try to sketch something, but I wanted to first make sure that it made sense, so as not to be a wasted effort.

rufuspollock commented 4 years ago

@augusto-herrmann 👍 on doing the mapping - go for it.

rufuspollock commented 4 years ago

@augusto-herrmann any update here 😄 ?

augusto-herrmann commented 4 years ago

I don't have a lot of time to devote to this right now, but here are the steps I have in mind for undertaking it:

  1. Take a look at the entity/class models of both Data Packages and DCAT2, looking for classes that could be considered equivalent or mappable (e.g. data package x dataset);
  2. create a spreadsheet for establishing relations between, in one column, classes and properties from data packages and, in the other, the corresponding classes and properties of DCAT2. That could already be enough for the mapping proposed in this issue;
  3. (optional next step) create code in any language to convert metadata from the Data Package format to DCAT2, in JSON-LD or any other RDF format;
  4. (optional next step) if the mapping allows for it, create a JSON-LD context such that, when applied to a valid datapackage.json file, it makes it also valid JSON-LD that generates triples in DCAT2 format. That way, a single JSON file could comply with both standards.
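The optional step 3 could start as small as a plain dictionary translation. Here is a minimal sketch; the property mapping, the `datapackage_to_dcat` function, and the hash-fragment URI convention are all illustrative assumptions, not the finished spreadsheet mapping:

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"

def datapackage_to_dcat(dp: dict, base_uri: str) -> dict:
    """Map datapackage.json metadata to a dcat:Dataset expressed as a JSON-LD dict."""
    distributions = []
    for res in dp.get("resources", []):
        dist = {
            "@id": f"{base_uri}#resource-{res['name']}",
            "@type": "dcat:Distribution",
            "dct:title": res.get("title", res["name"]),
            "dcat:downloadURL": res.get("path"),
            "dcat:mediaType": res.get("mediatype"),
        }
        # Drop properties the data package did not provide
        distributions.append({k: v for k, v in dist.items() if v is not None})
    dataset = {
        "@context": {"dcat": DCAT, "dct": DCT},
        "@id": f"{base_uri}#datapackage",
        "@type": "dcat:Dataset",
        "dct:title": dp.get("title", dp.get("name")),
        "dct:description": dp.get("description"),
        "dcat:distribution": distributions,
    }
    return {k: v for k, v in dataset.items() if v is not None}

dp = {
    "name": "example",
    "title": "Example Package",
    "resources": [{"name": "data", "path": "http://example.org/data.csv"}],
}
print(datapackage_to_dcat(dp, "http://example.org/example"))
```

Once the spreadsheet mapping is settled, the hard-coded property pairs above would just be replaced by the agreed table.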

rufuspollock commented 4 years ago

@augusto-herrmann we now have a spreadsheet via frictionlessdata/forum#11 and i've added a sheet for DCAT 2 https://docs.google.com/spreadsheets/d/1XdqGTFni5Jfs8AMbcbfsP7m11h9mOHS0eDtUZtqGVSg/edit#gid=729988073 - would you be up for adding the list of DCAT 2 attributes there?

augusto-herrmann commented 4 years ago

Sure. I'm working on it.

rufuspollock commented 3 years ago

@augusto-herrmann how is this going?

augusto-herrmann commented 3 years ago

I remember doing some of that work last year, but I need to check how far I got (I no longer remember) and resume it when I have some time.

augusto-herrmann commented 3 years ago

Here is how much of it is done.

Classes:

Though I expect that most of the latter ones either don't map to Frictionless or are less relevant.

AyrtonB commented 3 years ago

Hi,

Thanks for all of the existing work on mappings. I'd like to help progress this and map out what's left. @augusto-herrmann, are you continuing to work on this? Initially I plan to focus solely on mapping the Table Schema, as that format fits the majority of the datasets we [in the project described below] will be working with.

I appreciate some of what's covered in this post might be slightly out of scope for this issue; let me know if it would be more useful to create a new one.

Context

I'm working on a project where the metadata of the datasets will be expressed in RDF. We're looking at using the Frictionless Data specifications and wider tooling as a way to reduce the friction for users [of the climate data hub we're working on] to generate the RDF metadata. Specifically, one user story would be using Data Package Creator to generate a datapackage.json; we'd then create a Python library that maps from the datapackage.json to an RDF representation, though the use of a JSON-LD context could enable the datapackage.json to be understood as RDF directly (removing the need for a secondary mapping step).
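To illustrate the "JSON-LD context" route mentioned above: since a context can be attached without touching the package's own structure, a plain datapackage.json could in principle be read as JSON-LD as-is. The context below is a made-up fragment for illustration, not an official Frictionless mapping:

```python
import json

# Hypothetical context mapping a few Data Package properties onto
# DCAT2/DCTERMS terms -- illustrative only, not an agreed vocabulary.
FRICTIONLESS_CONTEXT = {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "title": "dct:title",
    "description": "dct:description",
    "resources": "dcat:distribution",
    "path": {"@id": "dcat:downloadURL", "@type": "@id"},
}

def as_jsonld(datapackage: dict, dataset_uri: str) -> str:
    """Attach the context without changing the package's existing keys."""
    doc = {"@context": FRICTIONLESS_CONTEXT, "@id": dataset_uri, **datapackage}
    return json.dumps(doc, indent=2)

dp = {"name": "example", "title": "Example Package",
      "resources": [{"name": "data", "path": "http://example.org/data.csv"}]}
print(as_jsonld(dp, "http://example.org/example#datapackage"))
```

Any conforming JSON-LD processor could then expand this document into DCAT triples, while Frictionless tools would keep reading it as an ordinary datapackage.json.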

RDF Metadata Generation Steps

Reasons we ideally want to have an RDF metadata representation:

Key components needed to enable this:


Existing FD/RDF Discussions

I thought it might be handy to summarise some of the existing discussions relating to using RDF for Frictionless Data packages.

augusto-herrmann commented 3 years ago

Wow, @AyrtonB, this seems like an amazing project! 😀 Also you've made an awesome summary there of the discussions surrounding Frictionless and RDF. Good job!

A month ago I progressed a little bit more on the mapping, but I was taking this slowly, because I'm always busy with lots of other things. If you've got a dedicated project to take this forward, it makes total sense that you'd take it over from here. The checklist above pretty much marks the point where I left off. So you could continue the mapping by reviewing the parts I've already done and continue what is still left to do.

lwinfree commented 3 years ago

Hi @AyrtonB! This looks awesome! I think you've done a great job summarizing the work that needs to be done and the current situation. I'm on the Frictionless team (with @roll) and would be happy to support you if you have questions or need help. Communicating on github works very well for us, but if you want to have a call to chat let me know :-)

AyrtonB commented 3 years ago

Thanks. That's handy to know @augusto-herrmann, sounds good I'll work from that.

@lwinfree that would be really helpful, thank you. I think for this specific issue the next steps are pretty clear, and I'll try to keep the convo in this thread atm. For the project I'm working on, we're also looking at creating a custom schema for 'data dictionaries' that can act as a central link between different datasets: for each original dataset you only have to specify one foreignKey that maps to the primaryKey in a data dictionary, removing duplication and ensuring only one central dataset needs to be updated when a new dataset is mapped in. It would be really useful to have a call around the implementation of this use-case if possible :)
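The data-dictionary idea above can be sketched with ordinary Table Schema foreign keys; the resource names and fields here are made up for illustration:

```python
# A central "data dictionary" resource holding shared term definitions.
data_dictionary_schema = {
    "fields": [
        {"name": "term", "type": "string"},
        {"name": "definition", "type": "string"},
    ],
    "primaryKey": "term",
}

# Each dataset declares a single foreignKey into the dictionary's
# primaryKey, so definitions live in exactly one place.
dataset_schema = {
    "fields": [
        {"name": "term", "type": "string"},
        {"name": "value", "type": "number"},
    ],
    "foreignKeys": [
        {"fields": "term",
         "reference": {"resource": "data-dictionary", "fields": "term"}},
    ],
}

def links_to_dictionary(schema: dict, dictionary_resource: str) -> bool:
    """Check whether a Table Schema declares a foreign key into the dictionary."""
    return any(fk["reference"].get("resource") == dictionary_resource
               for fk in schema.get("foreignKeys", []))

print(links_to_dictionary(dataset_schema, "data-dictionary"))
```

When a new dataset is mapped in, only its own schema gains a foreignKey entry; the dictionary resource itself is the single dataset that ever needs updating.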

lwinfree commented 3 years ago

Sounds great @AyrtonB!

AyrtonB commented 3 years ago

Update Overview

I've made some progress towards creating a parser for generating an RDF representation of the table schema in datapackage.json. Currently I have a Python script that includes a tableSchema object; when passed the URL for a datapackage, it generates a graph representation of the table schema using rdflib. I've documented the code generation and application in this notebook. An example RDF representation can be found here; the original input was this Open Power Systems Data datapackage.json.

This is very much a pilot but has been useful in terms of working out which terms in different ontologies could be useful, as well as identifying how best to approach the parsing with Python. Next I plan to create an ontology for the frictionless data spec that includes a tableSchema object (building on existing specs), I'll then use the concepts described in it and refactor the Python code.
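For readers without the notebook to hand, the shape of the output is a set of triples per field. This stdlib-only sketch mirrors the idea without rdflib; the `f11d:` terms and the hash-fragment URIs are illustrative stand-ins, not the pilot's actual ontology:

```python
RDF_TYPE = "rdf:type"

def table_schema_triples(schema: dict, base: str) -> list:
    """Emit (subject, predicate, object) tuples for each Table Schema field."""
    triples = []
    for field in schema.get("fields", []):
        subj = f"{base}#field-{field['name']}"  # hypothetical URI convention
        triples.append((subj, RDF_TYPE, "f11d:Field"))
        triples.append((subj, "f11d:name", field["name"]))
        if "type" in field:
            triples.append((subj, "f11d:fieldType", field["type"]))
    return triples

schema = {"fields": [{"name": "timestamp", "type": "datetime"},
                     {"name": "load_mw", "type": "number"}]}
for t in table_schema_triples(schema, "http://example.org/schema"):
    print(t)
```

In the real pilot, rdflib's `Graph` plays the role of the plain list here and handles serialization to Turtle, JSON-LD, etc.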

Progression

Open Questions

Next Steps

augusto-herrmann commented 3 years ago

Currently I have a Python script that includes a tableSchema object, this generates a graph representation of the table schema using rdflib when passed the url for a datapackage. I've documented the code generation and application in this notebook.

That is a cool experiment! Have you already finished the mapping I started on the Google Spreadsheet linked above? I think we should only start developing practical implementations once the mapping of concepts between Tabular Data Packages and DCAT2 is completely finished; otherwise we may end up doing unnecessary repeated work.

  • The only way to do that seems to involve changing the structure of the datapackage.json which would likely not be ideal for the FD specification

Why would creating a JSON-LD @context for Frictionless require changing the structure of datapackage.json? Could you please elaborate on that?

  • How best to handle blank nodes? e.g. for each resource/dataset

Perhaps data packages and resources should not be blank nodes, but identified by URIs instead. There could be a deterministic way to create those URIs, appending hash fragments at the end: something like #datapackage for packages/datasets and #resource-name for resources.

  • Do we need both dataset and distribution? - Currently all attributes are tied to the distribution and dataset is only used as a link to the catalog

DCAT2 makes a distinction between dataset and distribution in the sense that the former is more abstract and the latter is a concrete representation or serialization of the same data. While this makes philosophical sense, from a pragmatic point of view there is little benefit in making that separation here. In Frictionless, the different representations of the same data are just different resources in the same data package.

It is not an exact match, but maybe the dcat:Dataset should be mapped to f11d:Package and dcat:Distribution to f11d:Resource.

  • What previous issues existed with CSVW? - Looks ideal for describing fields but previous comments have alluded to incompatibility

I do not remember exactly, but I think the main problem is that CSVW makes using RDF mandatory, and many in the community would rather make it optional.

  • Should the table-schema data-structure be described in OWL?

I think the initial scope of this issue was just mapping data packages in a general sense, which may not even be tabular. Table schema could be a possible next step after this is done.

  • Does coverage need to be comprehensive or would it work to start from a subset?

I think this mapping should enable conversion of metadata from the basic Data Package -> DCAT2 and DCAT2 -> Data Package with as little information loss as possible. Except, for the moment, for profiles, like table schema, of course.

  • What are the semantic differences between: name, title, description, and longDescription

The differences between name, title and description are all laid out in the specification. I checked but could not find a longDescription attribute, neither in the Data Package specification, nor in DCAT2. Where did you find it?

lwinfree commented 2 years ago

Hi @AyrtonB and @augusto-herrmann! I'm working on cleaning up issues in this repo and wanted to touch base with you two about this. Thanks for all the work you've done so far! Are you still interested in working on this? There is no pressure from me, and no time crunch, I'm just inquiring about the status :-) Hope you are both doing OK!

augusto-herrmann commented 2 years ago

Hi @lwinfree, thanks for bringing this up. Yes, I am still interested in this issue, but don't have much time to dedicate to solving it – last I checked, some effort was required to finish the mapping between the f11d specs and DCAT2 classes and properties in RDF. I'm curious to find out how far @AyrtonB has been able to progress with it, or, if possible, to hear more detail on the questions I asked above.

roll commented 7 months ago

Hi people,

I have implemented an initial version of a DCAT mapper (heavily based on ckanext-dcat) for the new dplib-py library (lightweight data package models). If you are interested in contributing, please take a look at https://github.com/frictionlessdata/dplib-py/blob/main/dplib/plugins/dcat/models/package.py (and resource.py) -- it uses a very simple Pydantic-based model framework, so it's quite easy to do metadata mappings in a fully-typed manner.

PS. Note that, unlike CKAN and others, the DCAT mapper has a two-layer mapping: rdf -> dcat model -> dp model. Actually, I think it would be great if it were possible to extract something like a dcatlib and re-use it in both dplib-py and ckanext-dcat.
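The two-layer idea can be illustrated with a toy sketch: an intermediate DCAT model that is populated from RDF, and a second step that converts it to Data Package metadata. The classes and names below are hypothetical stand-ins, not dplib-py's actual API:

```python
from dataclasses import dataclass, field

# Toy intermediate DCAT model (first layer would populate these from RDF).
@dataclass
class DcatDistribution:
    title: str
    download_url: str

@dataclass
class DcatDataset:
    title: str
    distributions: list = field(default_factory=list)

def dcat_to_datapackage(ds: DcatDataset) -> dict:
    """Second layer: DCAT model -> Data Package metadata dict."""
    return {
        "title": ds.title,
        "resources": [
            {"name": d.title.lower().replace(" ", "-"), "path": d.download_url}
            for d in ds.distributions
        ],
    }

ds = DcatDataset("Example", [DcatDistribution("Raw Data", "http://example.org/raw.csv")])
print(dcat_to_datapackage(ds))
```

Keeping the DCAT model as a separate layer is what would make a standalone dcatlib extractable: the rdf -> dcat step could live there, while each consumer (dplib-py, ckanext-dcat) supplies its own second step.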