Open amercader opened 1 month ago
@amercader I've been working in DCAT this week, including adding spec compliant HVD 2.2.0 output and scheming portions to the current 1.7 version. (somewhat split across dcat and our schema extension at the moment).
A couple of things have come up for making the output compliant with the HVD SHACL files (https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/#validation):
1) There are some items that need to be typed, e.g. licenses. This is a first cut, and I want to refactor this into the add_triples... methods: https://github.com/derilinx/ckanext-dcat/blob/dcat-hvd-2.2.0/ckanext/dcat/profiles.py#L914
def _add_with_class(self, dataset_dict, dataset_ref, key, predicate, _type, _class, list_value=False):
    value = self._get_dataset_value(dataset_dict, key)

    def _add(v):
        ref = _type(v)
        self.g.add((ref, RDF.type, _class))
        self.g.add((dataset_ref, predicate, ref))

    if value:
        if list_value:
            for v in self._read_list_value(value):
                _add(v)
        else:
            _add(value)
...
self._add_with_class(resource_dict, distribution, 'license', DCT.license, URIRefOrLiteral, DCT.LicenseDocument)
gives us something like this:
...
<http://www.opendefinition.org/licenses/cc-by> a dct:LicenseDocument .
<http://data.europa.eu/eli/reg_impl/2023/138/oj> a <http://data.europa.eu/eli/ontology#LegalResource> .
<https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec/resource/52dcb446-d1f1-40d2-a515-bd708a57b9c6> a dcat:Distribution ;
    dcatap:applicableLegislation <http://data.europa.eu/eli/reg_impl/2023/138/oj> ;
    dct:format "HTML" ;
    dct:issued "2024-05-20T16:12:07"^^xsd:dateTime ;
    dct:license <http://www.opendefinition.org/licenses/cc-by> ;
    dct:modified "2024-05-20T17:00:51"^^xsd:dateTime ;
    dct:title "Test" ;
    dcat:accessURL <https://test.staging.derilinx.com/> .
2) Codelists are important, e.g., the HVD Category needs to be from this list: https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/high-value-dataset-category (which when it's not being slammed, has an RDF file with a skos:Concept and entries, each with a prefLabel from each official EU language.) (dl is here: https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2F29a21fd5-5c6f-11ee-9220-01aa75ed71a1.0001.02%2FDOC_1&fileName=high-value-dataset-category.rdf)
The codelists get rendered like this in the .ttl:
<https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec> a dcat:Dataset ;
    dcatap:applicableLegislation <http://data.europa.eu/eli/dir/2007/2/2019-06-26>,
        <http://data.europa.eu/eli/reg_impl/2023/138/oj> ;
    dcatap:hvdCategory <http://data.europa.eu/bna/c_dd313021> ;
    dct:identifier "242e33cf-a097-4f59-94f3-25fcddeffaec" ;
    dct:issued "2024-05-20T16:11:40"^^xsd:dateTime ;
    dct:language "en" ;
    dct:modified "2024-05-20T17:00:51"^^xsd:dateTime ;
    dct:publisher <https://test.staging.derilinx.com/organization/b30a8777-1478-43e1-8dcb-9beded4f5052> ;
    dct:title "Test Dataset" ;
    dcat:distribution <https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec/resource/52dcb446-d1f1-40d2-a515-bd708a57b9c6> .

<http://data.europa.eu/bna/c_dd313021> a skos:Concept ;
    skos:inScheme <http://data.europa.eu/bna/asd487ae75> .
@EricSoroos thanks for the feedback:
euro_dcat_ap_2? If so would be great to have it upstream. I see it as not directly related to this PR; once support for DCAT-AP 2.1 is ready we can create a separate profile and schema for HVD 2.2.0
_add_triples_... utility functions
choices or more likely choices_helper presets for required or recommended fields, with CLI commands to import them into the main or datastore database, plus a form snippet that shows the options (or autocompletes them if there are a lot of them)
Great to see you are working on this same area. If you have any feedback on the general approach followed for scheming support it would be great to hear it.
@seitenbau-govdata @bellisk would love to know your take on this, and see if this approach would play well with how you are using ckanext-dcat
In terms of HVD support, the current EU DCAT 2 implementation is close, at least, it has all of the fields. This commit: https://github.com/derilinx/ckanext-dcat/commit/d5ef9f47346e4963de85daa2c62490dd1966557e is the difference, and it's only the codelist and two types that were required. There are some other compliance issues, like one of the license or rights needs to be available, and the applicable_legislation has to have at least one specific value. I'm looking at validation level stuff for those (legislation already done, license/rights not). I'm at the point of thinking that these things are more general, so _add_hvd_category should be _add_from_codelist.
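A minimal sketch of what those validation-level checks could look like, using plain dicts so it runs without CKAN. The function name, the field names (`applicable_legislation`, `license`, `rights`) and the choice of the HVD implementing regulation URI as the "specific value" are assumptions for illustration, not code from either branch:

```python
# Hypothetical helper: names and error wording are assumptions, not ckanext-dcat API.
# Assumed "specific value": the HVD implementing regulation seen in the output above.
HVD_LEGISLATION = "http://data.europa.eu/eli/reg_impl/2023/138/oj"


def hvd_compliance_errors(dataset_dict):
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []

    # applicable_legislation must contain the HVD regulation
    legislation = dataset_dict.get("applicable_legislation") or []
    if HVD_LEGISLATION not in legislation:
        errors.append("applicable_legislation must include " + HVD_LEGISLATION)

    # every resource needs at least one of license / rights
    for resource in dataset_dict.get("resources", []):
        if not (resource.get("license") or resource.get("rights")):
            errors.append(
                "resource %s needs a license or rights value" % resource.get("id", "?")
            )
    return errors
```

In a real extension this logic would more likely live in scheming validators attached to the relevant fields; a standalone function just keeps the two rules easy to see.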
I'm not clear that we'd necessarily want to be adding a separate profile for this -- Inheritance is really tricky when you're blatting in items to a graph, and may need to override just one piece of it. From what I can tell, the extra profiles tend to be aggregative and compatible, so realistically, there are potentially a few extra fields per entity and/or additional codes/required fields. Also, I think that the changes here are more of the form of "potentially backwards incompatible fixing the implementation" rather than actually adding support for the profile.
FWIW, I think this has been the general take previously, e.g, the geo fields are added from GeoDCAT.
For the Codelists, (at least on the scheming side) I've got something like this in my schema:
{
  "field_name": "hvd_category",
  "grouping": "High Value Datasets",
  "label": "High Value Dataset Category",
  "form_snippet": "select.html",
  "validators": "ignore_missing",
  "choices_helper": "dlxschema_codelist_choices",
  "codelist": "high-value-dataset-category",
  "help_text": {
    "en": "EU Category for HVD."
  }
},
And then the choices_helper is this:
import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def _load_codelist(choices_path):
    """ Cache the json load, so that we're only actually reading once per process """
    return json.loads(choices_path.read_text())


def codelist_choices(field):
    """ Get the choices corresponding to the code list from the codelists directory

    :param field: dict, scheming field definition; its 'codelist' key is the
        name of the codelist, not including the extension
    :returns: list of scheming choices
    """
    name = field.get('codelist', None)
    if not name:
        return []
    choices_path = Path(__file__).parent / 'codelists' / (name + ".json")
    # exists is a method; without the call the check is always truthy
    if not choices_path.exists():
        return []
    return _load_codelist(choices_path)
The codelist directory has the .rdf and a .json converted from it, with the languages I'm interested in (though realistically, it wouldn't hurt to put all the EU languages in)
[
  {
    "label": {
      "en": "Geospatial",
      "ga": "Geosp\u00e1s\u00fail"
    },
    "value": "http://data.europa.eu/bna/c_ac64a52d"
  },
  {
    "label": {
      "en": "Earth observation and environment",
      "ga": "Faire na cruinne agus an comhshaol"
    },
    "value": "http://data.europa.eu/bna/c_dd313021"
  },
...
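One possible way to produce a .json of that shape from the .rdf, sketched with only the standard library. It assumes the RDF/XML serializes each entry as a typed `skos:Concept` element with `rdf:about` and language-tagged `skos:prefLabel` children; if the published file uses `rdf:Description` nodes instead, the lookup would need adjusting (or a real RDF parser). `codelist_to_choices` and `SAMPLE` are made up for this example:

```python
import json
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of xml:lang


def codelist_to_choices(rdf_xml, languages=("en", "ga")):
    """Convert RDF/XML text holding skos:Concept elements into scheming choices."""
    root = ET.fromstring(rdf_xml)
    choices = []
    for concept in root.iter("{%s}Concept" % SKOS_NS):
        uri = concept.get("{%s}about" % RDF_NS)
        labels = {}
        for label in concept.findall("{%s}prefLabel" % SKOS_NS):
            lang = label.get("{%s}lang" % XML_NS)
            if lang in languages:
                labels[lang] = label.text
        # Skip entries without a URI or without any label in the wanted languages
        if uri and labels:
            choices.append({"label": labels, "value": uri})
    return choices


# Hand-written fragment in the assumed layout, not the published file
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://data.europa.eu/bna/c_ac64a52d">
    <skos:prefLabel xml:lang="en">Geospatial</skos:prefLabel>
    <skos:prefLabel xml:lang="ga">Geosp\u00e1s\u00fail</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

print(json.dumps(codelist_to_choices(SAMPLE, languages=("en",)), indent=2))
```

Passing a wider `languages` tuple is all it would take to keep every EU language in the generated file.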
Right now, this is spread over my schema plugin and the dcat plugin, but the next iteration is going to need to pull the codelists into dcat so that I can kick out the prefLabels there.
A little more thinking on the relationship between DCAT-AP base and the extension profiles (e.g. HVD, Geo, etc.).
I think that it would definitely make sense to have the individual profiles have either pluggable schema sections or diffs/inheritance against the core schema. E.g., Site A needs HVD, Site B needs HVD + Geo. We're using our schema_field_groups for this, so there's an HVD tab in the dataset view.
At the graph generation level, I don't know if there's a clean way to do this in an inherited manner. Right now, the EUDCAT2 is a combination of v1 + base + HVD + Geo. There's no issue adding the additional profiles here if the data doesn't support it.
Maybe a better way to do this would be composition rather than inheritance. E.g., have the profile configure a set of [ckan object]_to_graph methods, and those additional profiles would only be responsible for those items that aren't part of the base. As it is, it feels like the profile inheritance is quite chunky for adding a few fields.
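To make the composition idea a bit more concrete, here is a rough sketch. It uses plain tuples in place of an rdflib graph so it runs on its own, and every name in it (`ComposedProfile`, `base_dataset`, `hvd_dataset`, `base_resource`) is hypothetical rather than part of ckanext-dcat:

```python
class ComposedProfile:
    """Hypothetical profile assembled from small per-entity contributor functions."""

    def __init__(self, dataset_contributors=(), resource_contributors=()):
        self.dataset_contributors = list(dataset_contributors)
        self.resource_contributors = list(resource_contributors)

    def graph_from_dataset(self, dataset_dict, dataset_ref):
        # A list of (subject, predicate, object) tuples stands in for the graph
        triples = []
        for contribute in self.dataset_contributors:
            triples.extend(contribute(dataset_dict, dataset_ref))
        for resource in dataset_dict.get("resources", []):
            resource_ref = "%s/resource/%s" % (dataset_ref, resource["id"])
            triples.append((dataset_ref, "dcat:distribution", resource_ref))
            for contribute in self.resource_contributors:
                triples.extend(contribute(resource, resource_ref))
        return triples


def base_dataset(dataset_dict, ref):
    yield (ref, "rdf:type", "dcat:Dataset")
    if dataset_dict.get("title"):
        yield (ref, "dct:title", dataset_dict["title"])


def hvd_dataset(dataset_dict, ref):
    # Only responsible for what HVD adds on top of the base
    for category in dataset_dict.get("hvd_category", []):
        yield (ref, "dcatap:hvdCategory", category)


def base_resource(resource_dict, ref):
    yield (ref, "rdf:type", "dcat:Distribution")


# Site A: base + HVD. A site needing Geo too would append another contributor.
profile = ComposedProfile([base_dataset, hvd_dataset], [base_resource])
```

The point of the design is that an extension profile never has to override a whole `graph_from_dataset`; it only registers the handful of fields it owns.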
I deliberately avoided implementing inheritance in scheming. Instead dcat fields can be defined as presets (including the field ids) and specific schemas then populate fields from the presets like
- preset: dcat_dataset_contact
- preset: dcat_dataset_publisher
...
This gives complete control from the specific schemas without needing to resolve inheritance issues.
Fair enough -- It's not too hard to make sure that you have all of the chunks required for a specific profile, though for HVD they're scattered over both the dataset and resource.
Thanks @EricSoroos I need to think a bit more about how best to support extensions but I don't want to get too distracted from the original goal of this PR (having a base schema for the current dcat-ap 2 profile). If we can come up with a better way of extending profiles, by all means let's explore it.
After looking at the actual changes, my gut feeling is that the HVD support can be directly incorporated in the current profile for parsing/serializing without needing a separate one, and the scheming bits can be grouped in a preset (or two, one for dataset and one for resource fields). Have you implemented GeoDCAT support as well, or was it just an example? It would be good to see different approaches for this.
We're reviewing Geo as well. (and a couple others, and localization)
Realistically, HVD is small except for some of the out of our scope spec issues (e.g., permanent identifiers, legislation in the MS to indicate that it's a HVD). It's also already included, so splitting it out involves some compatibility issues.
Geo is in a similar space -- it's been included forever, so if we do add a separate profile for it, it would be for SHACL-compliant changes that are backwards incompatible with the existing extension.
I think this is now ready to go, any further work should be done in separate PRs as this has grown quite a lot.
Highlights are:
This looks good.
I would be tempted to put more of the logic in the schemas but this extension needs to maintain backwards compatibility and ckanext-scheming-less operation so your approach makes sense.
This PR adds initial support for seamless integration between ckanext-dcat and ckanext-scheming, providing a custom profile that modifies the dataset dicts generated and consumed from the existing profiles so it plays well with the scheming presets defined.
Summary of changes
(ckanext/dcat/schemas/dcat_ap_2.1.yaml)
RDFProfile class to access schema field definitions from datasets and resources
euro_dcat_ap_scheming profile that adds support for the field serializations supported by the ckanext-scheming presets. The existing profiles (euro_dcat_ap and euro_dcat_ap_2) remain unchanged (except for some very minor backward compatible changes regarding the handling of access services in distributions/resources). This means that existing sites will keep working as currently, but maintainers can choose to enable scheming support if they choose to migrate to that approach. Upcoming DCAT 3 based profiles will be scheming based (in a new ckanext-dcat version)
Compatibility and release plan
Extra care has been taken to not break any existing systems. Sites using the existing euro_dcat_ap and euro_dcat_ap_2 profiles should not see any change in their current parsing and serialization functionalities and these profiles will never change their outputs. Sites willing to migrate to a scheming based profile can do so by adding the new euro_dcat_ap_scheming profile at the end of their profile chain (value of the ckanext.dcat.rdf.profiles config option, e.g. ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming), which will modify the existing profile outputs to the format expected by the scheming validators. Note that the scheming profile will only affect fields defined in the schema definition file, so sites can migrate different metadata fields gradually.
This compatibility profile will be released in the next ckanext-dcat version (1.8.0). The upcoming DCAT v3 based profiles for DCAT-AP 3 and DCAT-US 3 will be scheming based and will incorporate the mapping changes described below.
Mapping changes
The main changes between the old processors (parsers and serializers) and the new scheming-based ones are:
Root level fields
Custom DCAT fields that didn't link directly to standard CKAN fields were stored as extras (see all the ones marked extra: here). So the DCAT version_notes field would be stored as:
In the scheming-based profile, if the field is defined in the scheming schema, it will get stored as a root level field, like all custom dataset properties:
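As a rough illustration of the two shapes described in this section, the sketch below moves an extra to the root of the dataset dict when its key is a schema field. The helper name and the sample values are invented for the example; this is not the code of the scheming profile:

```python
# Old profiles: a custom DCAT field lives in the "extras" list (sample values are made up)
old_style = {
    "name": "test-dataset",
    "extras": [{"key": "version_notes", "value": "Some notes"}],
}


def promote_extras(dataset_dict, schema_field_names):
    """Return a copy where extras named in the schema become root level fields."""
    result = {k: v for k, v in dataset_dict.items() if k != "extras"}
    remaining = []
    for extra in dataset_dict.get("extras", []):
        if extra["key"] in schema_field_names:
            result[extra["key"]] = extra["value"]
        else:
            # Fields not defined in the schema stay as extras
            remaining.append(extra)
    if remaining:
        result["extras"] = remaining
    return result


new_style = promote_extras(old_style, {"version_notes"})
# new_style == {"name": "test-dataset", "version_notes": "Some notes"}
```

Only keys present in `schema_field_names` are touched, which mirrors the statement above that the scheming profile only affects fields defined in the schema file.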
List fields
The old profiles stored lists as JSON strings:
By using the multiple_text preset, lists are now automatically handled:
The form snippets UI allows providing multiple values.
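For the stored values (as opposed to the UI), the difference can be sketched as follows. `as_list` is a hypothetical helper for reading either representation during a gradual migration, and the sample field contents are invented:

```python
import json

old_value = '["en", "ga"]'  # old profiles: the list serialized as a JSON string
new_value = ["en", "ga"]    # with a list-aware preset: a real list


def as_list(value):
    """Return a list for a list, a JSON-encoded list, or a single plain string."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except ValueError:
            # Not JSON at all: treat it as one plain value
            return [value]
        return parsed if isinstance(parsed, list) else [value]
    return []
```

Code that consumes the dataset dict can then call `as_list(dataset_dict.get("language"))` without caring which profile produced it.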
Repeating subfields
Mapping complex entities like dcat:contactPoint or dct:publisher was very limited, storing a subset of properties of just one linked entity as prefixed extras:
By using the repeating_subfields preset we can consume and present these as proper objects, and store multiple entities for those properties that have 0..n cardinality (see comment in "Issues"):
Repeating subfields are also supported in resources/distributions. In this case complex objects like dcat:accessService were stored as JSON strings:
They now appear as proper objects:
Again, these can be easily managed via the UI thanks to the scheming form snippets.
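A small sketch of the dataset-level change described in this section: prefixed extras for a single entity become one object inside a list. The extra keys (`contact_name`, `contact_email`), the target field name `contact`, the helper and the sample values are all assumptions made for illustration:

```python
def extras_to_subfields(dataset_dict, prefix, field_name, keys=("uri", "name", "email")):
    """Collect `<prefix>_<key>` extras into one dict stored in a list under field_name."""
    entity = {}
    remaining = []
    for extra in dataset_dict.get("extras", []):
        key = extra["key"]
        suffix = key[len(prefix) + 1:]
        if key.startswith(prefix + "_") and suffix in keys:
            entity[suffix] = extra["value"]
        else:
            remaining.append(extra)
    result = {k: v for k, v in dataset_dict.items() if k != "extras"}
    result["extras"] = remaining
    if entity:
        # A list, so that further entities can be stored for 0..n properties
        result[field_name] = [entity]
    return result


old_style = {
    "name": "test-dataset",
    "extras": [
        {"key": "contact_name", "value": "Data team"},
        {"key": "contact_email", "value": "data@example.org"},
        {"key": "access_rights", "value": "public"},
    ],
}
new_style = extras_to_subfields(old_style, prefix="contact", field_name="contact")
```

Here `new_style["contact"]` holds a single object, and appending a second dict to that list is how an additional contact point would be represented.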
Issues
dct:publisher that have 0..1 cardinality, I don't think CKAN supports "non-repeating" subfields so it makes sense to use the repeating_subfields one for now and create a new one in the future.
date and datetime with nice UI form snippets so it's tempting to use them for properties like issued and modified, but these support other formats like xsd:gYear or xsd:gYearMonth which will fail with these presets so we can consider creating a new one that extends the existing ones to support these formats
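The kind of check such an extended preset would need can be sketched like this. It is a simplified standalone function, not a scheming validator (it does not follow the validator signature), and it expects a string; `is_dcat_date` is a hypothetical name:

```python
import re
from datetime import date, datetime

_G_YEAR = re.compile(r"^\d{4}$")                        # e.g. 2024
_G_YEAR_MONTH = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")  # e.g. 2024-05


def is_dcat_date(value):
    """Accept simplified xsd:gYear, xsd:gYearMonth, xsd:date or xsd:dateTime strings."""
    if _G_YEAR.match(value) or _G_YEAR_MONTH.match(value):
        return True
    # Fall back to full ISO dates and datetimes
    for parse in (date.fromisoformat, datetime.fromisoformat):
        try:
            parse(value)
            return True
        except ValueError:
            continue
    return False
```

Timezone suffixes and negative years, which the XSD types also allow, are deliberately left out to keep the sketch short.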