ckan / ckanext-dcat

CKAN ♥ DCAT

Scheming support #281

Open amercader opened 1 month ago

amercader commented 1 month ago

This PR adds initial support for seamless integration between ckanext-dcat and ckanext-scheming. It provides a custom profile that modifies the dataset dicts generated and consumed by the existing profiles so that they play well with the defined scheming presets.

Summary of changes

Compatibility and release plan

Extra care has been taken not to break any existing systems. Sites using the existing euro_dcat_ap and euro_dcat_ap_2 profiles should not see any change in their current parsing and serialization functionality, and these profiles will never change their outputs. Sites willing to migrate to a scheming-based profile can do so by adding the new euro_dcat_ap_scheming profile at the end of their profile chain (the value of the ckanext.dcat.rdf.profiles config option, e.g. ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming), which will modify the existing profile outputs to the format expected by the scheming validators. Note that the scheming profile will only affect fields defined in the schema definition file, so sites can migrate different metadata fields gradually.

This compatibility profile will be released in the next ckanext-dcat version (1.8.0). The upcoming DCAT v3 based profiles for DCAT-AP 3 and DCAT-US 3 will be scheming based and will incorporate the mapping changes described below.

Mapping changes

The main changes between the old processors (parsers and serializers) and the new scheming-based ones are:

Root level fields

Custom DCAT fields that didn't link directly to standard CKAN fields were stored as extras (see all the ones marked extra: here). So the DCAT version_notes field would be stored as:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [
         {"key": "version_notes", "value": "Some version notes"}
    ]
}

In the scheming-based profile, if the field is defined in the scheming schema, it will get stored as a root level field, like all custom dataset properties:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "version_notes": "Some version notes"
}
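
As a rough illustration of the root-level mapping described above, the following sketch moves extras to the dataset root when their key is defined in the schema. The function name `promote_extras` and the `schema_field_names` argument are hypothetical and are not part of ckanext-dcat; the actual profile may work differently.

```python
def promote_extras(dataset_dict, schema_field_names):
    """Return a copy of dataset_dict with schema-defined extras moved to the root.

    Hypothetical helper for illustration only, not the ckanext-dcat implementation.
    """
    result = {k: v for k, v in dataset_dict.items() if k != "extras"}
    remaining = []
    for extra in dataset_dict.get("extras", []):
        if extra["key"] in schema_field_names:
            # Field is defined in the schema: store it as a root-level field
            result[extra["key"]] = extra["value"]
        else:
            # Field is unknown to the schema: leave it as an extra
            remaining.append(extra)
    if remaining:
        result["extras"] = remaining
    return result


old = {
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [{"key": "version_notes", "value": "Some version notes"}],
}
new = promote_extras(old, {"version_notes"})
```

Fields that are not listed in `schema_field_names` stay in `extras`, which mirrors the note above that only fields defined in the schema are affected.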

List fields

The old profiles stored lists as JSON strings:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [
         {"key": "conforms_to", "value":"[\"Standard 1\", \"Standard 2\"]"}
    ],
    "resources": [
        {
             "name": "Some resource",
             "documentation": "[\"http://dataset.info.org/distribution1/doc1\", \"http://dataset.info.org/distribution1/doc2\"]"
        }
    ]
}

By using the multiple_text preset, lists are now automatically handled:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "conforms_to": [
         "Standard 1", 
         "Standard 2"
    ],
    "resources": [
        {
             "name": "Some resource",
             "documentation": [
                 "http://dataset.info.org/distribution1/doc1", 
                 "http://dataset.info.org/distribution1/doc2"
             ]
        }
    ]
}
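
For sites migrating existing data, the difference between the two representations can be sketched as a small conversion step. The helper below is hypothetical (it is not part of ckanext-dcat or ckanext-scheming) and assumes the caller knows which fields use the multiple_text preset.

```python
import json


def decode_list_fields(data_dict, list_field_names):
    """Return a copy of data_dict where JSON-encoded list strings become lists.

    Hypothetical helper for illustration; values that are not valid JSON
    lists are left untouched.
    """
    result = dict(data_dict)
    for name in list_field_names:
        value = result.get(name)
        if not isinstance(value, str):
            continue
        try:
            parsed = json.loads(value)
        except ValueError:
            # Not JSON at all: keep the original string
            continue
        if isinstance(parsed, list):
            result[name] = parsed
    return result


resource = {
    "name": "Some resource",
    "documentation": '["http://dataset.info.org/distribution1/doc1", '
                     '"http://dataset.info.org/distribution1/doc2"]',
}
converted = decode_list_fields(resource, ["documentation"])
```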

The form snippet UI allows users to provide multiple values:

Screenshot 2024-05-22 at 10-15-58 Dataset - CKAN

Repeating subfields

Mapping complex entities like dcat:contactPoint or dct:publisher was very limited, storing a subset of properties of just one linked entity as prefixed extras:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [
        {"key":"contact_name","value":"PointofContact"},
        {"key":"contact_email","value":"contact@some.org"}
    ]
}

By using the repeating_subfields preset we can consume and present these as proper objects, and store multiple entities for those properties that have 0..n cardinality (see comment in "Issues"):

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "contact": [
        {
            "name": "Point of Contact 1",
            "email": "contact1@some.org"
        },
        {
            "name": "Point of Contact 2",
            "email": "contact2@some.org"
        }
    ]
}
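
To make the relationship between the two shapes concrete, here is a minimal sketch that collects the old prefixed extras into a single-item list of subfield dicts. The function name and the `contact_` prefix handling are assumptions made for illustration; the profile itself may map these fields differently.

```python
def contact_from_extras(extras, prefix="contact_"):
    """Collect prefixed extras into a list holding one subfield dict.

    Hypothetical helper: the old format could only describe one entity,
    so the result has at most one item.
    """
    entity = {}
    for extra in extras:
        key = extra["key"]
        if key.startswith(prefix):
            # "contact_name" -> "name", "contact_email" -> "email"
            entity[key[len(prefix):]] = extra["value"]
    return [entity] if entity else []


extras = [
    {"key": "contact_name", "value": "PointofContact"},
    {"key": "contact_email", "value": "contact@some.org"},
]
contact = contact_from_extras(extras)
```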

Repeating subfields are also supported in resources/distributions. Previously, complex objects like dcat:accessService were stored as JSON strings:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "resources": [
        {
             "name": "Some resource",
             "access_services": "[{\"availability\": \"http://publications.europa.eu/resource/authority/planned-availability/AVAILABLE\", \"title\": \"Sparql-end Point\", \"endpoint_description\": \"SPARQL url description\", \"license\": \"http://publications.europa.eu/resource/authority/licence/COM_REUSE\", \"access_rights\": \"http://publications.europa.eu/resource/authority/access-right/PUBLIC\", \"description\": \"This SPARQL end point allow to directly query the EU Whoiswho content (organization / membership / person)\", \"endpoint_url\": [\"http://publications.europa.eu/webapi/rdf/sparql\"], \"uri\": \"\", \"access_service_ref\": \"N2ff5798aac56447e89438cc838512d26\"}]"
        }
    ]
}

They now appear as proper objects:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "resources": [
        {
             "name": "Some resource",
             "access_services": [
                 {
                     "availability": "http://publications.europa.eu/resource/authority/planned-availability/AVAILABLE",
                     "title": "Sparql-end Point",
                     "endpoint_description": "SPARQL url description",
                     "license": "http://publications.europa.eu/resource/authority/licence/COM_REUSE",
                     "access_rights": "http://publications.europa.eu/resource/authority/access-right/PUBLIC",
                     "description": "This SPARQL end point allow to directly query the EU Whoiswho content (organization / membership / person)",
                     "endpoint_url": [
                         "http://publications.europa.eu/webapi/rdf/sparql"
                     ],
                     "uri": ""
                 }
             ]
        }
    ]
}

Again, these can be easily managed via the UI thanks to the scheming form snippets:

Screenshot 2024-05-22 at 10-56-35 Dataset - CKAN

Issues

EricSoroos commented 1 month ago

@amercader I've been working in DCAT this week, including adding spec compliant HVD 2.2.0 output and scheming portions to the current 1.7 version. (somewhat split across dcat and our schema extension at the moment).

A couple of things have come up for making the output compliant with the HVD SHACL files (https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/#validation):

1) There are some items that need to be typed, e.g. licenses. This is a first cut, and I want to refactor this into the add_triples... methods: https://github.com/derilinx/ckanext-dcat/blob/dcat-hvd-2.2.0/ckanext/dcat/profiles.py#L914

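    def _add_with_class(self, dataset_dict, dataset_ref, key, predicate, _type, _class, list_value=False):
        """Add the value(s) of `key` as typed nodes: each node gets an rdf:type of `_class`."""
        value = self._get_dataset_value(dataset_dict, key)

        def _add(v):
            ref = _type(v)
            # Declare the node's class, then link it from the dataset/distribution
            self.g.add((ref, RDF.type, _class))
            self.g.add((dataset_ref, predicate, ref))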

        if value:
            if list_value:
                for v in self._read_list_value(value):
                    _add(v)
            else:
                _add(value)
...
            self._add_with_class(resource_dict, distribution, 'license', DCT.license, URIRefOrLiteral, DCT.LicenseDocument)

gives us something like this:

...
<http://www.opendefinition.org/licenses/cc-by> a dct:LicenseDocument .

<http://data.europa.eu/eli/reg_impl/2023/138/oj> a <http://data.europa.eu/eli/ontology#LegalResource> .

<https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec/resource/52dcb446-d1f1-40d2-a515-bd708a57b9c6> a dcat:Distribution ;
    dcatap:applicableLegislation <http://data.europa.eu/eli/reg_impl/2023/138/oj> ;
    dct:format "HTML" ;
    dct:issued "2024-05-20T16:12:07"^^xsd:dateTime ;
    dct:license <http://www.opendefinition.org/licenses/cc-by> ;
    dct:modified "2024-05-20T17:00:51"^^xsd:dateTime ;
    dct:title "Test" ;
    dcat:accessURL <https://test.staging.derilinx.com/> .

2) Codelists are important, e.g., the HVD Category needs to be from this list: https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/high-value-dataset-category (which when it's not being slammed, has an RDF file with a skos:Concept and entries, each with a prefLabel from each official EU language.) (dl is here: https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2F29a21fd5-5c6f-11ee-9220-01aa75ed71a1.0001.02%2FDOC_1&fileName=high-value-dataset-category.rdf)

The codelists get rendered like this in the .ttl:

<https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec> a dcat:Dataset ;
    dcatap:applicableLegislation <http://data.europa.eu/eli/dir/2007/2/2019-06-26>,
        <http://data.europa.eu/eli/reg_impl/2023/138/oj> ;
    dcatap:hvdCategory <http://data.europa.eu/bna/c_dd313021> ;
    dct:identifier "242e33cf-a097-4f59-94f3-25fcddeffaec" ;
    dct:issued "2024-05-20T16:11:40"^^xsd:dateTime ;
    dct:language "en" ;
    dct:modified "2024-05-20T17:00:51"^^xsd:dateTime ;
    dct:publisher <https://test.staging.derilinx.com/organization/b30a8777-1478-43e1-8dcb-9beded4f5052> ;
    dct:title "Test Dataset" ;
    dcat:distribution <https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec/resource/52dcb446-d1f1-40d2-a515-bd708a57b9c6> .

<http://data.europa.eu/bna/c_dd313021> a skos:Concept ;
    skos:inScheme <http://data.europa.eu/bna/asd487ae75> .

amercader commented 1 month ago

@EricSoroos thanks for the feedback:

  1. HVD 2.2.0 support sounds amazing. Is this developed in a separate profile built on top of euro_dcat_ap_2? If so, it would be great to have it upstream. I see it as not directly related to this PR; once support for DCAT-AP 2.1 is ready we can create a separate profile and schema for HVD 2.2.0.
  2. More generally regarding SHACL validation, I've been thinking about integrating it as part of the test suite or even as a command that site maintainers can run as a way to "certify" support for the different DCAT specs.
  3. Types: absolutely. I was literally thinking about this today in relation to the different types that dates can have according to the spec (see "Issues" in the description), but if it's a requirement of the SHACL validation, even more so. I like the approach of extending the _add_triples_... utility functions.
  4. Codelists: yes, that's definitely on the list for the DCAT v3 profiles, as there are controlled vocabularies used, but of course it also makes sense if needed for HVD. We could explore using choices or more likely choices_helper presets for required or recommended fields, with CLI commands to import them into the main or datastore database, plus a form snippet that shows the options (or autocompletes them if there are a lot of them).

Great to see you are working on this same area. If you have any feedback on the general approach followed for scheming support it would be great to hear it.

amercader commented 1 month ago

@seitenbau-govdata @bellisk would love to know your take on this, and see if this approach would play well with how you are using ckanext-dcat

EricSoroos commented 1 month ago

In terms of HVD support, the current EU DCAT 2 implementation is close, at least, it has all of the fields. This commit: https://github.com/derilinx/ckanext-dcat/commit/d5ef9f47346e4963de85daa2c62490dd1966557e is the difference, and it's only the codelist and two types that were required. There are some other compliance issues, like one of the license or rights needs to be available, and the applicable_legislation has to have at least one specific value. I'm looking at validation level stuff for those (legislation already done, license/rights not). I'm at the point of thinking that these things are more general, so _add_hvd_category should be _add_from_codelist.
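
The two validation-level rules mentioned above could be sketched roughly as follows. This is a hedged illustration only: the function name is hypothetical, the required legislation value is assumed to be the HVD implementing regulation URI that appears in the Turtle output earlier in this thread, and the license/rights rule is assumed to apply per resource with `license` and `rights` as field names.

```python
# Assumption: this is the "specific value" required in applicable_legislation
HVD_REGULATION = "http://data.europa.eu/eli/reg_impl/2023/138/oj"


def hvd_problems(dataset_dict):
    """Return a list of human-readable problems; an empty list means the checks passed.

    Hypothetical check, assuming applicable_legislation is stored as a list.
    """
    problems = []
    legislation = dataset_dict.get("applicable_legislation") or []
    if HVD_REGULATION not in legislation:
        problems.append("applicable_legislation must include " + HVD_REGULATION)
    for resource in dataset_dict.get("resources", []):
        # Assumption: at least one of license or rights is needed per resource
        if not (resource.get("license") or resource.get("rights")):
            problems.append(
                "resource %r needs a license or rights value" % resource.get("name")
            )
    return problems


dataset = {
    "applicable_legislation": [HVD_REGULATION],
    "resources": [
        {"name": "Test", "license": "http://www.opendefinition.org/licenses/cc-by"},
        {"name": "No licence"},
    ],
}
problems = hvd_problems(dataset)
```

In a CKAN site, checks like these would more likely live in scheming validators than in a standalone function; the sketch only shows the logic.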

I'm not clear that we'd necessarily want to be adding a separate profile for this -- Inheritance is really tricky when you're blatting in items to a graph, and may need to override just one piece of it. From what I can tell, the extra profiles tend to be aggregative and compatible, so realistically, there are potentially a few extra fields per entity and/or additional codes/required fields. Also, I think that the changes here are more of the form of "potentially backwards incompatible fixing the implementation" rather than actually adding support for the profile.

FWIW, I think this has been the general take previously, e.g, the geo fields are added from GeoDCAT.

For the Codelists, (at least on the scheming side) I've got something like this in my schema:

    {
      "field_name": "hvd_category",
      "grouping": "High Value Datasets",
      "label": "High Value Dataset Category",
      "form_snippet": "select.html",
      "validators": "ignore_missing",
      "choices_helper": "dlxschema_codelist_choices",
      "codelist": "high-value-dataset-category",
      "help_text": {
        "en": "EU Category for HVD."
      }
    },

And then the choices_helper is this:

import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def _load_codelist(choices_path):
    """Cache the JSON load, so that we're only actually reading once per invocation."""
    return json.loads(choices_path.read_text())


def codelist_choices(field):
    """Get the choices corresponding to the code list from the codelists directory.

    :param field: dict, scheming field definition; its 'codelist' key holds the
        name of the codelist, not including the extension
    :returns: list of scheming choices
    """
    name = field.get('codelist', None)
    if not name:
        return []
    choices_path = Path(__file__).parent / 'codelists' / (name + ".json")
    # exists is a method: without the call, the bare attribute is always truthy
    if not choices_path.exists():
        return []

    return _load_codelist(choices_path)

The codelist directory has the .rdf and a .json converted from it, with the languages I'm interested in (though realistically, it wouldn't hurt to put all the EU languages in):

[
  {
    "label": {
      "en": "Geospatial",
      "ga": "Geosp\u00e1s\u00fail"
    },
    "value": "http://data.europa.eu/bna/c_ac64a52d"
  },
  {
    "label": {
      "en": "Earth observation and environment",
      "ga": "Faire na cruinne agus an comhshaol"
    },
    "value": "http://data.europa.eu/bna/c_dd313021"
  },
...

Right now, this is spread over my schema plugin and the dcat plugin, but the next iteration is going to need to pull the codelists into dcat so that I can kick out the prefLabels there.
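
For readers wondering how a .json like the one above could be derived from the .rdf, here is a minimal standard-library sketch. It assumes the codelist is serialized as RDF/XML with `skos:Concept` elements carrying `rdf:about` and language-tagged `skos:prefLabel` children; the embedded sample is hand-written for illustration, and the real EU file may be structured differently (a full RDF parser such as rdflib would be more robust).

```python
import json
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample, assumed to resemble the downloaded codelist
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://data.europa.eu/bna/c_ac64a52d">
    <skos:prefLabel xml:lang="en">Geospatial</skos:prefLabel>
    <skos:prefLabel xml:lang="ga">Geosp\u00e1s\u00fail</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""


def codelist_to_choices(rdf_xml, languages):
    """Build scheming-style choices from the skos:Concept entries in rdf_xml."""
    root = ET.fromstring(rdf_xml)
    choices = []
    for concept in root.iter("{%s}Concept" % SKOS_NS):
        labels = {}
        for label in concept.findall("{%s}prefLabel" % SKOS_NS):
            lang = label.get(XML_LANG)
            # Keep only the languages the site is interested in
            if lang in languages:
                labels[lang] = label.text
        choices.append(
            {"label": labels, "value": concept.get("{%s}about" % RDF_NS)}
        )
    return choices


choices = codelist_to_choices(SAMPLE_RDF, {"en", "ga"})
print(json.dumps(choices, indent=2))
```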

EricSoroos commented 1 month ago

A little more thinking on the relationship between the DCAT-AP base and the extension profiles (e.g. HVD, Geo, etc.).

I think that it would definitely make sense to have the individual profiles have either pluggable schema sections or diffs/inheritance against the core schema. E.g., Site A needs HVD, Site B needs HVD + Geo. We're using our schema_field_groups for this, so there's an HVD tab in the dataset view.

image

At the graph generation level, I don't know if there's a clean way to do this in an inherited manner. Right now, the EUDCAT2 is a combination of v1 + base + HVD + Geo. There's no issue adding the additional profiles here if the data doesn't support it.

Maybe a better way to do this would be composition rather than inheritance. E.g., have the profile configure a set of [ckan object]_to_graph methods, and those additional profiles would only be responsible for those items that aren't part of the base. As it is, it feels like the profile inheritance is quite chunky for adding a few fields.
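
A rough sketch of what that composition could look like, with plain tuples standing in for rdflib triples. All names here (`ComposedProfile`, `base_dataset_to_graph`, `hvd_dataset_to_graph`) are hypothetical and do not exist in ckanext-dcat; the point is only that each extension contributes its own fields without subclassing the base.

```python
def base_dataset_to_graph(dataset_dict, dataset_ref):
    """Contribute the triples that belong to the base profile."""
    yield (dataset_ref, "dct:title", dataset_dict["title"])


def hvd_dataset_to_graph(dataset_dict, dataset_ref):
    """Contribute only the HVD-specific triples, and only if the data has them."""
    category = dataset_dict.get("hvd_category")
    if category:
        yield (dataset_ref, "dcatap:hvdCategory", category)


class ComposedProfile:
    """A profile configured with a list of dataset-to-graph contributors."""

    def __init__(self, dataset_to_graph):
        self.dataset_to_graph = list(dataset_to_graph)

    def graph_from_dataset(self, dataset_dict, dataset_ref):
        triples = []
        for contribute in self.dataset_to_graph:
            triples.extend(contribute(dataset_dict, dataset_ref))
        return triples


# Site A needs base + HVD; a site needing Geo would append another function
site_a = ComposedProfile([base_dataset_to_graph, hvd_dataset_to_graph])
triples = site_a.graph_from_dataset(
    {"title": "Test Dataset", "hvd_category": "http://data.europa.eu/bna/c_dd313021"},
    "https://example.org/dataset/1",
)
```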

wardi commented 1 month ago

I deliberately avoided implementing inheritance in scheming. Instead dcat fields can be defined as presets (including the field ids) and specific schemas then populate fields from the presets like

- preset: dcat_dataset_contact
- preset: dcat_dataset_publisher
...

This gives complete control from the specific schemas without needing to resolve inheritance issues.

EricSoroos commented 1 month ago

Fair enough -- It's not too hard to make sure that you have all of the chunks required for a specific profile, though for HVD they're scattered over both the dataset and resource.

amercader commented 1 month ago

Thanks @EricSoroos I need to think a bit more about how best to support extensions but I don't want to get too distracted from the original goal of this PR (having a base schema for the current dcat-ap 2 profile). If we can come up with a better way of extending profiles, by all means let's explore it.

After looking at the actual changes, my gut feeling is that the HVD support can be directly incorporated in the current profile for parsing/serializing without needing a separate one, and the scheming bits can be grouped in a preset (or two, one for dataset and one for resource fields). Have you implemented GeoDCAT support as well, or was it just an example? It would be good to see different approaches for this.

EricSoroos commented 1 month ago

We're reviewing Geo as well. (and a couple others, and localization)

Realistically, HVD is small except for some of the out of our scope spec issues (e.g., permanent identifiers, legislation in the MS to indicate that it's a HVD). It's also already included, so splitting it out involves some compatibility issues.

Geo is in a similar space -- it's been included forever, so if we do add a separate profile for it, it would be for SHACL-compliant, backwards-incompatible changes to the existing extension.

amercader commented 3 weeks ago

I think this is now ready to go, any further work should be done in separate PRs as this has grown quite a lot.

Highlights are:

wardi commented 3 weeks ago

This looks good.

I would be tempted to put more of the logic in the schemas but this extension needs to maintain backwards compatibility and ckanext-scheming-less operation so your approach makes sense.