ckan / ckanext-dcat

CKAN ♥ DCAT
https://docs.ckan.org/projects/ckanext-dcat

_object_value and _object_value_list return BNode identifiers #289

Open EricSoroos opened 4 months ago

EricSoroos commented 4 months ago

While reviewing the scheming PR #281, I found a couple of places where the DCAT RDF Harvester, in JSON-LD format, has trouble with in-the-wild DCAT 2.1.1 feeds, specifically an ESRI AGOL Inspire feed: https://opendata-ifigeo.hub.arcgis.com/api/feed/dcat-ap/2.1.1.json. This doesn't appear to be related to the PR, so I'm reporting it here.

Generally, _object_value and _object_value_list are returning the string value of the node, and in cases where the node has a type and something other than a direct value, this returns the internal node id of the BNode.

For example, with this (not terribly useful, but syntactically representative) provenance:

            "dct:provenance": {
                "@type": "dct:ProvenanceStatement",
                "@label": {
                    "@value": ""
                }
            },

We extract: 'provenance', ('extras', 19, 'value'): 'Nc0c0162afbe140a5afa2736468e1da4c',.

Similarly, the theme:

            "dcat:theme": {
                "@type": "skos:Concept",
                "skos:prefLabel": "Geospatial"
            },

also returns an internal node id. This is almost never going to be a useful result, because the identifiers are ephemeral and only valid while the graph is in memory.

I'm not clear on the best course of action here, I see a couple.

amercader commented 4 months ago

I think these cases should be handled in the parser methods, and I agree that it's useless to store the BNode value.

Perhaps if the object in _object_value (or one of the items in _object_value_list) is a BNode, then we inspect that node and extract whatever makes more sense: a Literal if it's there, or the value of skos:prefLabel if it's a node of type skos:Concept. That should hopefully cover the theme case.

BTW this particular serialization for provenance is not valid JSON-LD: @label is not a valid keyword. I'm by no means a JSON-LD expert, but I think @value should be used instead (otherwise rdflib cannot extract anything from that node):

            "dct:provenance": {
                "@type": "dct:ProvenanceStatement",
                "@value": "Something actually useful"
            },