ckan / ckanext-dcat

CKAN ♥ DCAT
https://docs.ckan.org/projects/ckanext-dcat

dcat:mediaType must be a resource #237

Open jze opened 1 year ago

jze commented 1 year ago

The range of dcat:mediaType has been tightened from dct:MediaTypeOrExtent to dct:MediaType as part of the revision of DCAT. https://www.w3.org/TR/vocab-dcat-2/#Property:distribution_media_type

Currently a URI or a literal is returned. https://github.com/ckan/ckanext-dcat/blob/master/ckanext/dcat/profiles.py#L1411 Only a URI should be used at this point.

jlanza commented 1 year ago

Maybe a newbie question: why is it not required to explicitly state that the resource the URI refers to is an instance of the dct:MediaType class?

if mimetype:
    mimetype_ref = URIRef(mimetype)
    g.add((mimetype_ref, RDF.type, DCT.MediaType))
    g.add((distribution, DCAT.mediaType, mimetype_ref))

jlanza commented 1 year ago

I would like to add another comment about the same issue with the dcat:mediaType value. According to the DCAT-AP spec, both dct:format and dcat:mediaType take a dct:MediaType.

Given that, if you use the full IANA URI, for example https://www.iana.org/assignments/media-types/application/ld+json, or the URI from the data.europa.eu vocabulary as suggested by the European Data Portal Metadata Quality Assessment Methodology, CKAN does not show the resource preview.

Below are two examples of what I mean. The issue affects not just the preview but also the way the Dataset is later serialized.

  1. Format set as JSON_LD and mediaType as the short IANA definition application/ld+json. You can see the preview.

[Screenshot: jsonld-noref]

In this case the serialization of the properties of the Dataset results in:

"dct:format": "JSON_LD"
"dcat:mediaType": "application/ld+json"

  2. Format set as the full URI http://publications.europa.eu/resource/authority/file-type/JSON_LD and mediaType as https://www.iana.org/assignments/media-types/application/ld+json. You cannot see the preview.

[Screenshot: jsonld-ref]

In this case the serialization of the properties of the Dataset as JSON-LD results in:

"dct:format": {
    "@id": "http://publications.europa.eu/resource/authority/file-type/JSON_LD"
},
"dcat:mediaType": {
    "@id": "https://www.iana.org/assignments/media-types/application/ld+json"
}

As you can see, the first one is not fully compliant with DCAT-AP, but CKAN behaves as expected. The second is the other way round: compliant, but CKAN does not work as expected.

So I wonder whether it would be sensible to modify the dcat extension, mainly in the profiles definition, to check whether the values of format and mediaType are URI references or plain values. If they are URIs, we leave them untouched. If they are not, the logical thing would be to search for a URI that "resembles" the value, or to directly prepend the IANA or Europa vocabulary domain and path to get the full URI.
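
A minimal sketch of the "prepend" variant of that idea, using only the standard library. The function names `expand_media_type` and `expand_format` are hypothetical and not part of ckanext-dcat; the two base URIs are the ones from the examples above. The "search for one that resembles" variant is not covered, and values that do not look like a media type are returned unchanged.

```python
import re
from urllib.parse import urlsplit

# Base URIs taken from the two examples in this comment.
IANA_BASE = "https://www.iana.org/assignments/media-types/"
EU_FILE_TYPE_BASE = "http://publications.europa.eu/resource/authority/file-type/"

# Rough "type/subtype" shape of an IANA media type (assumption: no parameters).
_MEDIA_TYPE_RE = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*$"
)


def is_http_uri(value):
    """Return True if value looks like an absolute http(s) URI."""
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def expand_media_type(value):
    """Hypothetical helper: turn 'application/ld+json' into a full IANA URI."""
    value = value.strip()
    if is_http_uri(value):
        return value  # already a URI: leave untouched
    if _MEDIA_TYPE_RE.match(value):
        return IANA_BASE + value
    return value  # not recognisable: leave untouched


def expand_format(value):
    """Hypothetical helper: turn 'JSON_LD' into a full EU file-type URI."""
    value = value.strip()
    if is_http_uri(value):
        return value
    # Only prepend when the value is a single token such as JSON_LD.
    if re.match(r"^[A-Za-z0-9_]+$", value):
        return EU_FILE_TYPE_BASE + value
    return value
```

Note that prepending alone does not guarantee the resulting URI exists in either vocabulary, so a lookup against the published lists would still be needed for a robust solution.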

What do you think? Should I try to work that out?

Thanks for your help and comments.

seitenbau-govdata commented 1 year ago

> The range of dcat:mediaType has been tightened from dct:MediaTypeOrExtent to dct:MediaType as part of the revision of DCAT. https://www.w3.org/TR/vocab-dcat-2/#Property:distribution_media_type
>
> Currently a URI or a literal is returned. https://github.com/ckan/ckanext-dcat/blob/master/ckanext/dcat/profiles.py#L1411 Only a URI should be used at this point.

Yes, but the value is serialized as a literal only if it isn't a valid URI. This avoids producing an invalid serialized graph. That fallback is more or less necessary, because the Python library rdflib will also serialize a URI reference whose value is not a valid URI.
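
The fallback described above can be illustrated with a standard-library sketch. This is not the code in profiles.py: the helper names are hypothetical, and the term is rendered as an N-Triples-style string rather than an rdflib object so the example runs without dependencies. The validity check assumes "valid" means an absolute http(s) URI containing none of the characters the N-Triples grammar forbids inside `<...>`.

```python
from urllib.parse import urlsplit

# Characters that may not appear unescaped inside an N-Triples IRI (<...>).
_FORBIDDEN_IRI_CHARS = set('<>"{}|^`\\')


def is_safe_uri(value):
    """Return True if value can be written as <value> without breaking the graph."""
    if any(ch in _FORBIDDEN_IRI_CHARS or ord(ch) <= 0x20 for ch in value):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def media_type_term(value):
    """Hypothetical helper: emit a URI term if possible, else a plain literal."""
    value = value.strip()
    if is_safe_uri(value):
        return "<" + value + ">"
    # Fall back to a literal so the serialized graph stays parseable.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


if __name__ == "__main__":
    for v in (
        "https://www.iana.org/assignments/media-types/application/ld+json",
        "application/ld+json",
        "http://example.org/bad uri",
    ):
        print(media_type_term(v))
```

With this approach a short value such as application/ld+json still ends up as a literal, which is exactly the case the original report objects to; the fallback only protects the serialization, it does not make the value DCAT 2 compliant.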