SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

PO: file-type: Offer All Information Within the RDF #234

Closed init-dcat-ap-de closed 12 months ago

init-dcat-ap-de commented 1 year ago

I was looking at the web information about ARC_GZ (as an example): https://op.europa.eu/de/web/eu-vocabularies/concept/-/resource?uri=http://publications.europa.eu/resource/authority/file-type/ARC_GZ

The page shows the mime-type as "application/gzip" and the file extension as ".arc.gz". I looked for this information in the RDF representation at http://publications.europa.eu/resource/authority/file-type/ARC_GZ but could not find it there. At http://publications.europa.eu/resource/authority/file-type/CSV the PO states that the concept dcterms:conformsTo https://www.iana.org/assignments/media-types/text/csv

So the media type can be found there, but the file extension is still missing from the RDF. Most (?) file extensions are identical to, or can be derived from, data such as dc:identifier. But since the actual information exists, deriving it should not be how consumers have to obtain it.

I also submitted this issue via the web form. I add it here for further reference and in case anyone has an idea which properties would be a good fit for this use case.

H-a-g-L commented 1 year ago

The IANA media-type and extension are encoded using http://publications.europa.eu/ontology/authority/legacy-code, although not all file-types in the list have them. You may use this query to see which ones do.

init-dcat-ap-de commented 1 year ago

Thank you @ODP-hil, this looks very useful. Could you post the SPARQL query? So there are already properties for the IANA type and the file extension.

I would love to see them in the current RDF.

H-a-g-L commented 1 year ago

Dear @init-dcat-ap-de, I am pasting an updated query with the MIME type encoded as an xlNotation. This is a mandatory property for all Concepts. The other IANA variables are probably redundant at this point, but I left them in the query so you can evaluate which ones to use (endpoint: https://publications.europa.eu/webapi/rdf/sparql):

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX euvoc: <http://publications.europa.eu/ontology/euvoc#>
PREFIX at: <http://publications.europa.eu/ontology/authority/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?OPfileType ?notationIANAType ?ianaMediaType ?ianaCode ?fileExtension
FROM <http://publications.europa.eu/resource/authority/file-type>
WHERE {
  # IANA media type notation (mandatory for every concept)
  ?OPfileType skos:inScheme <http://publications.europa.eu/resource/authority/file-type> ;
      euvoc:xlNotation ?notation .
  ?notation dct:type <http://publications.europa.eu/resource/authority/notation-type/IANA_MT> ;
      euvoc:xlCodification ?notationIANAType .

  OPTIONAL { ?OPfileType dct:conformsTo ?ianaMediaType . }

  # Legacy media type code, where present; the dc:source test sits inside
  # the OPTIONAL so it only applies to mapped codes of this file type
  OPTIONAL {
    ?OPfileType at:op-mapped-code ?legacyCodeIana .
    ?legacyCodeIana dc:source "mime-type-cellar" ;
        at:legacy-code ?ianaCode .
  }

  # Legacy file extension, where present
  OPTIONAL {
    ?OPfileType at:op-mapped-code ?legacyCodeExtension .
    ?legacyCodeExtension dc:source "file-extension" ;
        at:legacy-code ?fileExtension .
  }
}
ORDER BY ?OPfileType

init-dcat-ap-de commented 1 year ago

Thank you, so the file extension is within the at:op-mapped-code-node. For 7Z it is ".7z"

Unfortunately, the at:op-mapped-code-node is a blank node whose description is not included in the RDF document. We only get:

<ns9:op-mapped-code rdf:nodeID="b123813191" />

So the problem is probably in the export generation of the RDF files.
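
A minimal sketch of how such a gap could be detected automatically, assuming the document is RDF/XML like the line quoted above. The function name `dangling_node_ids`, the wrapping `rdf:Description` element and the 7Z subject URI in the sample are illustrative assumptions, and the check is only a heuristic: an element that carries `rdf:nodeID` and has child elements is treated as describing the node, an empty one as merely pointing at it.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NODE_ID = f"{{{RDF_NS}}}nodeID"


def dangling_node_ids(rdf_xml: str) -> set[str]:
    """Return rdf:nodeID values that are referenced but never described."""
    root = ET.fromstring(rdf_xml)
    referenced: set[str] = set()
    described: set[str] = set()
    for element in root.iter():
        node_id = element.get(NODE_ID)
        if node_id is None:
            continue
        # Heuristic: child elements mean the blank node is described here;
        # an empty element is only a reference to it.
        if len(element) > 0:
            described.add(node_id)
        else:
            referenced.add(node_id)
    return referenced - described


# Hypothetical sample built around the line quoted in this comment.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ns9="http://publications.europa.eu/ontology/authority/">
  <rdf:Description rdf:about="http://publications.europa.eu/resource/authority/file-type/7Z">
    <ns9:op-mapped-code rdf:nodeID="b123813191" />
  </rdf:Description>
</rdf:RDF>"""

print(dangling_node_ids(sample))  # {'b123813191'}
```

An empty result would mean every referenced blank node is also described somewhere in the same document.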

H-a-g-L commented 1 year ago

Yes indeed, but you can download the RDF (either SKOS or SKOS-XL) directly from the download tab of the NAL page (https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/file-type). In the RDF this is encoded as follows:

        <at:op-mapped-code>
            <at:MappedCode>
                <dc:source>file-extension</dc:source>
                <at:legacy-code>.7z</at:legacy-code>
            </at:MappedCode>
        </at:op-mapped-code>
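
As a rough illustration, the structure above could be read with Python's standard library as follows. This is a sketch under assumptions: the helper name `mapped_codes`, the wrapping `rdf:RDF` / `rdf:Description` elements and the 7Z subject URI in the sample are made up for the example, and the namespace URIs are the ones used for `at:` and `dc:` in the query earlier in this thread.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "at": "http://publications.europa.eu/ontology/authority/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def mapped_codes(rdf_xml: str, source: str) -> list[str]:
    """Collect at:legacy-code values of all at:MappedCode nodes with the given dc:source."""
    root = ET.fromstring(rdf_xml)
    codes = []
    for mapped in root.iter(f"{{{NS['at']}}}MappedCode"):
        if mapped.findtext("dc:source", namespaces=NS) == source:
            code = mapped.findtext("at:legacy-code", namespaces=NS)
            if code is not None:
                codes.append(code)
    return codes


# Hypothetical wrapper around the fragment shown above.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:at="http://publications.europa.eu/ontology/authority/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://publications.europa.eu/resource/authority/file-type/7Z">
    <at:op-mapped-code>
      <at:MappedCode>
        <dc:source>file-extension</dc:source>
        <at:legacy-code>.7z</at:legacy-code>
      </at:MappedCode>
    </at:op-mapped-code>
  </rdf:Description>
</rdf:RDF>"""

print(mapped_codes(sample, "file-extension"))  # ['.7z']
```

The function scans the whole document, so it does not tell which file type a code belongs to; for that, the search would have to start from the element describing a single concept.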

bertvannuffelen commented 1 year ago

FYI, the DCAT-AP SHACL shapes rely on content negotiation so that the DCAT-AP validator can download the codelists dynamically (validator options with "full" in the name use the imports). See https://github.com/SEMICeu/DCAT-AP/blob/master/releases/2.1.1/dcat-ap_2.1.1_shacl_mdr_imports.ttl

As a follow-up action, we might have to check whether this list is still up to date.
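
For readers unfamiliar with content negotiation, the idea is roughly the following: the client requests the concept URI and names the RDF serialization it wants in the `Accept` header. This is only a sketch; the helper names `rdf_request` and `fetch_rdf` are hypothetical, `application/rdf+xml` is an assumed choice of media type, and which serializations the server actually offers is not stated in this thread.

```python
import urllib.request

# Concept URI taken from the issue description above.
FILE_TYPE_CSV = "http://publications.europa.eu/resource/authority/file-type/CSV"


def rdf_request(uri: str, media_type: str = "application/rdf+xml") -> urllib.request.Request:
    """Build a GET request that asks for an RDF serialization via the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch_rdf(uri: str, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw response body (requires network access)."""
    with urllib.request.urlopen(rdf_request(uri), timeout=timeout) as response:
        return response.read()


request = rdf_request(FILE_TYPE_CSV)
print(request.get_header("Accept"))  # application/rdf+xml
```

Calling `fetch_rdf(FILE_TYPE_CSV)` would perform the actual download. As discussed earlier in this thread, a document obtained this way may reference blank nodes without describing them.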

bertvannuffelen commented 1 year ago

@init-dcat-ap-de I suppose @ODP-hil explained the organisation of the EU NAL file type and that we can close this exchange.

init-dcat-ap-de commented 12 months ago

op-info-helpdesk@publications.europa.eu wrote:

As you can see in the actual source code of the File Type Authority table (can be downloaded from here: File type - EU Vocabularies - Publications Office of the EU (europa.eu)), there are no blank nodes in the data. Consequently, you will also have no issue if you access the data by means of SPARQL queries on the SPARQL endpoint (https://publications.europa.eu/webapi/rdf/sparql). Nevertheless, we acknowledge that the way the data is displayed when accessing the URI does present blank nodes. Unfortunately, the issue is related to the rendering mechanism of the website. We are aware of the situation and are looking for solutions to eliminate it. Until then, please use the standard SPARQL endpoint to access the data.
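
A minimal sketch of that recommended route, using only Python's standard library and the standard SPARQL protocol (a GET request with a `query` parameter and a JSON results `Accept` header). The helper names `build_request`, `parse_bindings` and `run_query` are hypothetical, and the query is a reduced variant of the one posted earlier in this thread, limited to file extensions; that the endpoint accepts this exact form is an assumption.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://publications.europa.eu/webapi/rdf/sparql"

# Reduced variant of the query from this thread: file extensions only.
QUERY = """
PREFIX at: <http://publications.europa.eu/ontology/authority/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?fileType ?fileExtension
WHERE {
  ?fileType skos:inScheme <http://publications.europa.eu/resource/authority/file-type> ;
            at:op-mapped-code ?mapped .
  ?mapped dc:source "file-extension" ;
          at:legacy-code ?fileExtension .
}
ORDER BY ?fileType
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Encode the query as a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_bindings(payload: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON result document into one dict of plain values per row."""
    data = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


def run_query(query: str, timeout: float = 60.0) -> list[dict[str, str]]:
    """Send the query to the endpoint and return the parsed rows (requires network access)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling `run_query(QUERY)` would return one dictionary per file type with the keys `fileType` and `fileExtension`; file types without a `file-extension` mapped code do not appear in the result.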