SEMICeu / iso-19139-to-dcat-ap

Reference XSLT-based implementation of GeoDCAT-AP
European Union Public License 1.2

Mapping textual description of distribution encodings to URIs #24

Closed andrea-perego closed 3 years ago

andrea-perego commented 3 years ago

Currently, the XSLT output includes URIs for formats only if they are present in the source record.

The reason is twofold:

  1. Using URIs for formats is a recommended practice, as it allows the unambiguous identification of the format, and ensures interoperability
  2. Textual descriptions / labels for file formats are extremely heterogeneous, and it is not possible to take into account all the possible variants

On the other hand, there are also reasons for supporting a text-to-URI mapping:

  1. The current version of the DCAT-AP SHACL constraints requires formats to be specified with a URI reference - see point (4) in https://github.com/SEMICeu/iso-19139-to-dcat-ap/issues/22#issuecomment-765743742
  2. Distribution format is an important piece of information in data catalogues, for filtering purposes

Looking at the geospatial records available from the European Data Portal, using URIs for file formats is far from being a common practice.

So, the proposal is to revise the XSLT to include a provisional mapping from textual labels to URIs, which can be phased out in the future. The labels to be mapped can be those most frequently used for geospatial metadata in the European Data Portal. The full list can be obtained via the following SPARQL queries:

Of course, this solution will not ensure that all distributions have a format specified via a URI, but that is not the purpose of this revision / patch.
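As a rough illustration of how the most frequently used labels could be picked once they have been retrieved, here is a minimal Python sketch. The function name `most_common_labels` and the sample labels are hypothetical. They are not taken from the European Data Portal or from the XSLT, and the normalisation (lower-casing and collapsing whitespace) is only an assumption about how variants might be grouped.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_common_labels(labels: Iterable[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count normalised format labels and return the top_n most frequent ones."""
    counts = Counter(
        " ".join(label.lower().split())  # lower-case and collapse whitespace
        for label in labels
        if label and label.strip()       # skip empty or blank labels
    )
    return counts.most_common(top_n)


# Hypothetical sample of format labels, for illustration only.
sample = ["GML", "gml ", "Shapefile", "", "WMS", "gml", "shapefile"]
top_two = most_common_labels(sample, top_n=2)
print(top_two)
```

Only the labels at the top of such a ranking would need an entry in the provisional mapping. Rarer variants would simply remain unmapped.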

andrea-perego commented 3 years ago

The proposed revision has been implemented in PR https://github.com/SEMICeu/iso-19139-to-dcat-ap/pull/26

The adopted approach is as follows:

  1. The reference URI registers used are, in order of precedence:
    • The OP's file types NAL (i.e., the one recommended by DCAT-AP)
    • The IANA Media Types register
    • The INSPIRE Media Types register
  2. If the format specified in the textual label does not correspond to any of the entries in the reference registers, the closest entry from the reference registers is used (e.g., XML for XML-based formats)
  3. When the textual label denotes a service (WMS, WFS, etc.), it is mapped to the primary / default output format of such service from the reference registers (e.g., the format for CSW is XML)
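The three rules above can be sketched in Python as follows. This is not the code from the PR, which is written in XSLT. The function names, the label keys and the register contents are assumptions made for illustration, and the URIs are examples of the register URI patterns rather than a verified extract. Only the CSW-to-XML default comes from the description above.

```python
from typing import Optional

OP_FILE_TYPE = "http://publications.europa.eu/resource/authority/file-type/"
IANA = "https://www.iana.org/assignments/media-types/"
INSPIRE = "http://inspire.ec.europa.eu/media-types/"

# Rule 1: registers in order of precedence (illustrative entries only).
REGISTERS = [
    {"xml": OP_FILE_TYPE + "XML", "gml": OP_FILE_TYPE + "GML"},
    {"xml": IANA + "application/xml",
     "kml": IANA + "application/vnd.google-earth.kml+xml"},
    {"shapefile": INSPIRE + "application/x-shapefile"},
]

# Rule 2: labels without an entry of their own fall back to the closest one
# (hypothetical example of an XML-based format).
CLOSEST_ENTRY = {"iso 19139": "xml"}

# Rule 3: service labels map to the default output format of the service.
SERVICE_DEFAULT_FORMAT = {"csw": "xml"}


def lookup(key: str) -> Optional[str]:
    """Return the URI from the first register that knows the key."""
    for register in REGISTERS:
        if key in register:
            return register[key]
    return None


def label_to_uri(label: str) -> Optional[str]:
    """Map a textual format label to a URI, or None if nothing matches."""
    key = " ".join(label.lower().split())        # normalise case and spacing
    key = SERVICE_DEFAULT_FORMAT.get(key, key)   # rule 3
    # Rule 1 first; rule 2 only if no register has a direct entry.
    return lookup(key) or lookup(CLOSEST_ENTRY.get(key, ""))


print(label_to_uri("XML"))   # found in the first register, so IANA is not consulted
print(label_to_uri("CSW"))   # resolved through the service default (XML)
print(label_to_uri("DWG"))   # no entry in this sketch, so no URI is produced
```

Labels that match none of the rules return `None`, which mirrors the point made earlier in the thread that the mapping does not guarantee a URI for every distribution.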

andrea-perego commented 3 years ago

As no objections were raised, I will merge PR https://github.com/SEMICeu/iso-19139-to-dcat-ap/pull/26 and close this issue.