INSPIRE-MIF / gp-geopackage-encodings

Good practice for GeoPackage encodings of INSPIRE datasets

[END] Code lists #17

Closed heidivanparys closed 1 year ago

heidivanparys commented 3 years ago

The idea behind the table CodelistProperties is related to what the Schema extension tries to do.

I think there are several alternatives:

  1. Keep the table but adapt the naming to the naming in the GeoPackage spec (see also #11): use column name table_name instead of tableName and column_name instead of propertyNameName, change the name of the table to e.g. gpkgext_data_column_codelists.
  2. Find out whether it is possible to add a value, e.g. codelist, to the list of allowed values in gpkg_data_column_constraints.constraint_type

As for 2: requirement 58:

An extension MAY define additional tables or columns. An extension MAY allow new values or encodings for existing columns.

An example:

gpkg_data_columns

| table_name | column_name | name | title | description | mime_type | constraint_name |
| --- | --- | --- | --- | --- | --- | --- |
| legislationcitation | level | | | The level at which the legislative instrument is adopted. | | legislationlevelvalue |

gpkg_data_column_constraints

| constraint_name | constraint_type | value | ... |
| --- | --- | --- | --- |
| legislationlevelvalue | codelist | http://inspire.ec.europa.eu/codelist/LegislationLevelValue | ... |

As for the value in the actual table, e.g. legislationcitation,

| level | ... |
| --- | --- |
| european | ... |
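
The example above can be sketched with Python's built-in sqlite3 module. This is only a sketch under assumptions: the table definitions are simplified versions of the Schema extension tables (not the full normative DDL), the constraint type `codelist` is the value proposed in alternative 2 and is not defined by the GeoPackage specification, and the helper name `codelist_for` is made up for illustration.

```python
import sqlite3
from typing import Optional

CODELIST_URI = "http://inspire.ec.europa.eu/codelist/LegislationLevelValue"

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified Schema extension tables (assumption: reduced column set).
    CREATE TABLE gpkg_data_columns (
        table_name      TEXT NOT NULL,
        column_name     TEXT NOT NULL,
        name            TEXT,
        title           TEXT,
        description     TEXT,
        mime_type       TEXT,
        constraint_name TEXT,
        PRIMARY KEY (table_name, column_name)
    );
    CREATE TABLE gpkg_data_column_constraints (
        constraint_name TEXT NOT NULL,
        constraint_type TEXT NOT NULL,  -- 'codelist' is the proposed, non-standard value
        value           TEXT
    );
    -- The actual data table, reduced to the one column of interest.
    CREATE TABLE legislationcitation (level TEXT);
    """
)
conn.execute(
    "INSERT INTO gpkg_data_columns (table_name, column_name, description, constraint_name) "
    "VALUES (?, ?, ?, ?)",
    ("legislationcitation", "level",
     "The level at which the legislative instrument is adopted.",
     "legislationlevelvalue"),
)
conn.execute(
    "INSERT INTO gpkg_data_column_constraints (constraint_name, constraint_type, value) "
    "VALUES (?, ?, ?)",
    ("legislationlevelvalue", "codelist", CODELIST_URI),
)
conn.execute("INSERT INTO legislationcitation (level) VALUES (?)", ("european",))


def codelist_for(conn: sqlite3.Connection, table: str, column: str) -> Optional[str]:
    """Return the code list URI registered for a column, or None if there is none."""
    row = conn.execute(
        """
        SELECT c.value
        FROM gpkg_data_columns AS d
        JOIN gpkg_data_column_constraints AS c
          ON c.constraint_name = d.constraint_name
        WHERE d.table_name = ? AND d.column_name = ?
          AND c.constraint_type = 'codelist'
        """,
        (table, column),
    ).fetchone()
    return row[0] if row else None


print(codelist_for(conn, "legislationcitation", "level"))
```

A reader of the file could use such a lookup to find out which code list the short value `european` belongs to.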

I haven't thought this completely through, but I think we should consider either adding a "notation" property to the INSPIRE registry (see also https://www.w3.org/TR/skos-reference/#notations) or doing something similar to what is currently being discussed in https://github.com/opengeospatial/NamingAuthority/issues/92 with "CURIE", "A syntax for expressing Compact URIs" (see also https://www.w3.org/TR/curie/).
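
To illustrate the CURIE idea, here is a minimal sketch of expanding a compact value into a full URI by concatenating a prefix IRI and a reference. The prefix name `legislationlevel` and the function `expand_curie` are assumptions for illustration only, and the sketch ignores parts of the CURIE syntax such as safe CURIEs in square brackets.

```python
# Hypothetical prefix map; the prefix name is an assumption, not an agreed convention.
PREFIXES = {
    "legislationlevel": "http://inspire.ec.europa.eu/codelist/LegislationLevelValue/",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'prefix:reference' by concatenating the prefix IRI and the reference."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + reference


print(expand_curie("legislationlevel:european"))
```

Because the prefix IRI is stored explicitly, the expansion does not depend on guessing whether a slash or a hash separates the code list from the value.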

We need to make sure that the exact value from the code list is unambiguous if we don't want to specify the full URI. In INSPIRE, slash namespaces (/) are usually used, but what if another community uses hash namespaces (#) (see also https://www.w3.org/TR/swbp-vocab-pub/)? Or what if there is a code list in which certain values actually have a URI with a different first part? Certain code lists are extensible, so a national data provider may define extra values with a URI in a domain owned by that data provider.
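
The ambiguity can be shown with a small sketch. The function `local_part` implements the "assume the last part of the URI" approach; the two `example.org` URIs are hypothetical and only stand in for a hash namespace and for a value defined in a national provider's own domain.

```python
from urllib.parse import urlsplit


def local_part(uri):
    """Return the fragment if there is one, otherwise the last path segment."""
    parts = urlsplit(uri)
    if parts.fragment:
        return parts.fragment
    return parts.path.rstrip("/").rsplit("/", 1)[-1]


uris = [
    # Slash namespace, as usually used in INSPIRE.
    "http://inspire.ec.europa.eu/codelist/LegislationLevelValue/european",
    # Hash namespace (hypothetical).
    "http://example.org/codelist/LegislationLevelValue#european",
    # Extension value in a domain owned by a data provider (hypothetical).
    "http://data.example.org/national/codes/european",
]

# All three URIs collapse to the same short value.
short_values = {local_part(u) for u in uris}
print(short_values)
```

Going from URI to short value is easy, but the reverse is not: the short value alone does not say which of the three URIs was meant.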

Additional note on "notation": it would make it possible to say that the notation for https://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/1stOrder is "1", that the notation for https://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/2ndOrder is "2", etc., resulting in a smaller GeoPackage; see also https://docs.ogc.org/per/20-019r1.html#_enumerations.
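
As a rough sketch of what such a mapping buys, the following uses a plain dictionary between notations and URIs. The notations "1" and "2" are the hypothetical ones from the note above (INSPIRE does not define them today), and `encode`/`decode` are made-up helper names.

```python
BASE = "https://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/"

# Hypothetical notations; these are not defined in the INSPIRE registry.
NOTATION_TO_URI = {
    "1": BASE + "1stOrder",
    "2": BASE + "2ndOrder",
}
URI_TO_NOTATION = {uri: notation for notation, uri in NOTATION_TO_URI.items()}


def encode(uri):
    """Return the short notation to store in the GeoPackage column."""
    return URI_TO_NOTATION[uri]


def decode(notation):
    """Return the full URI for a stored notation."""
    return NOTATION_TO_URI[notation]


uri = BASE + "1stOrder"
saved = len(uri.encode("utf-8")) - len(encode(uri).encode("utf-8"))
print(f"{saved} bytes saved per stored value")
```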

In any case, I think we need to describe our preferred solution as a GeoPackage extension and contact OGC:

Implementers that are interested in developing their own extensions are encouraged to contact OGC to ensure that the extensions are developed in accordance with OGC policies and in a way that minimizes risks to interoperability. OGC will consider adopting externally developed extensions that address a clear use case, have a sound technical approach, and have a commitment to implementation by multiple implementers.

heidivanparys commented 3 years ago

I did some research and some more brainstorming, please have a look and comment.

Some more information regarding "notation".

Concept

Regarding the concept, see term "notation" in ISO 25964-1:2011 and the comment for term "identifier" in ISO 25964-1:2011, both from the Information and documentation domain.

notation; class code; class number; classmark: set of symbols representing a concept in a structured vocabulary, especially a classification scheme

EXAMPLE:

| Notation | Source vocabulary | Concept |
| --- | --- | --- |
| 07.04.4 | ILO Thesaurus | fishery policy and development |
| 622.342 2 | Dewey Decimal Classification | gold mining |
| 373.3.016:51 | Universal Decimal Classification | mathematics curriculum in primary schools |
| SBS XEJ B | Bliss Bibliographic Classification | endangered species law |
| H40-H42 | International Statistical Classification of Diseases and Related Health Problems | glaucoma |

Note 1 to entry: Notation is sometimes used to sort and/or locate concepts in a predetermined systematic order and, optionally, to display how the components of complex concepts have been structured and grouped. A notation can provide the link between alphabetical and systematic lists in a thesaurus. In the context of classification schemes, "concepts" are often known as "subjects", especially when they are complex, as in the examples above.

identifier: set of symbols, usually alphanumeric, designating a concept or a term or another entity for purposes of unique identification within a determined context or resource, especially in a computer system or network

Note 1 to entry: A notation is sometimes used as an identifier.

"Notation" is closely related to "coded value" from the information technology domain, see term "code value" from ISO/IEC 2382:2015:

code value; code element; code: result of applying a code to an element of a coded set

Note 1 to entry: Examples: "CDG" representing Paris Charles-de-Gaulle in the code for three-letter representation of airport names; the hexadecimal number 0041 representing "Latin capital letter A" in ISO/IEC 10646-1.

Note 2 to entry: code value; code element; code: terms and definition standardized by ISO/IEC [ISO/IEC 2382-4:1999]. [...]

(and "controlled vocabulary" is closely related to "code set")

"Notation" is implemented in SKOS:

A notation is a string of characters such as "T58.5" or "303.4833" used to uniquely identify a concept within the scope of a given concept scheme.

A notation is different from a lexical label in that a notation is not normally recognizable as a word or sequence of words in any natural language.

It is also proposed for implementation in schema.org, see property https://schema.org/codeValue of type https://schema.org/CategoryCode.

Examples of use

ISO 639-2 uses "notation" to indicate the 3-letter code, see https://id.loc.gov/vocabulary/iso639-2.html. An example (see also https://id.loc.gov/vocabulary/iso639-2/eng.html):

<http://id.loc.gov/vocabulary/iso639-2/eng> <http://www.w3.org/2004/02/skos/core#notation> "eng"^^<http://www.w3.org/2001/XMLSchema#string> .

And an example using schema.org, from https://schema.org/CategoryCode:

    [
            {
                    "@context": "https://schema.org/"
            },
            {
                    "@type": "CategoryCodeSet",
                    "@id": "http://id.loc.gov/vocabulary/iso639-2",
                    "name": "ISO 639-2: Codes for the Representation of Names of Languages",
                    "hasCategoryCode": "http://id.loc.gov/vocabulary/iso639-2/cze"
            },
            {
                    "@type": "CategoryCode",
                    "@id": "http://id.loc.gov/vocabulary/iso639-2/cze",
                    "codeValue": "cze",
                    "name": {
                            "en": "Czech",
                            "fr": "tchèque",
                            "de": "Tschechisch"
                    },
                    "inCodeSet": "http://id.loc.gov/vocabulary/iso639-2"
            }
    ]
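
As a small sketch of how such a description could be consumed, the following reads a shortened copy of the schema.org example with Python's json module and builds a lookup from `codeValue` to the identifier of the code. It treats the document as plain JSON, not as JSON-LD (no context processing), and the variable names are made up for illustration.

```python
import json

# Shortened copy of the schema.org example, as syntactically valid JSON.
doc = json.loads("""
[
    {"@context": "https://schema.org/"},
    {
        "@type": "CategoryCodeSet",
        "@id": "http://id.loc.gov/vocabulary/iso639-2",
        "hasCategoryCode": "http://id.loc.gov/vocabulary/iso639-2/cze"
    },
    {
        "@type": "CategoryCode",
        "@id": "http://id.loc.gov/vocabulary/iso639-2/cze",
        "codeValue": "cze",
        "inCodeSet": "http://id.loc.gov/vocabulary/iso639-2"
    }
]
""")

# Map each code value to the URI of the code it denotes.
code_to_id = {
    node["codeValue"]: node["@id"]
    for node in doc
    if node.get("@type") == "CategoryCode"
}
print(code_to_id)
```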

EEA is using "notation" in (some of? all of?) their code lists, see e.g. http://dd.eionet.europa.eu/vocabulary/inspire/DesignationSchemeValue/view and http://dd.eionet.europa.eu/vocabulary/inspire/DesignationSchemeValue/rdf:

<skos:Concept rdf:about="http://dd.eionet.europa.eu/vocabulary/inspire/DesignationSchemeValue/nationalDesignationTypeCode">
<skos:notation>nationalDesignationTypeCode</skos:notation>
<skos:prefLabel>National CDDA designations</skos:prefLabel>
<!-- ... -->
</skos:Concept>

OGC uses "notation" in many of their code lists, see e.g. code "bp" from https://github.com/opengeospatial/NamingAuthority/blob/9d132468957d1bb9a90bd06fb8e01f999985d92e/definitions/conceptschemes/doc-type.ttl:

<http://www.opengis.net/def/doc-type/bp>
  rdf:type skos:Concept ;
  policy:status status:valid ;
  rdfs:label "Best Practices Document"^^xsd:string ;

  skos:notation "bp"^^policy:lcname ;
  skos:prefLabel "Best Practices Document"^^xsd:string ;
.

I also found examples from Norway (@MortenBorrebaek, @jetgeo), see e.g. code 181 from https://register.geonorge.no/kodelister/byggesoknad/bygningstype (but they use dcterms:identifier instead of skos:notation in the RDF encoding):

    <skos:Concept rdf:about="https://register.geonorge.no/kodelister/byggesoknad/bygningstype/garasje-uthus-eller-anneks-til-bolig/980b340e-e402-4c2b-8163-01a3f3aaf7fd">
        <skos:inScheme rdf:resource="https://register.geonorge.no/kodelister/byggesoknad/bygningstype" />
        <skos:topConceptOf rdf:resource="https://register.geonorge.no/kodelister/byggesoknad/bygningstype" />
        <skos:prefLabel xml:lang="no">Garasje, uthus eller anneks til bolig</skos:prefLabel>
        <!-- ... -->
        <dcterms:identifier>181</dcterms:identifier>
        <!-- ... -->
    </skos:Concept>
    <gml:dictionaryEntry>
        <gml:Definition gml:id="byggesoknad.181">
            <gml:metaDataProperty>
                <gml:GenericMetaData>
                    <status xsi:type="gml:StringOrRefType">Sendt inn</status>
                </gml:GenericMetaData>
            </gml:metaDataProperty>
            <!-- ... -->
            <gml:identifier codeSpace="https://register.geonorge.no/kodelister/byggesoknad/bygningstype">181</gml:identifier>
            <gml:name>Garasje, uthus eller anneks til bolig</gml:name>
        </gml:Definition>
    </gml:dictionaryEntry>
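
A minimal sketch of reading such codes from RDF/XML with Python's standard library, looking for skos:notation first and falling back to dcterms:identifier. The `rdf:RDF` wrapper and its namespace declarations are assumptions added to make the shortened snippets parseable, and `codes_by_uri` is a made-up helper name. A real implementation would probably use an RDF library instead of plain XML parsing.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dcterms": "http://purl.org/dc/terms/",
}

EEA_URI = ("http://dd.eionet.europa.eu/vocabulary/inspire/"
           "DesignationSchemeValue/nationalDesignationTypeCode")
NORWAY_URI = ("https://register.geonorge.no/kodelister/byggesoknad/bygningstype/"
              "garasje-uthus-eller-anneks-til-bolig/980b340e-e402-4c2b-8163-01a3f3aaf7fd")

# Shortened versions of the two concepts shown above, wrapped in an assumed rdf:RDF root.
RDF_XML = f"""<rdf:RDF xmlns:rdf="{NS['rdf']}" xmlns:skos="{NS['skos']}"
         xmlns:dcterms="{NS['dcterms']}">
  <skos:Concept rdf:about="{EEA_URI}">
    <skos:notation>nationalDesignationTypeCode</skos:notation>
  </skos:Concept>
  <skos:Concept rdf:about="{NORWAY_URI}">
    <dcterms:identifier>181</dcterms:identifier>
  </skos:Concept>
</rdf:RDF>"""


def codes_by_uri(rdf_xml):
    """Map each concept URI to its skos:notation, else its dcterms:identifier."""
    root = ET.fromstring(rdf_xml)
    about = "{%s}about" % NS["rdf"]
    result = {}
    for concept in root.findall("skos:Concept", NS):
        code = concept.findtext("skos:notation", namespaces=NS)
        if code is None:
            code = concept.findtext("dcterms:identifier", namespaces=NS)
        if code is not None:
            result[concept.get(about)] = code
    return result


codes = codes_by_uri(RDF_XML)
print(codes[NORWAY_URI])
```

The Norwegian concept shows why an explicit code is useful: the code 181 cannot be derived from the last part of the concept URI.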

Notation is not used in INSPIRE, probably because of the focus on GML. From the GML 3.3 specification:

Definition and Dictionary encoding is part of the GML schema as a stop-gap, pending the availability of a suitable general purpose dictionary model. Since the GML Dictionary schema was developed, standards on this topic within the semantic web community have emerged and matured. In particular best-practice is to generally use URIs for referring to items in vocabularies, and RDF (OWL, SKOS) for encoding their descriptions.

But! That all makes sense for publishing data "on the web" following the best practices for that. Having all these URIs in a relational database is less practical, though, and it is not how data is managed there: relational databases use "code tables", also called "lookup tables" or "picklist tables".

If all INSPIRE code list values had a "notation" as well, we would be able to use that as the value to put in a GeoPackage column. And if the code list registry were available as a SQLite file as well, in addition to the formats supported today, it would be even easier for users to combine the data, and to transform data to GML and pick the correct URI, where needed, based on the URI of the code list itself and the notation. That would be two requests for development of https://github.com/ec-jrc/re3gistry.
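
A sketch of how that combination could look, assuming the registry were exported as a lookup table. Everything here is hypothetical: the table `codelist_value` and its columns, the data table `administrativeunit` with its column `nationallevel`, the notations "1" and "2", and the helper `resolve_uris`.

```python
import sqlite3

CODELIST = "https://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel"

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical lookup table, as it might be exported from a code list registry.
    CREATE TABLE codelist_value (
        codelist_uri TEXT NOT NULL,
        notation     TEXT NOT NULL,
        value_uri    TEXT NOT NULL,
        PRIMARY KEY (codelist_uri, notation)
    );
    -- Hypothetical data table storing only the short notation.
    CREATE TABLE administrativeunit (id INTEGER PRIMARY KEY, nationallevel TEXT);
    """
)
conn.executemany(
    "INSERT INTO codelist_value VALUES (?, ?, ?)",
    [
        (CODELIST, "1", CODELIST + "/1stOrder"),
        (CODELIST, "2", CODELIST + "/2ndOrder"),
    ],
)
conn.executemany(
    "INSERT INTO administrativeunit (id, nationallevel) VALUES (?, ?)",
    [(1, "1"), (2, "2")],
)


def resolve_uris(conn, codelist_uri):
    """Join the data table with the lookup table to get the full URI per row."""
    return conn.execute(
        """
        SELECT d.id, c.value_uri
        FROM administrativeunit AS d
        JOIN codelist_value AS c ON c.notation = d.nationallevel
        WHERE c.codelist_uri = ?
        ORDER BY d.id
        """,
        (codelist_uri,),
    ).fetchall()


rows = resolve_uris(conn, CODELIST)
print(rows)
```

With a separate registry file, the lookup table could be made available to the same query via SQLite's `ATTACH DATABASE` instead of being created in the data file.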

Advantage 1: the value to use in a GeoPackage file is unambiguous and explicit, as opposed to "assuming" that the value to use is the last part of the URI.

Advantage 2: it makes it possible to use codes other than the last part of the URI. For INSPIRE those would probably be the same in many cases (but "1" could perhaps be used for https://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/1stOrder instead of "1stOrder", see also the earlier comment). However, many datasets that are not INSPIRE-harmonized use numeric codes or short 2-letter or 3-letter codes. A Danish example from https://danmarksadresser.dk/adressedata/kodelister/livscyklus/:

| code value | label |
| --- | --- |
| 2 | proposed |
| 3 | current |
| 4 | retired |
| ... | ... |

This is relevant because we are trying to develop an extension that is not INSPIRE-specific.

thorsten-reitz commented 3 years ago

Results of the discussion on 30.06.2021:

heidivanparys commented 1 year ago

I think there is no need to keep this issue open. Having worked with GeoPackage and the specification more, I don't think that the GeoPackage specification should be extended regarding code lists.

I still think that the INSPIRE registry should contain a field with the code value for use in encodings that are not necessarily "semantic web enabled". For example, I would like to see https://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/1stOrder/1stOrder.en.rdf contain <skos:notation>1stOrder</skos:notation>. But that should be addressed in another forum.

Regarding the elaborations in the comments above, the upcoming ISO/DIS 19103 will contain an annex on code lists, so any comments and discussions can take place in TC 211.