CLARIAH / wp2-GISLOD

Exchange for issues and code related to transposition of gemeentegeschiedenis.nl-data to LOD
MIT License

Clarify what a Gemeentegeschiedenis URI is #7

Closed wouterbeek closed 7 years ago

wouterbeek commented 7 years ago

Disclaimer: I have little experience with using Gemeentegeschiedenis URIs. Some of the issues raised here may be due to my lack of knowledge.

Identity properties of Gemeentegeschiedenis URIs

Am I correct in the following? Gemeentegeschiedenis URIs try to capture the notion of a municipality. Under this approach, a municipality ceases to exist when its name changes. A municipality also ceases to exist when the name of the province in which it is located changes (e.g., if Noord-Holland, Zuid-Holland and Utrecht were to become the Randstad province in some, possibly imaginary, future scenario). When a municipality with name A and geometry X ceases to exist, and a municipality with the same name A and the same geometry X later comes into existence, the two are considered different municipalities. However, a municipality does not cease to exist when only its geometry changes and its name does not. It follows that a municipality A does not cease to exist when it is merged with another municipality B, as long as the resulting municipality has the same name as A.

Character replacements

Gemeentegeschiedenis URIs use the name of a municipality in their path, where apostrophes, commas and spaces are replaced with underscores.

Apostrophes and commas are both allowed in URIs, so replacing them with underscores is, technically speaking, superfluous. At the same time, replacing three different characters with the same replacement character is not an innocent operation: one loses information. E.g., the imaginary municipality names "A, B" and "A 'B" would both map to A__B.

Given that replacing two of the three characters is not necessary, and that information loss is avoided by mapping one character rather than three to an underscore, I suggest no longer replacing apostrophes and commas in Gemeentegeschiedenis URIs.
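The information loss can be reproduced with a minimal Python sketch. The two function names are hypothetical and only illustrate the rule as described above; they are not the code that generates the actual URIs.

```python
def replace_all_three(name: str) -> str:
    """Replace apostrophes, commas and spaces with underscores (lossy)."""
    for ch in ("'", ",", " "):
        name = name.replace(ch, "_")
    return name


def replace_spaces_only(name: str) -> str:
    """Replace only spaces, leaving apostrophes and commas intact."""
    return name.replace(" ", "_")


# The two imaginary municipality names from the example above.
names = ["A, B", "A 'B"]

print([replace_all_three(n) for n in names])    # both names collapse to A__B
print([replace_spaces_only(n) for n in names])  # the two names stay distinct
```

With only spaces replaced, the mapping is reversible for names that contain no underscores themselves.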

Here is the relevant ABNF snippet from RFC 3986 (https://tools.ietf.org/html/rfc3986):

segment     = *pchar
pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
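As a quick check, the rules above can be transcribed into Python sets to test which characters may appear literally in a path segment. This is a sketch: the names `PCHAR_LITERALS` and `allowed_unencoded` are my own, and percent-encoded sequences are deliberately left out because only single literal characters are tested.

```python
import string

# Character classes transcribed from the RFC 3986 ABNF quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | {":", "@"}


def allowed_unencoded(ch: str) -> bool:
    """True if ch may appear literally (not percent-encoded) in a path segment."""
    return ch in PCHAR_LITERALS


for ch in ("'", ",", " "):
    print(repr(ch), allowed_unencoded(ch))
```

The apostrophe and the comma are both sub-delims and pass; the space is in none of the classes, so it is the only one of the three that has to be replaced or percent-encoded.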

Multiple character replacements

In the case of multiple character replacements, (undocumented?) contraction seems to be applied as well. E.g., Tull en 't Waal should be Tull_en__t_Waal after replacement, but appears as Tull_en_t_Waal instead.

Ensuring spatial uniqueness

Since there can be more than one municipality with the same name across space, a province-identifying suffix is sometimes used. For example, Achttienhoven in Utrecht is called Achttienhoven_Ut, and Achttienhoven in Zuid-Holland is called Achttienhoven_ZH.

This approach makes three non-trivial assumptions:

  1. That the provinces are themselves stable. (What happens to the two Achttienhovens when Zuid-Holland and Utrecht merge? This example is not entirely academic, given the recurring plan to merge the 'randstad' provinces.)
  2. That municipalities within the same province cannot have the same name. This assumption may already be violated with Wissekerke_ZBev and Wissenkerke_NBev?
  3. That municipalities cannot be in more than one province at the same time.

Ensuring temporal uniqueness

Since there can be more than one municipality with the same name across time, a numeric suffix is sometimes used. For example, Abcoude in 1818 is called Abcoude_1, and Abcoude in 2017 is called Abcoude_2.

The numeric suffix has no connection to the time at which the municipality exists (not necessarily a bad thing). It is unclear whether the space suffix precedes the time suffix, or the other way around. (But this can be easily established by convention.)
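One way to fix the order by convention is sketched below. The function `local_name` and the choice to put the space suffix before the time suffix are assumptions made for illustration, not something documented for Gemeentegeschiedenis URIs.

```python
from typing import Optional


def local_name(name: str, province: Optional[str] = None,
               seq: Optional[int] = None) -> str:
    """Compose a URI local name as name[_province][_seq].

    Assumed convention: the space (province) suffix always precedes the
    time (sequence number) suffix.
    """
    parts = [name.replace(" ", "_")]
    if province is not None:
        parts.append(province)
    if seq is not None:
        parts.append(str(seq))
    return "_".join(parts)


print(local_name("Achttienhoven", province="Ut"))  # Achttienhoven_Ut
print(local_name("Abcoude", seq=1))                # Abcoude_1
# Hypothetical combination of both suffixes, not an existing URI:
print(local_name("Achttienhoven", province="Ut", seq=1))
```

Any fixed order would do, as long as it is applied consistently.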

rlzijdeman commented 7 years ago

Gemeentegeschiedenis URIs are based on the Amsterdam Code (designed in 2006, 2011). Most of the issues you raise in the section "Identity properties of Gemeentegeschiedenis URIs" actually concern the design of the Amsterdam Code. I would like to ask you to read pages 7-9 and 30-34 of the Repertorium van Nederlandse Gemeenten vanaf 1812 (Van der Meer & Boonstra, 2011).

Gemeentegeschiedenis URIs have remained close to the design of the Amsterdam Code and likewise suffer from the issues you mention regarding the numbering and the province-based naming. However, if I'm not mistaken, they also take care of those issues by providing unique URIs for multiple instances of the same municipality (such as municipalities that ceased to exist by name and later came back, and municipalities with the same name in different areas of the Netherlands).

I think if we are going to adopt some of the improvements you propose, we also need to worry about continuity and congruity of the Amsterdam Code and its representation in Linked Data.

RinkeHoekstra commented 7 years ago

I think it is worthwhile to explore possible improvements to the URI scheme, for instance by following a FRBR-style separation of work-level and expression-level identifiers, and by using "guessable" URIs. This would make our scheme future-proof. A mapping to the Amsterdam Code should always be present. Of course, if this proves to be too ambitious, the Amsterdam Code will be the fallback option.

mmmenno commented 7 years ago

The thing it boils down to: we need unique identifiers for municipalities.

This could easily be accomplished by using UUIDs. These are, however, not very human-readable. Hence the use of municipality names as they were found in the Repertorium dataset (with some minor changes, like _ for spaces and commas, and the deletion of '; see the function with all changes below).

One could argue that unique identifiers are meant to identify concepts uniquely, not to inform the reader of the provincial whereabouts of the concept they refer to. The suffix _ZH doesn't have to mean anything (though it does tell you the municipality was probably originally in Zuid-Holland); it's just a way to distinguish it from another municipality that has the same name.

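function name2url($name){

    // str_replace applies the pairs in order: spaces and commas become "_",
    // while "[", "]" and "'" are deleted. The remaining pairs transliterate
    // specific names to ASCII. "Súdwest_Fryslân" already contains an underscore
    // because its space has been replaced by the time that pair is applied.
    $from = array(" ","[","]","'",",","Ængwirden","Súdwest_Fryslân","Skarsterlân","Wûnseradiel","Gaasterlân-Sleat","Babyloniënbroek","Ohé","Odiliënberg");
    $to = array("_","","","","_","Aengwirden","Sudwest_Fryslan","Skarsterlan","Wunseradiel","Gaasterlan-Sleat","Babylonienbroek","Ohe","Odilienberg");

    $name = str_replace($from,$to,$name);

    return $name;

}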
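For readers who want to try the rules without PHP, here is a rough Python port of the function above. It is an illustrative sketch that assumes the pairs are applied one after another in list order, which is how PHP's str_replace treats array arguments.

```python
# Replacement pairs copied from name2url, in the same order.
PAIRS = [
    (" ", "_"), ("[", ""), ("]", ""), ("'", ""), (",", "_"),
    ("Ængwirden", "Aengwirden"), ("Súdwest_Fryslân", "Sudwest_Fryslan"),
    ("Skarsterlân", "Skarsterlan"), ("Wûnseradiel", "Wunseradiel"),
    ("Gaasterlân-Sleat", "Gaasterlan-Sleat"),
    ("Babyloniënbroek", "Babylonienbroek"), ("Ohé", "Ohe"),
    ("Odiliënberg", "Odilienberg"),
]


def name2url(name: str) -> str:
    """Apply each replacement to the result of the previous one."""
    for old, new in PAIRS:
        name = name.replace(old, new)
    return name


print(name2url("Tull en 't Waal"))  # Tull_en_t_Waal
print(name2url("Súdwest Fryslân"))  # Sudwest_Fryslan
```

Under these rules the single underscore in Tull_en_t_Waal comes from deleting the apostrophe, not from contracting two adjacent underscores.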

ivozandhuis commented 7 years ago

All suffixes (provincial-like, "NBev"/"ZBev", numbers) are derived from the names in the Repertorium mentioned by Richard (search the PDF for Wissekerke or Abcoude, for instance). By using that convention we were sure to create unique name/time-period combinations without extra work for us. We accepted the existing convention, which we considered to have been thought through by the authors of the amco.

Richard is responsible for new names. If the Randstad province ever really comes into being, he has to decide how to deal with that. In theory he could keep the suffixes ZH and Ut to distinguish the two Achttienhovens, because we will probably keep using those names for another thousand years. In that sense it becomes comparable to NBev and ZBev.

wouterbeek commented 7 years ago

Thanks for the useful feedback on this issue! I have made a second Linked Data version that also includes the Amsterdamse Code. Let's discuss this topic further during our meet-up on the 1st of March.