CatalogueOfLife / coldp


Add areaID to Distribution Entity #40

Closed bart-v closed 4 years ago

bart-v commented 4 years ago

https://github.com/CatalogueOfLife/coldp#distribution does not contain a proper way to add an area (GUID) identifier

mdoering commented 4 years ago

The intention was for area to hold the ID for all gazetteers other than TEXT. Adding areaID might be less ambiguous, but it means area and areaID would be mutually exclusive.

bart-v commented 4 years ago

I don't see why they would necessarily be mutually exclusive. area could be a human-readable version of the areaID.

mdoering commented 4 years ago

That's true. But the ID would dictate what the area really is, not the human "label". The label would not be relevant for sharing and would be ignored in favour of the key and its normative title, potentially even translated into different languages. We don't share labels for ranks, statuses or other vocabularies as part of an archive.

bart-v commented 4 years ago

OK, so where is it explained that "area" can contain an identifier?

mdoering commented 4 years ago

Nowhere it seems ;) I will update the readme which is all we have at this stage.

dhobern commented 4 years ago

I've been making use of Distribution records in my datasets and agree that there is room for improvement. Right now, the closest thing that we have to a unique identifier for the area is the combination of the gazetteer and the area. Different areas might have the same "area" value in different gazetteers. My use case is to write a command-line tool for editing the contents of COLDP files directly and there are only two or three areas where I'm hitting current problems. This is one of them.
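
A minimal Python sketch of that point, assuming Distribution rows reduced to (taxonID, gazetteer, area). The `custom` gazetteer, the taxon IDs and the helper names are hypothetical; only the `iso` codes `DE` and `MY-12` appear in this thread. It shows how a bare area value can collide across gazetteers while the (gazetteer, area) pair stays distinct.

```python
from collections import defaultdict

# Hypothetical Distribution rows: (taxonID, gazetteer, area).
# The "custom" gazetteer and the taxon IDs are invented for illustration.
rows = [
    ("t1", "iso", "DE"),
    ("t2", "custom", "DE"),
    ("t3", "iso", "MY-12"),
]


def ambiguous_areas(rows):
    """Return area values that occur under more than one gazetteer."""
    gazetteers_by_area = defaultdict(set)
    for _taxon_id, gazetteer, area in rows:
        gazetteers_by_area[area].add(gazetteer)
    return {
        area: sorted(gazetteers)
        for area, gazetteers in gazetteers_by_area.items()
        if len(gazetteers) > 1
    }


def area_key(gazetteer, area):
    """Composite key: the closest thing to a unique area identifier."""
    return (gazetteer, area)


print(ambiguous_areas(rows))  # {'DE': ['custom', 'iso']}
```

A tool editing these files would therefore need to compare `area_key(...)` values rather than the area string alone.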

I think we should distinguish four strings and discuss which of these should be embedded in the Distribution record - if we are not careful, we will also need a Region or Area record to make sure that we have all we need. The four strings are:

  1. Human readable name for region - "Sabah".
  2. Code for region in gazetteer - "MY-12" for Sabah in ISO.
  3. Genuinely unique ID for region within the dataset - even if the dataset uses multiple gazetteers
  4. URI or other GUID for region - ideally linking to much more information.
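
As a rough illustration, the four strings could be modelled as one record. This is only a sketch: the `Area` class, its field names and the `gazetteer:code` form of the dataset-local ID are assumptions made here, not part of ColDP.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Area:
    """Hypothetical record holding the four strings listed above."""

    name: str                  # 1. human-readable name, e.g. "Sabah"
    gazetteer: str             # namespace in which `code` is unique
    code: str                  # 2. code within the gazetteer, e.g. "MY-12"
    uri: Optional[str] = None  # 4. URI or other GUID, if one exists

    @property
    def local_id(self) -> str:
        # 3. ID unique within the dataset even with multiple gazetteers.
        # Prefixing the code with the gazetteer is an assumed convention.
        return f"{self.gazetteer}:{self.code}"


sabah = Area(name="Sabah", gazetteer="iso", code="MY-12")
print(sabah.local_id)  # iso:MY-12
```

Deriving the third string from the second plus the gazetteer is one way to avoid storing a separate identifier; a dataset could equally assign its own IDs.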

My personal preference would be to engineer this whole space rather better and for TDWG to host explicit and consistent Gazetteers that include all this information and probably also shape file data for the TDWG geographic region list, for the ISO list, etc. Then users could reference these from inside our YAML metadata or else supply their own equivalent Gazetteer definitions inside the COLDP package. That's a little vague but I could explain it further.

Then the patterns for supplying distribution information inside COLDP could be as follows.

1 - Default minimum - text-only distribution information provided denormalised for each record

2 - Default recommended - Explicit URI-based pointers to a gazetteer

3 - Alternative - user supplies custom gazetteer

mdoering commented 4 years ago

I don't think GUIDs or URLs are always the way to go.

I much prefer the idea of reusing existing standards and combining a local ID (area) provided by the standard with the namespace (gazetteer) in which these values are unique. It is also much more in harmony with the rest of the standard, which nowhere mandates GUIDs or URLs as identifiers.

I like the idea of describing the gazetteer in the YAML file, but probably only for custom additions to the standard ones ColDP already lists. Better than YAML would be a selected number of supported gazetteers, which lets us know exactly what we are dealing with and use supporting shape files, hierarchies, translated human labels or whatever else we want and can get hold of.

Is it useful to share Germany as the area when you already have the ISO country code DE for sharing? You introduce options for inconsistency. I found that the previous COL distribution model suffered mostly from inconsistent use.
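
To make the inconsistency risk concrete, here is a small Python sketch of a check a consumer might run if both fields were present. The `ISO_TITLES` lookup, the record layout and `label_conflicts` are hypothetical stand-ins; a real gazetteer would supply the normative titles.

```python
# Tiny stand-in for a gazetteer's normative titles (assumed, not exhaustive).
ISO_TITLES = {"DE": "Germany"}


def label_conflicts(records, titles):
    """Return records whose free-text area disagrees with the title of areaID."""
    conflicts = []
    for rec in records:
        expected = titles.get(rec["areaID"])
        label = (rec.get("area") or "").strip()
        # Skip unknown IDs and empty labels; compare case-insensitively.
        if expected is not None and label and label.casefold() != expected.casefold():
            conflicts.append(rec)
    return conflicts


records = [
    {"areaID": "DE", "area": "Germany"},
    {"areaID": "DE", "area": "Deutschland"},
    {"areaID": "DE", "area": ""},
]
print(label_conflicts(records, ISO_TITLES))
# [{'areaID': 'DE', 'area': 'Deutschland'}]
```

Whether a translated label such as "Deutschland" should count as a conflict is exactly the kind of question that sharing only the ID avoids.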

That said, I am open to having both area and areaID, as the latter is easier to understand. Shall we?

dhobern commented 4 years ago

It has certainly seemed surprising to me that in just this one place we drop the use of ID and simply have "area". As I noted, there is an issue with the possible non-uniqueness of area IDs across multiple gazetteers, so I would like to have a mechanism to guarantee/assert that there is a defined link from a Distribution record to an area.

I started saying something in my previous note to this issue but then got sidetracked. We need to be able to include the relevant external ids like DE for Germany in the ISO gazetteer. We also need to have an ID for the area that will be unique in the context of the COLDP package. If only one gazetteer is used, this may happen naturally, but even so, the implication is that users need to check both the area(ID) and the gazetteer to be sure that the area is the one intended. In my case, I've started including an additional region.csv file in the package (could easily be distribution.csv) to capture the information that I feel I must include to make my COLDP package meaningful (and also so that it is possible to generate human-readable versions simply from joining data within the package).
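
The join described above might look roughly like this in Python. The column names, the taxon IDs and the inline CSV content are assumptions for illustration; the actual layout of the extra region.csv is not given in this thread.

```python
import csv
import io

# Hypothetical file contents; a real package would read these from disk.
REGION_CSV = """gazetteer,areaID,name
iso,DE,Germany
iso,MY-12,Sabah
"""

DISTRIBUTION_CSV = """taxonID,gazetteer,areaID
t1,iso,DE
t2,iso,MY-12
t3,iso,XX
"""


def load_regions(text):
    """Index region names by the (gazetteer, areaID) pair."""
    reader = csv.DictReader(io.StringIO(text))
    return {(r["gazetteer"], r["areaID"]): r["name"] for r in reader}


def label_distributions(dist_text, regions):
    """Pair each taxonID with a readable area name, falling back to the raw ID."""
    labelled = []
    for row in csv.DictReader(io.StringIO(dist_text)):
        key = (row["gazetteer"], row["areaID"])
        labelled.append((row["taxonID"], regions.get(key, row["areaID"])))
    return labelled


regions = load_regions(REGION_CSV)
labelled = label_distributions(DISTRIBUTION_CSV, regions)
print(labelled)  # [('t1', 'Germany'), ('t2', 'Sabah'), ('t3', 'XX')]
```

Keying the lookup on both gazetteer and areaID is what makes the join safe when a package mixes gazetteers.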

Going with area (human text name) and areaID (gazetteer specific ID) seems a good step forward.