AAFC-BICoE / dina-planning

AAFC-DINA planning repository

Geographic Places #138

Closed cgendreau closed 3 years ago

cgendreau commented 3 years ago

Main ticket to present the concept of Geographic Places.

Geographic places can be used to get decimal Lat/Long and uncertainty from the verbatim locality. They can also be used to "tag" a record to a specific geopolitical "place".

heathercole commented 3 years ago

awesome! we will NEED to be able to tell the difference between lat/long from this type of source vs. manual/human effort. If possible, it would be optimal to include associated information such as the area of uncertainty or estimated coverage. A centroid lat/long for a tiny town or small lake would have a different level of error/precision/certainty than a large city or great lake. Of course these would need to be already associated to the lat/long. There are some resources here that may be relevant, as there are lat/long values but also polygons associated with features (eg. lakes). At one point I exported a lot of this and provided it to the GeoLocate developer to increase coverage for Canadian place names.

dshorthouse commented 3 years ago

We will differentiate authoritative, linked, service-based sources for Places from those created denovo. The former will transparently draw-in parent administrative areas with their machine-readable anchors & Darwin Core keys to help supplement discoverability but will not allow independent editing of these for fear of disconnecting any future service calls or introducing ambiguity. And so, the hoped-for functionality here with respect to external services like GeoLocate is that a user links to a single Place, at the most granular level of discoverability in that source. With that single choice and linking action is the transparent pull of its more coarse levels of administrative areas stored as an encapsulated path.

heathercole commented 3 years ago

for the (awesomely useful) options for integration services like GeoLocate (which can go to more precision than just place-name), it will be necessary to be able to maintain/record meta-data about the system-generated points.

For example, if your verbatim locality is "78 km west of Ottawa, Ontario" and the system generated a lat/long for you, we need to be able to tell the difference between lat/long from an "Ottawa" polygon vs. a lat/long that incorporates the additional verbatim info (eg. "78 km west of"), as the coordinates (from GeoLocate) would be different, as well as the precision/uncertainty.

[Images: "Ottawa"; "78km west of Ottawa"]

Users can modify GeoLocate output (eg. a circular uncertainty radius may not be appropriate for "west of"), but in the examples above, I am focusing on what could be likely automated.

If the verbatim locality was "78km west of Ottawa along the Trans-Canada Hwy" then even more precision would be possible by a user (but not GeoLocate)
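One way to keep system-generated points distinguishable from manual ones is to store the source next to the coordinates. The sketch below is only an illustration of that idea: the `Georeference` class, the `GeorefSource` values, and the coordinates and uncertainty radii are all assumptions made up for the example, not GeoLocate output and not part of any DINA schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GeorefSource(Enum):
    MANUAL = "manual"            # a person determined the point
    PLACE_CENTROID = "centroid"  # centroid of a named place polygon
    SERVICE_PARSED = "parsed"    # service also used offsets such as "78 km west of"


@dataclass(frozen=True)
class Georeference:
    verbatim_locality: str
    latitude: float
    longitude: float
    uncertainty_m: Optional[float]  # radius in metres, None if unknown
    source: GeorefSource
    edited_by_user: bool = False    # True once a person adjusts service output


def is_system_generated(g: Georeference) -> bool:
    """A point is system-generated only if no person produced or adjusted it."""
    return g.source is not GeorefSource.MANUAL and not g.edited_by_user


# Placeholder values for illustration only, not real GeoLocate results.
centroid = Georeference("78 km west of Ottawa, Ontario", 45.42, -75.69,
                        20000.0, GeorefSource.PLACE_CENTROID)
parsed = Georeference("78 km west of Ottawa, Ontario", 45.42, -76.69,
                      5000.0, GeorefSource.SERVICE_PARSED)
```

With a field like `source`, a search could separate "Ottawa polygon" points from points that used the full verbatim string, and `edited_by_user` would mark the cases where someone reshaped the service output.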

TravisGobeil commented 3 years ago

I have an early demonstration of how we could manage at least some of this data, if not all. It is early, so please contribute and we can take this idea to fruition.

Please note: This is a very basic demo - it doesn't include the "typing" actions I merely click through them. Please use your 'demo' goggles and view this as a "what if" rather than a "what will be". It is meant to generate discussion and be improved upon, but I think it at least satisfies the basic user requirements. Thank you!

Demo Animatic showing how you can enter the data according to the User Requirements listed below: https://youtu.be/IJM_QHUB5BY

From my perspective, the Geography/Place does not involve Lat/Long. That is for Georeferencing. I am restricting my definition of "Geography or Place" (I am ambivalent as to the preferred name). In my definition it is a place that is definable according to Geopolitical terms, described with text strings. There are no coordinates in this part of the collecting event. I am simply trying to define my "Place" in a common language that others would understand as being that same "Place". We can get very specific (point data) or keep it very general (continent).

User Requirements

I have three requirements for places that I am trying to accommodate with the following wireframes and demo:

  1. A User shall be able to "Interpret" or translate the Verbatim Locality into the appropriate Geographic fields, and then correct these data points as required.
  2. A User shall be able to add Geographic Place data from least specific to most specific (Continent > Municipality)
  3. A User shall be able to add Geographic Place data using the most specific data, and the system will back-fill the hierarchical data (e.g. I enter the city, it fills in Province < Country < Continent automatically)
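Requirement 3 (back-filling from the most specific entry) can be sketched as a walk up a parent lookup. This is a minimal illustration, assuming a hypothetical `PARENT` table keyed on plain names; a real system would more likely key on identifiers from a gazetteer, since names such as "Ottawa" are not unique.

```python
from typing import Dict, List, Optional

# Hypothetical parent lookup: each place name points to its next coarser level.
PARENT: Dict[str, Optional[str]] = {
    "Ottawa": "Ontario",
    "Ontario": "Canada",
    "Canada": "North America",
    "North America": None,
}


def backfill(place: str, parent: Dict[str, Optional[str]] = PARENT) -> List[str]:
    """Return the path from the most specific place up to the coarsest level."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = place
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in hierarchy at {current!r}")
        if current not in parent:
            raise KeyError(f"unknown place: {current!r}")
        seen.add(current)
        path.append(current)
        current = parent[current]
    return path


print(backfill("Ottawa"))  # ['Ottawa', 'Ontario', 'Canada', 'North America']
```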

Basic wireframe showing my fields thus far:

[Image: Diagram-Geography-Empty]

dshorthouse commented 3 years ago

@TravisGobeil Thanks, that's precisely what we're aiming for here. There is plenty of material for discussion such as:

  1. Should a hierarchy(ies) of Place names be relational & stored in a single home, i.e. edit once and the edit is applied everywhere (WARNING: incurs a significant maintenance burden)
  2. Should a hierarchy of Place names be flattened and stored as little, unlinked text-based snippets within each collecting event, i.e. NOT relational as in 1 above, requiring that Place names be examined across multiple collecting events should a global edit be needed
  3. If each Collecting Event has its own little hierarchy of Place names, are these self-referential in any way? ie. The system should not allow me to enter "Quebec" in the Country field and when I type "Canada" in the Country field, I should be presented with (and limited to) the known Canadian Provinces in State/Province
  4. When calling an external service, should we be free to edit what comes back for any and all parent Place names in the hierarchy that's provided? (WARNING: functional disconnect from that service ensues)
  5. When calling an external service, should we merely use it to link the lowest, most granular, available Place as a single "tag" (with retention of machine-readable identifier, drawing on other info like hierarchy as needed)?
  6. How do we best combine historic Place names, perhaps with their time-dependent hierarchies (eg. Kingdom of Prussia => Holy Roman Empire), with contemporary assertions of those Place names & their contemporary hierarchies (eg. Germany => Europe)? Use case: Show me all specimen records that have not yet been georeferenced or cannot be georeferenced from Germany, regardless of how old those specimens are.
  7. What requirements do we have for external Place name services? What are the must haves and what are the nice to haves?
  8. How do we best combine reference-able data from remote services with local edits, especially when there might be nothing produced from those remote services?
  9. How do we best combine water bodies with geopolitical Places, especially when there is often a blend of these two that results in ambiguity? eg Lac Philippe, Quebec, Canada vs. Lake Temiskaming, Quebec/Ontario (could be both), Canada
  10. Do we treat water bodies as completely separate from geopolitical Places?
  11. Do we care about having to accommodate ongoing geopolitical disputes eg Hans Island (both Canada and Denmark claim ownership of it in the High Arctic)?
  12. Do we want to support Place names in multiple languages?

dshorthouse commented 3 years ago

For what it's worth, the consensus with the German side of DINA is a preference for number 5 above. The rationale is that what we're really trying to do here with Place names is to facilitate search when georeferencing cannot be or has not yet been completed. None of us sensu lato is interested in maintaining a hierarchy of Place names wherever or however these are stored. But, it would be really useful if in the process of using a remote service, I could choose a single, most granularly available Place and all the goodies of all the upper levels of the hierarchy, inclusive of ambiguities, were captured as a single, uneditable, structured unit. I could later search on any of those Places in the snippet hierarchy from a single search box to retrieve records, but I never need to see all of them and I certainly never need to edit them. I can kill that single tag and choose a different Place & I need not have to worry about additionally flushing out all those levels of the hierarchy that were implicitly bundled within that tag. Likewise, assuming what came back from the service was sufficiently structured, I'd be able to publish those populated Darwin Core terms, ambiguities notwithstanding. And, most attractive of all (if the service supported it), we could store that "tag" and its uneditable geoJSON blob of hierarchies in any and all additional languages we cared about.

cgendreau commented 3 years ago

As a first step, I would actually try to only rely on the service and the entries would not be modifiable.

Probably displaying them as breadcrumbs "Ottawa/Ontario/Canada" with a link to see that place on the "source" website.
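The single, uneditable tag with breadcrumb display discussed above could look roughly like this. It is a sketch under assumptions: `PlaceTag`, the `example-gazetteer` source name, and the `example.org` URI are invented for the illustration and do not describe any real service response.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: the tag cannot be edited after creation
class PlaceTag:
    source: str            # name of the external service (assumed)
    uri: str               # stable identifier returned by that service
    path: Tuple[str, ...]  # most granular place first, then its parents

    def matches(self, query: str) -> bool:
        """True if the query equals any level of the bundled hierarchy."""
        q = query.casefold()
        return any(q == name.casefold() for name in self.path)

    def breadcrumbs(self) -> str:
        return "/".join(self.path)


def search(records: Dict[str, PlaceTag], query: str) -> List[str]:
    """Return ids of records whose tag contains the queried place at any level."""
    return sorted(rid for rid, tag in records.items() if tag.matches(query))


# Hypothetical record store: one tag per record.
records = {
    "rec-1": PlaceTag("example-gazetteer", "https://example.org/place/1",
                      ("Ottawa", "Ontario", "Canada")),
}

print(records["rec-1"].breadcrumbs())  # Ottawa/Ontario/Canada
print(search(records, "ontario"))      # ['rec-1']
```

Replacing the tag on a record swaps the whole path at once, so there are no leftover upper levels to flush.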

heathercole commented 3 years ago

It seems to me, this is very similar to the verbatim and structured 'Agent' fields. The (verbatim) locality is entered, then, there is a search which links to structured place names. If the structured place-names are NOT modifiable, then there also needs to be functionality for a user to add their own, in the case where a place-name is not available, or is ambiguous. There also needs to be some capacity to either synonymize place-names or create aliases so that verbatim localities such as 'CEF Ottawa'; 'Ottawa Central Experimental Farm'; 'ORDC'; and 'Ottawa Research and Development Centre' can all be connected to the same "structured place name".

Above, what I mean by ambiguity, is what if there is a place-name (eg. the Rocky Mountains) which goes across more than one province/state/country. Verbatim locality may indicate "Rocky Mountains, BC", so a user would want to be able to include "British Columbia" as part of the structured information, not just "Rocky Mountains".
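The alias idea (several verbatim strings pointing to one structured place name) can be sketched as a normalized lookup table. The identifier `place:central-experimental-farm` and the `normalize` rules are assumptions for illustration; the four alias strings are the ones listed in this comment.

```python
import re
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Casefold, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(cleaned.split())


# Hypothetical identifier for one structured place name.
CEF = "place:central-experimental-farm"

ALIASES: Dict[str, str] = {
    normalize(alias): CEF
    for alias in (
        "CEF Ottawa",
        "Ottawa Central Experimental Farm",
        "ORDC",
        "Ottawa Research and Development Centre",
    )
}


def resolve(verbatim: str) -> Optional[str]:
    """Return the structured place id for a verbatim string, or None if unknown."""
    return ALIASES.get(normalize(verbatim))
```

An exact-match table like this only handles known variants; anything it does not recognise would still need a person, or a fuzzier search, to link it.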

dshorthouse commented 3 years ago

@heathercole As we're talking named Places and not geographic features (eg Rocky Mountains), a better example might be Lloydminster.

heathercole commented 3 years ago

regarding demo: I love it! a great start!

A few notes: As demonstrated by this example, I don't think that GeoLocate should be used as the authority for Canadian place name structure/hierarchy. It is really not meant for that. GeoLocate is awesome for adding coordinates to text strings, but there should be something more authoritative for entries. Does GeoLocate know about Nunavut? I think the change to Ottawa-Carleton county may only be a few years more recent. (answer seems to be "sort of")

Related to this, as the 'county' level in Canada is potentially a mix of county/regional municipality, there is the question as to whether this is a needed field; it would not be acceptable to have it being automatically populated with out-of-date/incorrect data.

A 3-tier approach is often selected, with 1) place name, 2) province/state level, 3) country. Is it necessary for the collection managers to include 'Continent'? Does it make sense for "Place names" to be separated by type: water body, island group, etc.? If there is a place name that is a lake on an island in a country, does that data export in several different fields, or is it concatenated into fewer? What would a data export look like that included the 'locations' below, and what data fields would be included?

Central Experimental Farm, Ottawa, Ontario vs Treasure Island on Lake Mindemoya on Manitoulin Island on Lake Huron, Canada (which is in Ontario, but that 'wasn't on the label') http://www4.rncan.gc.ca/search-place-names/unique/FCXIX

I think there are lots of answers/solutions possible, but some of them also structure fields/function at this stage.

edit: ps. the approach to enter data from MOST specific is a requirement; most current systems have this functionality, and it is a feature that supports the best use of people's time and is really a significant component of using structured names.

heathercole commented 3 years ago

@dshorthouse geographic features may also appear on specimen labels; "Rocky Mountains, British Columbia" has more relevant information than only "British Columbia". A lat/long centroid based on the related area within BC would be (albeit marginally) more relevant than a centroid based on all of BC.

as noted above, I think there are several options/approaches, but we need to review at what level the managers need to maintain/access their data to continue to develop how the system interacts with structured names and associated information. The collections are all understaffed, so while managers are always responsible for data-quality, the system should support making that as accessible as possible, with least possible burden to their time.

dshorthouse commented 3 years ago

The collections are all understaffed, so while managers are always responsible for data-quality, the system should support making that as accessible as possible, with least possible burden to their time.

Exactly why I wrote:

None of us sensu lato is interested in maintaining a hierarchy of Place names wherever or however these are stored.

Managers also recognize that any user-based structured entry kicks the burden down the road of future maintenance because Place names are always a moving target. This is why 5 above is most attractive as a supplement, assuming the service(s) remain(s) functional and it/they manage timely updates. The expectation is that once grabbed, we'd have routines to re-verify by calling those services as needed. The key is we'd not just take text from the service but also transparently capture the stable URIs such that we can unambiguously refer to any one Place as do others for countless other purposes, eg https://sws.geonames.org/6094817/.
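The "store the URI, re-verify later" routine mentioned above can be sketched as a simple staleness check. Everything here is an assumption for illustration: the `LinkedPlace` fields, the one-year threshold, the label text, and the dates. The actual call back to the service is deliberately left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class LinkedPlace:
    label: str           # text captured from the service at link time
    uri: str             # stable URI captured alongside the text
    last_verified: date  # when the link was last checked against the service


def due_for_reverification(places: List[LinkedPlace], today: date,
                           max_age: timedelta = timedelta(days=365)) -> List[LinkedPlace]:
    """Return the places whose last check is older than max_age."""
    return [p for p in places if today - p.last_verified > max_age]


# Hypothetical data; the label is a placeholder, not the service's response.
places = [
    LinkedPlace("label from service", "https://sws.geonames.org/6094817/", date(2020, 1, 1)),
    LinkedPlace("another label", "https://example.org/place/2", date(2021, 5, 1)),
]
stale = due_for_reverification(places, today=date(2021, 6, 1))
print([p.uri for p in stale])  # ['https://sws.geonames.org/6094817/']
```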

As for authoritativeness, we'll not ever find a comprehensive authority. But one requirement for 7 above is that any useful resource has the capacity to be updated or at the very least, has a documented procedure for how amendments are accepted, executed, and timestamped. Would you add coordinates, polygons, or other non-textual elements to what should be available & captured from an external service?

TravisGobeil commented 3 years ago

Thanks for your feedback. Based on the discussion here I am trying to pull together concrete User Requirements. I like the idea of a simple third-party 'tag' but then the science becomes difficult, and reliant on a third-party system. We need to be able to enter data however we want, don't we?

Maybe I'm late to the party, but how come no one has used the term "Toponymy" yet? That's the word we need to centre ourselves around the name of a place, not the precise location (that's Georeferencing). It should be understood that when we are talking about Geography or Placenames, we mean "Toponymy".

Can I get some feedback regarding this specific User Requirement:

User Requirement: A User shall be able to describe a place using commonly accepted geopolitical definitions (known as "Toponymy")

Sub-Requirements:

Would this satisfy all input requirements? Can you list scenarios this set of requirements cannot accommodate and I'll try to massage them or correct them?

michellelocke commented 3 years ago

I'm not fully grasping your ideas for this and hopefully it can be clarified in the meeting. It's really feeling more fancy than we need. Is this meant to be the interpreted location information? So if my label says: CA: ON: Ottawa, CEF, then this is the place to interpret the country, province and municipality to Canada, Ontario and Ottawa?

Where is the field for an interpreted more specific locality? CEF would need to be interpreted as Central Experimental Farm; Pt. Pelee would be Point Pelee National Park, etc. Also location qualifiers like 8km S Ottawa, aren't taken into account when you only stop at municipality with interpreted data. I would also argue that Rocky Mountains is a more specific location within BC. It is absolutely part of the location and is not a "geographic feature". Where is that part of the location going to go? How will these more specific locations that don't fit in the box of country, province or municipality be dealt with?

The CNC DB uses Country, Province and Location. Location encompasses everything that does not fit into country or province but has to do with the location of collection (everything you could find on a map). That is all we need. A place for verbatim and interpreted countries, provinces and locations.

It is a must to be able to enter this data manually and not refer to a service. We currently use a picklist for countries and from that a restricted list of provinces is available to select from. For many of our locations I can see it being easier for us to just choose the country and province from a picklist. I do like Travis's demo showing that you can override data and it is clear which data are from the service and which are overwritten. That is a nice feature.
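The country picklist with a restricted province list can be sketched as a dependent lookup. The table below is a deliberately incomplete, hypothetical sample, and `province_options` and `validate` are illustrative names, not the CNC database's implementation.

```python
from typing import Dict, List, Optional

# Hypothetical, deliberately incomplete picklist data.
PROVINCES_BY_COUNTRY: Dict[str, List[str]] = {
    "Canada": ["Alberta", "British Columbia", "Ontario", "Quebec"],
    "United States": ["Alaska", "Maine", "New York"],
}


def province_options(country: str) -> List[str]:
    """Provinces offered once a country is chosen; empty if the country is unknown."""
    return PROVINCES_BY_COUNTRY.get(country, [])


def validate(country: str, province: Optional[str]) -> List[str]:
    """Return a list of problems; an empty list means the pair is acceptable."""
    problems: List[str] = []
    if country not in PROVINCES_BY_COUNTRY:
        problems.append(f"{country!r} is not in the country picklist")
    elif province is not None and province not in PROVINCES_BY_COUNTRY[country]:
        problems.append(f"{province!r} is not a listed province of {country!r}")
    return problems


print(validate("Canada", "Quebec"))  # []
print(validate("Quebec", None))      # ["'Quebec' is not in the country picklist"]
```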

Toponymy isn't a word we use. You are welcome to use it in discussions but I would not like to bring that word into the database. To me locality is the word we use to refer to a place, anything you can find on a map, no matter how specific or vague. Georeferencing refers to the process of adding coordinates. It does have to do with a specific place, but it's more about putting a dot on a map and how confident you are of that dot. Locality is like giving someone instructions on how to find a place, georeferencing is pointing them to that place on a map (no instructions, just a dot). They are different.

One last note for now, in the video, Travis mentioned that Geographic Places might not be near the Verbatim locality field. I'm not sure what the reasoning would be to have them separated. In the final product I would like to see all related fields together. It is harder for someone to grasp the full picture of where a specimen was found if the verbatim and interpreted data are not together.

dshorthouse commented 3 years ago

Thanks, @michellelocke. Yep, easy to get caught in the weeds. It's a question of scale & maintainability. If it's a few thousand items with narrow geographic scope and temporal range, that's a manual solution. If it's tens of thousands of items with mixed global scope and a bit more temporal range, you'd make updates as a team. If it's 5-10 million items with considerable global scope with significant date ranges, we're edging into more of an industrial approach. The larger the scope, size, and variability in our inputs, the more flexible must be the ways we add structure to verbatim localities to make our records discoverable either for ourselves or for search by outsiders.

We're assuming that under some circumstances, the structure added to verbatim might be limited to nothing more than toponyms. Other times, it's a blend of both georeference data and toponyms. More contemporary records have georeferenced data and, although the toponyms are nice add-ons, the search trajectory might be better accommodated with "draw on a map & show me the stuff from here".

It is a must to be able to enter this data manually and not refer to a service. We currently use a picklist for countries and from that a restricted list of provinces is available to select from.

Agreed, but you've also identified here a rather significant requirement that somewhere within DINA is a singular, static, time-insensitive hierarchy of States/Provinces or other administrative areas nested under Countries. If all our data were in Canada, collected in narrow ranges of time, then no big deal. We're assuming however that CNC and other collections are global with rather significant ranges of geopolitical eras. I would expect that at some point, either at the time of entry or some time thereafter, such a pick-list becomes stale, requiring that Place names (whatever the intent in their capture) be updated. If that pick-list were fully relational, such blanket changes in a hierarchy of Place names might be dangerous. If that pick-list were more like a gazetteer with glorified copy/paste, then that requires a different sort of downstream update. Your pick-list might be better served by an external service. Maybe. Again, it's a question of scale.

re: Central Experimental Farm. Would that have ordinarily been recorded as a Site?

rintoult commented 3 years ago

I would like to reinforce the concept of a site that would not be mapped - things like Joe Blow's farm - this could be covered off by verbatim as was mentioned above.

As I remember being mentioned in previous discussions, we also have changes over time: the provinces of South Africa all changed in the 90s or something like that, so we would need to be able to capture things which might no longer be covered off by current lists.

I am sure this is all covered above - I might not have read all the sentences of all the essays included in this ticket.

Tara

michellelocke commented 3 years ago

@dshorthouse we do not record site separately. Location houses all information that does not fit into Country or Province but points you to the location (anything describing the location is habitat). A site might not be geopolitical but is still mappable. I can look up the CEF in google maps and be pointed to a location. There is still a need to be able to interpret these verbatim location data as they are often written in short form on small labels.

We deal with political boundary change all the time. Ideally one would enter verbatim what is on the label (an out of date country name, like Ceylon) and add the current data to the assertion (Sri Lanka). I would never want this changed on its own if the geopolitical boundaries change in the future. I would want to do that work manually (either by single record or by batch but I am the one doing the work to add an up-to-date assertion). I still don't get how this system would help with changing geopolitical boundaries in an efficient manner. After reading your above comments @dshorthouse, if you are drawing the picklist from a service that is something that I understand and am totally fine with. Maybe I am overthinking Geographic Places. Is it just using a service to source your data from?

If this is the case then great. But I also need to have my interpreted Location information right alongside the Geographic Places. These cannot be separated. Can we add in a Location box for all other data? This box will never be populated by the service and will be for manual entry of text for anything that doesn't fit in the geopolitical box.

michellelocke commented 3 years ago

Separate note: can we call it Interpreted Locality? That is so much more clear than Geographic Places and makes it very clear that you have Verbatim and Interpreted data.

dshorthouse commented 3 years ago

Unitary values for Geographic Place is one source of our problem here. Darwin Core gives no guidance here. @michellelocke and @rintoult have both identified Number 6 in my big list above. Do we want two? As in a Contemporary Geographic Place tab (perhaps through a service and/or manual entry) that will be in flux by design and a more static, Chronologically Correct Geographic Place? Is there a reason for doing this that is not related to search?

rintoult commented 3 years ago

In the meeting the discussion of other languages, another example of country place name mismatches and issues with being able to flag data. We tried to develop a "Nagoya Restriction Scan" and hardly any of our country names matched to the UN list of signatories. Eg Micronesia vs Federated States of Micronesia, Republic of Moldova, UK = United Kingdom of Great Britain and Northern Ireland and on and on.

heathercole commented 3 years ago

another note about bilingualism here (perhaps only related to web-presence), the structured geographic names should probably be available in French too. The verbatim can be in English, but Place names should probably be in both, particularly province and Country. I'm not sure the 'search' functions need to be able to search in French, but certainly Botany has lots of labels in 'other' languages; verbatim can cover the text, but it is not clear if a user must translate to do a search (eg. ferme expérimentale de Frelighsburg)

Place names in English and French https://www.nrcan.gc.ca/earth-sciences/geography/places-official-names-english-and-french/9239

New Brunswick / Nouveau-Brunswick

Second Falls (Falls) | Deuxième Sault (Chute)
Caissie Cape (rural community) | Cap-des-Caissie (communauté rurale)
Grand Falls (Town) | Grand-Sault (Ville)

Application Programming Interface - API

Description

The Canadian GeoNames Search Service allows users to search for current and formerly official geographical names found in the Canadian Geographical Names Data Base (CGNDB) through an Application Programming Interface (API). The CGNDB is Canada’s national authoritative geographical name database. The API is a tool to query the CGNDB using Uniform Resource Identifiers (URIs) like those seen below. Such URIs may be inserted into your Web pages or applications and allow searches by:

• geographical name
• unique key
• coordinates
• alphabetical list

dshorthouse commented 3 years ago

In the meeting the discussion of other languages, another example of country place name mismatches and issues with being able to flag data. We tried to develop a "Nagoya Restriction Scan" and hardly any of our country names matched to the UN list of signatories. Eg Micronesia vs Federated States of Micronesia, Republic of Moldova, UK = United Kingdom of Great Britain and Northern Ireland and on and on.

Does this translate to a requirement for the storage of multiple representations & variants of country names in structured fields depending on the venue where those data are to be shared? Or, does this mean you merely need to be able to edit the country names once in each of the Collecting Events where mismatches are found regardless of whether or not those country names originated from migration of data or a service?

dshorthouse commented 3 years ago

the structured geographic names should probably be available in French too.

Through what mechanism could you ever toggle geographic names (and the names of their upper administrative levels) in other languages if, as we decided, the link to a service that could have provided such a toggle is functionally severed? If we can edit place names drawn from a service, we no longer have a link to the service. There's no magic in whatever might be the web presence to accommodate this unless it bubbles-up from underneath.

Unless I fail to see something, this is an either/or. Either we edit the Place names from a service or we don't. If we don't edit them, we could consume both English and French when first called (assuming the service provided them). If we do want to edit them, how do we store language-based variants in a single field?

cgendreau commented 3 years ago

For countries we will also keep the ISO code so languages and Nagoya Restriction will be much easier to handle at that level.

dshorthouse commented 3 years ago

For countries we will also keep the ISO code so languages and Nagoya Restriction will be much easier to handle at that level.

But would you flush the ISO code when/if the country name is edited? Or, would you also wire-up the service calls to the country field such that whenever it's edited, you'd fetch a refreshed ISO code? There'd be nothing to prevent a user from changing Canada to Italy in the UI (and likewise all the toponyms) after a service call had been made so somehow we'd have to preserve the correct association between text and ISO.

cgendreau commented 3 years ago

In order to keep it simple, at least for now, if the user changes something it's over for all the data above that. So if you edit the country: country, provinces ... are all disconnected so nothing from the service would be preserved and the ISO code would not be available. The first version will not be editable anyway.
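The "edit means disconnect" rule can be sketched as follows. This is an illustration only: `PlaceEntry` and its fields are hypothetical, and keeping the province text as plain, unlinked text after the edit is an assumption about what "disconnected" would mean in practice.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PlaceEntry:
    country: str
    province: Optional[str]
    iso_country_code: Optional[str]  # only present while linked to the service
    service_uri: Optional[str]       # None once the entry is disconnected


def edit_country(entry: PlaceEntry, new_country: str) -> PlaceEntry:
    """Apply a manual country edit and sever everything that came from the service."""
    if new_country == entry.country:
        return entry  # nothing changed, so the link is kept
    return replace(entry, country=new_country,
                   iso_country_code=None, service_uri=None)


linked = PlaceEntry("Canada", "Ontario", "CA", "https://example.org/place/1")
edited = edit_country(linked, "Italy")
print(edited.iso_country_code, edited.service_uri)  # None None
```

Dropping the ISO code on edit avoids the mismatch raised earlier, where the text says "Italy" but the stored code still says Canada.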

heathercole commented 3 years ago

perhaps on your radar already, but I am doing some data imports and trying to address some geography/place name issues, the data I am importing says "South Korea" which is "Republic of Korea" in the database, I think it cannot be expected that most users know the 'official' names, especially when they can be so variable (North Korea = Democratic People's Republic of Korea) etc. I think starting searches 'low' with place names will be very helpful for that. If there is still a need to start 'at the top' then verbatim searching may be relevant, but if someone types 'South Korea' and given the 2 choices above, could be an issue. I am not trying to say that users can't have any responsibility, but their main time/task should be spent on the label transcription, not trying to validate place names, if they have the information "South Korea" on a label, they should be able to associate to the correct country name in the management system without external sources. This is common for a lot of countries (eg Norway = Kingdom of Norway) with commonly used names different than official names. I think there are many possible solutions

cgendreau commented 3 years ago

If it's on the label it's verbatim so there is no Place to validate if the goal is label transcription.

As for the name of the country, that's not a problem since the ISO code is stored so the official name can be attached/displayed later. An external source is required since again, you could have the name of the country in multiple languages so it would quickly become a curatorial issue.

heathercole commented 3 years ago

yes, I think we are saying (mostly) the same thing, "South Korea" would be captured in the verbatim field, but needs to be associated to the 'real/current/official' country in the structured field, if there was a town/place also included and was returned in a search, then association to the country wouldn't be an issue (we assume), but if the only info on the label was 'South Korea' and, since that is a very commonly used name/alias, the system needs to be able to support associating to the correct country (eg. via ISO code). The same way that if someone searched for "Slovenia" they would return the association for the structured-data-field to the country "Republic of Slovenia" . I think I am trying to point out that there are these 'different' names, both of which are current (not historical). I think we are on track, just wanted to communicate this use-case that I am struggling with in Specify.

This may also relate to what data is exported, I imagine lots of the place-name searches use the 'common name', but likely appropriate for data to be exported with official names? It would not be acceptable to provide the public/clients an export with only ISO codes.
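The common-name versus official-name use case could be handled by resolving both through the ISO 3166-1 alpha-2 code, roughly as sketched below. The tables are small hand-made samples: the official names are the ones quoted in this thread, and `COMMON_NAMES`, `to_iso`, and `export_name` are hypothetical names for illustration. A real system would draw these from an external source, as noted above.

```python
from typing import Dict, Optional

# Official names as quoted in this thread, keyed by ISO 3166-1 alpha-2 code.
OFFICIAL_NAME: Dict[str, str] = {
    "KR": "Republic of Korea",
    "KP": "Democratic People's Republic of Korea",
    "FM": "Federated States of Micronesia",
    "MD": "Republic of Moldova",
    "GB": "United Kingdom of Great Britain and Northern Ireland",
    "NO": "Kingdom of Norway",
    "SI": "Republic of Slovenia",
}

# Hypothetical sample of commonly used names and abbreviations.
COMMON_NAMES: Dict[str, str] = {
    "south korea": "KR", "north korea": "KP", "micronesia": "FM",
    "moldova": "MD", "uk": "GB", "united kingdom": "GB",
    "norway": "NO", "slovenia": "SI",
}

# Official names resolve too, so either form finds the same code.
LOOKUP: Dict[str, str] = {**COMMON_NAMES,
                          **{name.casefold(): code for code, name in OFFICIAL_NAME.items()}}


def to_iso(name: str) -> Optional[str]:
    """Resolve a common or official country name to its ISO code, if known."""
    return LOOKUP.get(" ".join(name.casefold().split()))


def export_name(name: str) -> Optional[str]:
    """Return the official name to use in exports, or None if unresolved."""
    code = to_iso(name)
    return OFFICIAL_NAME.get(code) if code else None


print(to_iso("South Korea"))   # KR
print(export_name("Slovenia"))  # Republic of Slovenia
```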

cgendreau commented 3 years ago

export with official names is usually safer I would say.

cgendreau commented 3 years ago

Moving to more concrete tickets.