gbif / parsers

Various GBIF parsers for dates, countries, language, taxon ranks, etc
Apache License 2.0

Continent Parser insufficient #26

Closed tucotuco closed 1 year ago

tucotuco commented 4 years ago

Arctos has been having discussions (https://github.com/tdwg/dwc-qa/issues/128, https://github.com/ArctosDB/arctos/issues/3043, https://github.com/ArctosDB/arctos/issues/1291) about their higher geography, and as always, how the data are accessible on GBIF is of great interest.

I have been participating in their discussion and tracking down why things appear as they do, and my conclusion is that the continent parser approach to interpreting continent is not good. It is being treated as a simple vocabulary (https://github.com/gbif/parsers/blob/master/src/main/resources/dictionaries/parse/continents.tsv), relying on the incoming raw data and mapping them to the seven-continent model.

Nothing wrong with the seven-continent model. The problem is that the raw data don't necessarily follow it, or map unambiguously to it even if they share the same vocabulary (e.g., Oceania isn't the same for all data publishers even if the word matches exactly with a vocabulary value, see image below). Without GBIF being able to make sense of the data at the continent level, I believe there will be little incentive for people to actually model their continental geography in a way that is unambiguous.

The following image shows results of search for interpreted continent "Oceania" with suspicious coordinate locations removed. There are locations outside of the view provided as well, but this much is sufficient to illustrate the point that a string vocabulary match on a concept that is geographical is more problematic than helpful.

image

To index on continent without creating more chaos where less existed in the raw data, I think it would be a far better approach to add a simple layer in the geocode api for continents and use the coordinates to make that interpretation, along with a corresponding warning flag.
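
A minimal sketch of that coordinate-based interpretation, assuming a geocode layer holding one polygon per continent. The rectangles in `TOY_CONTINENTS`, the function names and the flag name `CONTINENT_DERIVED_FROM_COORDINATES` are illustrative assumptions, not the GBIF geocode API or real boundaries.

```python
from typing import List, Optional, Tuple

Polygon = List[Tuple[float, float]]  # (longitude, latitude) vertices

# Toy rectangles for illustration only; NOT real continent boundaries.
TOY_CONTINENTS = {
    "EUROPE": [(-10.0, 36.0), (40.0, 36.0), (40.0, 70.0), (-10.0, 70.0)],
    "AFRICA": [(-18.0, -35.0), (50.0, -35.0), (50.0, 35.0), (-18.0, 35.0)],
}


def point_in_polygon(lon: float, lat: float, polygon: Polygon) -> bool:
    """Ray casting: toggle on each edge crossed by a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def continent_from_coordinates(
    lon: float, lat: float
) -> Tuple[Optional[str], List[str]]:
    """Return (continent, flags); continent is None when no polygon matches."""
    for name, polygon in TOY_CONTINENTS.items():
        if point_in_polygon(lon, lat, polygon):
            return name, ["CONTINENT_DERIVED_FROM_COORDINATES"]
    return None, []
```

A point that falls in no polygon (open ocean, in this toy layer) simply gets no continent and no flag.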

If you are concerned about the ~7% of occurrences that don't have interpreted decimal coordinates, a solution could be to use a vocabulary that takes advantage of the interpreted country code. The problem is, there would have to be some omissions or spurious assignments for the country codes that signify regions in more than one continent (RU, ES, CO, CL, EG, AZ, GE, KZ, TR, VE, US, FR, IT, PT, YE, GR). Well, OK, that list is too long, forget that idea.

MattBlissett commented 4 years ago
  1. There are various definitions of continent. Ours should probably include continental islands (Tasmania in Oceania, Great Britain in Europe). It's also common to group the remaining islands with a "nearby" continent, so Fiji in Oceania etc.
  2. Does the continent include seas on the continental shelf? I.e. the Bass Strait and the English Channel.
  3. At what distance from the shore are occurrences no longer on any continent? We have a map for EEZs, but that's only political and presumably isn't useful here. (See https://labs.gbif.org/geocoder/ )
  4. If seas are part of a continent, where is the boundary to the ocean? We might already have a map for that, SeaVoX or IHO.
  5. (I mistakenly thought there was a DWC term ocean, which might be the complement to dwc:continent.)

Essentially, we need to define what our continents are, and either create a shapefile or (preferably) use someone else's.

tucotuco commented 4 years ago

I propose first that continent include terrestrial locations while the marine ones complement those and use dwc:waterbody. That's the main reason continent and waterbody were split from ContinentOcean in the pre-standard Darwin Core in 2007.

Doing so would allow us to build the continent shapes from and therefore be border-consistent with GADM, which is being used for the rest of terrestrial higher geography. With that in place, the

The marine locations could then be made from the complement of the continents for the shores and the IHO marine regions for the inter-marine borders.

For simplicity and utility I propose that EEZs should be a separate geographic concept that is also eventually indexed and searchable.

timrobertson100 commented 4 years ago

I propose first that continent include terrestrial locations

Is it worth first verifying how continent is expected to be used? If intended to help with regional reports there may be an expectation that data from territorial waters be included.

Might it also be worth considering if it is indeed desirable to buffer into the sea to accommodate coastal records (there are many) and the problem of inaccuracies in the polygons?

tucotuco commented 4 years ago

I was afraid politics might be brought into it. I would caution to be prepared for a hundred different ways of dividing up the geography for grouping purposes. To avoid that I think the best solution is to let people draw what they want. That will miss the 8% that don't have interpreted coordinates - but then again, most of those can't be provided with coordinates anyway.

Maybe a good complement would be to allow people to save and name their regions of interest. Hmm, sounding like locality services. ;-)

MattBlissett commented 4 years ago

Several countries span more than one continent, so I don't think any definition of continent is particularly useful for a political report.

Once you go beyond a coastline, the question is "how far?".

tucotuco commented 4 years ago

Agree with @MattBlissett. And in terms of reports, how far depends on jurisdictions as much as biology or shape precision. I'd make cuts exactly where GADM does, then separate buffers to be included or not by choice as an additional criterion.

MattBlissett commented 3 years ago

If we were to interpret continent, we'd need to

tucotuco commented 3 years ago

If we were to interpret continent, we'd need to

  • Find definitions of the continents, as the terrestrial borders between them do not align with political boundaries

ArcGIS Hub has shape files for an eight continent model (separating Australia from Oceania) at https://hub.arcgis.com/datasets/57c1ade4fa7c4e2384e6a23f2b3bd254_0?geometry=73.828%2C-89.382%2C-73.828%2C86.054.

MattBlissett commented 3 years ago

I started to assemble a 7-continent model from GADM polygons — that is, splitting Turkey into GADM1 regions, and splitting Çanakkale and Istanbul further into GADM2 regions, and assigning these to Europe or Asia. GADM areas don't exist everywhere (e.g. the South American and Oceanian islands of Chile are the same administrative district; an Egyptian district straddles the Suez Canal/River and isn't subdivided).
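
One way such an assembly could be represented, as a rough sketch: each administrative region is assigned to a continent at the coarsest level that is unambiguous, and a lookup prefers the most specific assignment. The region identifiers below are hypothetical (real GADM IDs look different), and the table is a tiny excerpt, not the actual assignment used for the polygons.

```python
from typing import Dict, Optional, Sequence

# Hypothetical identifiers: "COUNTRY", "COUNTRY.Level1", "COUNTRY.Level1.Level2".
ASSIGNMENTS: Dict[str, str] = {
    "SWE": "EUROPE",                 # whole country assigned at level 0
    "TUR.Ankara": "ASIA",            # Turkey is split at level 1
    "TUR.Edirne": "EUROPE",
    "TUR.Istanbul.Fatih": "EUROPE",  # Istanbul is split further at level 2
    "TUR.Istanbul.Kadikoy": "ASIA",
}


def continent_for_region(gadm_path: Sequence[str]) -> Optional[str]:
    """Return the continent of the most specific assigned region, or None."""
    for depth in range(len(gadm_path), 0, -1):
        key = ".".join(gadm_path[:depth])
        if key in ASSIGNMENTS:
            return ASSIGNMENTS[key]
    return None


print(continent_for_region(["TUR", "Istanbul", "Kadikoy"]))
print(continent_for_region(["TUR", "Istanbul"]))
```

Leaving a split region such as `TUR.Istanbul` without its own entry means a record that only resolves to that level stays unassigned rather than being guessed.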

See https://labs.gbif.org/~mblissett/2021/04/continents/ and https://en.wikipedia.org/wiki/Boundaries_between_the_continents_of_Earth

This gets us close, but I think some work with (probably) OpenStreetMap features is necessary to get a Europe/Asia boundary along the Ural River, and for some small irregularities on other borders. The Ural River and mountains will be the most difficult part.

Question: do we want a purely political definition of the continents (i.e. assembled administrative regions), or a purely geographic definition (i.e. Ural river, Suez Canal/River, checking the Colombia/Panama border really is the watershed), or a mix, which I'll call a cultural definition (Italy and Malta all in Europe, even the islands on the African plate).

Continents based on GADM will be very easy to implement and maintain. Continents based on rivers/watersheds etc will be more difficult to implement, but (assuming the mountains don't move very fast) should also be easy to maintain.

Does anyone know of a strict definition of the Europe/Asia boundary? "The Ural mountains" isn't clear at the very north and (probably) south.

The problem is, there would have to be some omissions or spurious assignments for the country codes that signify regions in more than one continent (RU, ES, CO, CL, EG, AZ, GE, KZ, TR, VE, US, FR, IT, PT, YE, GR).

I haven't split Azerbaijan, Georgia or Italy over multiple continents.

ArcGIS Hub has shape files for an eight continent model

This was a useful start, but it's very basic -- Egypt all in Africa, Turkey all in Asia etc.

tucotuco commented 3 years ago

Tough call. I suggest watching https://www.youtube.com/watch?v=3uBcq1x7P34 (again), throwing your hands up in the air and flipping a coin.

Seriously though, I think this depends on where it will be most used. There might be a slight edge for socio-political, given that science doesn't care much about continents.

MattBlissett commented 3 years ago

While looking through what data we had, I saw the WGSRPD standard (http://www.tdwg.org/standards/109) split additional countries into multiple continents, taking a biogeographical approach rather than a political approach. That is the closest I have for a use case where the boundaries are significant.

Since that approach is easily adapted to a political approach by including/excluding a few countries or GADM regions (but the reverse isn't possible), I've put together a similar set of polygons — WGSRPD itself was too low-resolution.

Here's a preview (the GADM0 area borders are an artefact of the assembly process): [image: continents] https://raw.githubusercontent.com/gbif/continents/master/gadm-continents.png

We'll still need to work out how to interpret occurrence data using this, probably similarly to GADM: after coordinate and country interpretation has completed, essentially setting the continent and adding an issue if the provided continent differs. (What about non-georeferenced occurrences?)
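
The step described here could be sketched roughly as follows, assuming the continent has already been derived from the interpreted coordinates by some polygon lookup. The function names and the issue name `CONTINENT_COORDINATE_MISMATCH` are assumptions for illustration, not the actual pipeline code.

```python
from typing import List, Optional, Tuple


def normalise(value: Optional[str]) -> Optional[str]:
    """Upper-case and trim a continent string; empty values become None."""
    if value is None:
        return None
    cleaned = value.strip().upper().replace(" ", "_")
    return cleaned or None


def apply_continent(
    provided: Optional[str], derived: Optional[str]
) -> Tuple[Optional[str], List[str]]:
    """Set the continent from the coordinate lookup and flag disagreement.

    `derived` is the result of an assumed coordinate-based lookup
    (None for non-georeferenced records).
    """
    issues: List[str] = []
    if derived is None:
        # The non-georeferenced case is an open question at this point in
        # the thread, so this sketch leaves the continent uninterpreted.
        return None, issues
    provided_norm = normalise(provided)
    if provided_norm is not None and provided_norm != derived:
        issues.append("CONTINENT_COORDINATE_MISMATCH")  # hypothetical name
    return derived, issues
```

The derived value always wins; the published value is only used to decide whether an issue is attached.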

tucotuco commented 3 years ago

That looks marvelous. For non-georeferenced records, most can be resolved by lookup on the country. Those that can't can be left blank, no?
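
A sketch of that fallback, assuming the multi-continent country codes are the ones listed at the top of this thread and using a tiny illustrative excerpt of a country table. Both tables and the function name are assumptions, not an existing GBIF dictionary.

```python
from typing import Optional

# Codes listed earlier in the thread as spanning more than one continent.
MULTI_CONTINENT = {
    "RU", "ES", "CO", "CL", "EG", "AZ", "GE", "KZ",
    "TR", "VE", "US", "FR", "IT", "PT", "YE", "GR",
}

# Illustrative excerpt only; a real table would cover every ISO 3166-1 code.
COUNTRY_TO_CONTINENT = {
    "SE": "EUROPE",
    "KE": "AFRICA",
    "JP": "ASIA",
    "BR": "SOUTH_AMERICA",
}


def continent_from_country(country_code: Optional[str]) -> Optional[str]:
    """Return the continent for an unambiguous country code, else None (blank)."""
    if not country_code:
        return None
    code = country_code.strip().upper()
    if code in MULTI_CONTINENT:
        return None  # ambiguous: leave the continent blank
    return COUNTRY_TO_CONTINENT.get(code)
```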

MattBlissett commented 3 years ago

We could parse the published continent value, then compare it to the interpreted country/countryCode (SE must be Europe, TR can be Europe or Asia etc), removing it if it conflicts.

We would end up with some non-georeferenced marine observations having a continent, where that has nevertheless been provided by the publisher.

But, this isn't much different from how we interpret country/countryCode without coordinates.
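
Roughly, in code: parse the published value against a vocabulary, then drop it if the interpreted country code rules it out. The vocabulary and the allowed-continent table are small hypothetical excerpts; only the SE and TR entries come from the example above.

```python
from typing import Dict, Optional, Set

# Hypothetical excerpt of a continent vocabulary (raw string -> concept).
CONTINENT_VOCABULARY: Dict[str, str] = {
    "europe": "EUROPE",
    "europa": "EUROPE",
    "asia": "ASIA",
    "africa": "AFRICA",
}

# Continents each country code may fall in (excerpt).
ALLOWED: Dict[str, Set[str]] = {
    "SE": {"EUROPE"},
    "TR": {"EUROPE", "ASIA"},
}


def parse_continent(raw: Optional[str]) -> Optional[str]:
    """Map a published continent string to a vocabulary concept, if known."""
    if raw is None:
        return None
    return CONTINENT_VOCABULARY.get(raw.strip().lower())


def continent_without_coordinates(
    raw: Optional[str], country_code: Optional[str]
) -> Optional[str]:
    """Keep the published continent unless it conflicts with the country."""
    parsed = parse_continent(raw)
    if parsed is None:
        return None
    allowed = ALLOWED.get(country_code) if country_code else None
    if allowed is not None and parsed not in allowed:
        return None  # conflict with the interpreted country: remove it
    return parsed
```

With no country code to check against (for example a marine record), the published continent is kept, which matches the caveat above.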

tucotuco commented 3 years ago

That all seems reasonable to me.

MattBlissett commented 1 year ago

This is now working on gbif.org, thanks for everyone's comments.

For the moment, the map still shows data where there is a continent issue (unlike for country issues). In time, if data quality improves, we can exclude by default occurrences with incorrect continent values.