gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

geocoding/flagging iso2 centroid locations on GBIF #4232

Closed jhnwllr closed 5 months ago

jhnwllr commented 1 year ago

background

Sometimes data publishers do not know the exact lat-lon location of a record and enter the lat-lon center of the locality instead. This is a data issue because users might be unaware that an observation is pinned to a locality center and assume it is a precise location.

From previous work it is known that most centroids come from museum collections (basisOfRecord=PRESERVED_SPECIMEN).

false positives problem

@MattBlissett has pointed out that in some cases, if we were to flag records, we would end up flagging many "non-centroid" false positives.

The figure shows the UK centroid, where many non-centroid human observations are mixed with fewer "real centroids" from records that were probably geocoded retrospectively. Many museum records sit directly on the centroid, but as a user you are probably also concerned about the few museum records somewhat further away from it. (Rings are 2 km and 5 km buffers.)

image

Publishers would probably not like to have many records flagged that just happen to be near centroids.

Of course, for some centroids this isn't a problem at all.

image

users vs publishers

Users who want to filter for centroid locations are more interested in making sure no outliers make it into their models than in avoiding false positives. So most users would rather over-flag centroids.

Publishers would rather we be more judicious with flagging, so their datasets don't get littered with false positives.

This is why I recommend treating centroids more neutrally as a geocoded location rather than a data quality flag.

data quality flag vs location filter

Fake potential UI below.

image

In my view, thinking about centroids as useful, interesting neutral locations rather than as data quality problems/flags makes centroids easier to work with. Since we are never going to eliminate all false positives, it makes more sense to treat centroids as locations. This becomes even more apparent when we start talking about province and state centroids, which are also useful locations, but will produce even more "false positives".

There are some really small provinces, but it would still be useful to geocode the centroids.

image

One disadvantage to treating centroids as simply geocoded locations would be that we might need to include an additional column in downloads to make it useful for users. Also it is difficult to filter out unwanted records with the current interface.

Below you can review 30 sampled centroids for iso2 places over 30K sq km. It is usually impossible to tell whether a point on a centroid is a "real" centroid, but if a preserved specimen is close to a known centroid, it is highly likely to be one.

| iso2 | name | n_preserved_specimen | n_human_observation | source | centroid_2km |
|---|---|---|---|---|---|
| AO | Angola | 154 | 0 | geolocate | link |
| AU | Australia | 17 | 2 | CoordinateCleaner | link |
| BE | Belgium | 1009 | 12011 | geolocate | link |
| CM | Cameroon | 20 | 0 | geolocate | link |
| CA | Canada | 1326 | 0 | geolocate | link |
| CN | China | 1979 | 0 | geolocate | link |
| CI | Cote d'Ivoire | 368 | 0 | CoordinateCleaner | link |
| GL | Greenland | 122 | 0 | CoordinateCleaner | link |
| GT | Guatemala | 1287 | 80 | geolocate | link |
| HU | Hungary | 54 | 136 | CoordinateCleaner | link |
| IE | Ireland | 94 | 158 | geolocate | link |
| IT | Italy | 2005 | 1 | geolocate | link |
| LR | Liberia | 278 | 0 | geolocate | link |
| MG | Madagascar | 2024 | 0 | geolocate | link |
| MR | Mauritania | 3 | 0 | geolocate | link |
| MA | Morocco | 1 | 0 | CoordinateCleaner | link |
| NL | Netherlands | 476 | 22423 | geolocate | link |
| NZ | New Zealand | 24 | 225 | CoordinateCleaner | link |
| NI | Nicaragua | 152 | 128 | geolocate | link |
| NE | Niger | 68 | 0 | CoordinateCleaner | link |
| NG | Nigeria | 64 | 74 | CoordinateCleaner | link |
| SA | Saudi Arabia | 0 | 0 | CoordinateCleaner | link |
| RS | Serbia | 92 | 5 | CoordinateCleaner | link |
| SE | Sweden | 45 | 866 | CoordinateCleaner | link |
| CH | Switzerland | 167 | 72 | CoordinateCleaner | link |
| TW | Taiwan | 1694 | 2099 | CoordinateCleaner | link |
| AE | United Arab Emirates | 5 | 0 | CoordinateCleaner | link |
| GB | United Kingdom | 76 | 1134 | CoordinateCleaner | link |
| VN | Vietnam | 304 | 4 | geolocate | link |
| ZW | Zimbabwe | 5 | 29 | geolocate | link |

This brain dump reflects my current thoughts. Open to any divergent opinions or discussion.

@timrobertson100 @ahahn-gbif @MattBlissett

jhnwllr commented 1 year ago

@MortenHofft has raised the point that the data quality flag for publishers could be separated from the neutrally geocoded records. I agree with this perspective.

The data quality flag could have the following properties:

jhnwllr commented 1 year ago

We probably want the action/menu to be "exclude within 2/5/10km of centroid".

tucotuco commented 1 year ago

@jhnwllr This is great work. However, I want to challenge a premise.

Users who want to filter for centroid locations are more interested in making sure no outliers make it into their models than in avoiding false positives. So most users would rather over-flag centroids.

Scientists who do not want inappropriate data in their models need to verify those data before using them. Without the explicit measure of uncertainty they can't do that, centroid or not, outlier or not. That doesn't mean there is anything wrong with your attempts to help highlight records that require review before use, but there is a MUCH simpler and all-encompassing test for that: does the location have either a geospatial footprint (an actual geometry in the data) or an uncertainty for the coordinates (as a distance)? Every record that gets used in modeling should have one of those. If the records were properly georeferenced, you wouldn't have to worry about geocoding or flagging; you would already have it in coordinateUncertaintyInMeters. Another focus of effort could be to provide the coordinates and uncertainty for records that do not have coordinates, but do have unambiguous administrative geography, to the highest specificity you can.

Independently, keep in mind that there are also many different kinds of centroids (e.g., Australia has at least five), and your proposal covers just one of those.

ArthurChapman commented 1 year ago

There are a number of issues with (country) centroids not discussed above. I agree largely with what @tucotuco has said.

Take Australia, for example. Many early records for Australia just say "Nova Hollandia" or "Australia", and a default has been placed at the "center" of Australia [but see below]. When these observations were made, the only European discovery was around the Australian coast - and definitely not in the desert in central Australia. A better representation would be a footprint derived by buffering the coast (or in many cases, just the east coast around Sydney/Botany Bay, etc.). One ends up with a record at the centroid, in the Australian desert, for an observation from the wet coast.

Just to illustrate this point, there is a record of a marine starfish on GBIF at -25.274, 133.775 (middle of Australia's desert) - which in itself is an interesting centroid that I think comes from the CIA database, and whose scientific basis I have not worked out.

The determination of the centroid of a country can be carried out using a number of methods - none are wrong, just determined using different methods. In the case of Australia, this can result in differences of hundreds of kilometers.

"Nova Hollandia" or "Australia" on ALA
-25.274, 133.775
(24 records on ALA within 1 km)

Centre of Gravity Method
23° 07'S, 132° 08'E (-23.1166667, 132.1333333)
(0 records on ALA within 1 km)

Lambert Gravitational Centre
25° 36' 36.4"S, 134° 21' 17.3"E (-25.610111, 134.3548056)
(Lots of records on ALA within 1 km - also it comes up on the map on ALA about 300 m from where Geoscience Australia places it!) – Datum problem?

Furthest Point from Coastline
23° 02'S, 132° 10'E (-23.033333, 132.1666667)
(0 records on ALA within 1 km)

Geodetic Median Point
23° 33' 09.89"S, 133° 23' 46.00"E (-23.5527472, 133.396111)
(8 records on ALA within 1 km)

Johnston Geodetic Station
25° 56' 49.3"S, 133° 12' 34.7"E (-25.9470278, 133.2096389)
(lots of records on ALA within 1 km)

Note that a lot of records from both Lambert's Centre and Johnston's Geodetic Centre are actual observations from those places (I have made recent observations there myself with a determined uncertainty) - i.e. high precision records. Many, however, are defaults of "Australia".
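The degree-minute-second values listed above can be checked against their decimal equivalents with a few lines of Python. This is only a sketch: the function name `dms_to_decimal` is made up for illustration, and the only inputs are two of the coordinate pairs quoted in this comment.

```python
def dms_to_decimal(degrees, minutes, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


# Lambert Gravitational Centre, as quoted above.
lambert_lat = dms_to_decimal(25, 36, 36.4, "S")
lambert_lon = dms_to_decimal(134, 21, 17.3, "E")

# Johnston Geodetic Station, as quoted above.
johnston_lat = dms_to_decimal(25, 56, 49.3, "S")
johnston_lon = dms_to_decimal(133, 12, 34.7, "E")

print(round(lambert_lat, 6), round(lambert_lon, 7))
print(round(johnston_lat, 7), round(johnston_lon, 7))
```

A check like this is a cheap way to catch transcription slips when a centroid is copied between the two notations.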

So, in summary, there are several issues.

What is "Australia"

  1. Mainland Australia
  2. Mainland Australia + Tasmania
  3. Mainland Australia + Tasmania + main islands (Lord Howe, Norfolk, Christmas)
  4. EEZ
  5. Continental Shelf
  6. Australia + territories including Australia Antarctica Area.

How was the center (centroid) determined

  1. Centre of Gravity Method
  2. Lambert Gravitational Centre
  3. Furthest Point from Coastline
  4. Geodetic Median Point
  5. Johnston Geodetic Station

There are similar issues with the Australian States and determining the centroids there.

MortenHofft commented 1 year ago

Scientists who do not want inappropriate data in their models need to verify those data before using them. Without the explicit measure of uncertainty they can't do that, centroid or not, outlier or not

Records without an uncertainty stated: 1,373,034,036 (out of 2,104,374,546 with coordinates)

A flag that the record is missing uncertainty might be useful in that case since ≈2/3 are missing it.

tucotuco commented 1 year ago

I think it would be useful, but also make sure not to flag it if there is a footprintWKT provided that is not just a POINT.
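A minimal sketch of the rule discussed in the last two comments: flag a record that has coordinates but neither a stated uncertainty nor a footprint that is more than a bare POINT. The function name and the use of a plain dict keyed by Darwin Core term names are assumptions for illustration, not how GBIF interpretation is actually implemented.

```python
def lacks_spatial_uncertainty(record):
    """Return True when a georeferenced record has no usable uncertainty.

    `record` is a plain dict keyed by Darwin Core term names (an assumption
    made for this sketch).
    """
    has_coords = (
        record.get("decimalLatitude") is not None
        and record.get("decimalLongitude") is not None
    )
    if not has_coords:
        return False  # nothing to flag: the record has no coordinates

    if record.get("coordinateUncertaintyInMeters") is not None:
        return False  # an explicit uncertainty is present

    wkt = (record.get("footprintWKT") or "").strip().upper()
    # A footprint only counts when it is a real geometry, not a bare POINT.
    if wkt and not wkt.startswith("POINT"):
        return False
    return True


example = {"decimalLatitude": -25.274, "decimalLongitude": 133.775}
print(lacks_spatial_uncertainty(example))
```

The prefix test on the WKT string is deliberately crude; a real implementation would more likely parse the geometry.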

jhnwllr commented 1 year ago

I have taken the time to extract centroids for iso2 places from the Getty Thesaurus of Geographic Names (TGN). I have pasted them all here for reference.

The TGN is a source of many centroids on GBIF.

Interestingly, we are still missing the centroids that @ArthurChapman points to for Australia. Extracting these centroids is already a bit dodgy, so source="Arthur Chapman" might be a good solution for AU centroids.

The point about dwc:coordinateUncertaintyInMeters that @tucotuco raises is totally correct. But I think we can still geocode centroids while also encouraging publishers to fill in this very important field.

| iso2 | tgn_name | lat | lon | n_specimen | n_hobservation | source | centroid_2km |
|---|---|---|---|---|---|---|---|
| MX | Mexico | 23 | -102 | 18648 | 16 | TGN | link |
| PH | Philippines | 13 | 122 | 7604 | 6 | TGN | link |
| SG | Singapore | 1.3667 | 103.8 | 7026 | 35108 | TGN | link |
| MU | Mauritius | -20.3 | 57.5833 | 5947 | 28 | TGN | link |
| NF | Norfolk Island | -29.033333 | 167.95 | 5553 | 6872 | TGN | link |
| SE | Sweden | 62 | 15 | 5000 | 255 | TGN | link |
| CU | Cuba | 21.5 | -80 | 4748 | 0 | TGN | link |
| CH | Switzerland | 47 | 8 | 4743 | 23677 | TGN | link |
| FR | France | 46 | 2 | 4573 | 1984 | TGN | link |
| JP | Japan | 36 | 138 | 4313 | 81 | TGN | link |
| GY | Guyana | 5 | -59 | 4180 | 0 | TGN | link |
| DE | Germany | 51.5 | 10.5 | 3769 | 18 | TGN | link |
| FJ | Fiji | -18 | 178 | 3162 | 4 | TGN | link |
| NC | New Caledonia | -21.5 | 165.5 | 3101 | 9 | TGN | link |
| BR | Brazil | -10 | -55 | 2929 | 2 | TGN | link |
| AU | Australia | -25 | 135 | 2874 | 9 | TGN | link |
| BM | Bermuda | 32.3333 | -64.75 | 2676 | 9249 | TGN | link |
| CX | Christmas Island | -10.5 | 105.6667 | 2645 | 226 | TGN | link |
| DK | Denmark | 56 | 10 | 2633 | 4452 | TGN | link |
| CD | Democratic Republic of the Congo | -0.0167 | 25 | 2585 | 0 | TGN | link |
| CR | Costa Rica | 10 | -84 | 2495 | 1154 | TGN | link |
| US | United States | 38 | -98 | 2370 | 3714 | TGN | link |
| BS | Bahamas | 24 | -76 | 2276 | 0 | TGN | link |
| JM | Jamaica | 18.25 | -77.5 | 2206 | 24 | TGN | link |
| BN | Brunei Darussalam | 4.5 | 114.6667 | 2130 | 0 | TGN | link |
| PA | Panama | 9 | -80 | 2037 | 0 | TGN | link |
| MG | Madagascar | -20 | 47 | 2025 | 0 | TGN | link |
| LU | Luxembourg | 49.75 | 6.1667 | 2019 | 4737 | TGN | link |
| IT | Italy | 42.8333 | 12.8333 | 2005 | 1 | TGN | link |
| CN | China | 35 | 105 | 1978 | 0 | TGN | link |
| FO | Faeroe Islands | 62 | -7 | 1939 | 81 | TGN | link |
| AW | Aruba | 12.5 | -69.9667 | 1837 | 3045 | TGN | link |
| CW | Curacao | 12.166 | -69 | 1799 | 11517 | TGN | link |
| NZ | New Zealand | -42 | 174 | 1685 | 910 | TGN | link |
| PG | Papua New Guinea | -6 | 147 | 1636 | 0 | TGN | link |
| HU | Hungary | 47 | 20 | 1623 | 123 | TGN | link |
| LK | Sri Lanka | 7 | 81 | 1620 | 7 | TGN | link |
| ZA | South Africa | -30 | 26 | 1579 | 0 | TGN | link |
| BB | Barbados | 13.1667 | -59.5333 | 1559 | 1598 | TGN | link |
| SB | Solomon Islands | -8 | 159 | 1500 | 0 | TGN | link |
| TN | Tunisia | 34 | 9 | 1397 | 4 | TGN | link |
| DO | Dominican Republic | 19 | -70.6667 | 1338 | 0 | TGN | link |
| CA | Canada | 60 | -96 | 1326 | 0 | TGN | link |
| AD | Andorra | 42.55 | 1.583 | 1308 | 3238 | TGN | link |
| GT | Guatemala | 15.5 | -90.25 | 1287 | 80 | TGN | link |
| MQ | Martinique | 14.6667 | -61 | 1284 | 103 | TGN | link |
| EC | Ecuador | -2 | -77.5 | 1256 | 0 | TGN | link |
| AT | Austria | 47.3333 | 13.3333 | 1206 | 324 | TGN | link |
| CZ | Czech Republic | 49.75 | 15 | 1206 | 38 | TGN | link |
| GP | Guadeloupe | 16.25 | -61.5833 | 1203 | 3547 | TGN | link |
| BZ | Belize | 17.25 | -88.75 | 1201 | 18302 | TGN | link |
| IN | India | 20 | 77 | 1190 | 2 | TGN | link |
| MS | Montserrat | 16.75 | -62.2 | 1141 | 3798 | TGN | link |
| TZ | Tanzania | -6 | 35 | 1138 | 663 | TGN | link |
| KE | Kenya | 1 | 38 | 1132 | 11 | TGN | link |
| FI | Finland | 64 | 26 | 1092 | 525 | TGN | link |
| RU | Russia | 60 | 47 | 1070 | 0 | TGN | link |
| HK | Hong Kong | 22.25 | 114.1667 | 1066 | 6839 | TGN | link |
| BE | Belgium | 50.8333 | 4 | 1009 | 12011 | TGN | link |
| SC | Seychelles | -4.5833 | 55.6667 | 977 | 4 | TGN | link |
| CO | Colombia | 4 | -72 | 963 | 2 | TGN | link |
| ID | Indonesia | -5 | 120 | 952 | 0 | TGN | link |
| RE | Reunion | -21.1 | 55.6 | 927 | 2401 | TGN | link |
| CL | Chile | -30 | -71 | 890 | 4852 | TGN | link |
| GL | Greenland | 72 | -40 | 865 | 0 | TGN | link |
| PE | Peru | -10 | -76 | 857 | 3 | TGN | link |
| TT | Trinidad and Tobago | 11 | -61 | 850 | 0 | TGN | link |
| SN | Senegal | 14 | -14 | 806 | 12 | TGN | link |
| IS | Iceland | 65 | -18 | 772 | 189 | TGN | link |
| HT | Haiti | 19 | -72.4167 | 761 | 0 | TGN | link |
| GD | Grenada | 12.1167 | -61.6667 | 757 | 701 | TGN | link |
| ES | Spain | 40 | -4 | 753 | 184 | TGN | link |
| TH | Thailand | 15 | 100 | 751 | 23 | TGN | link |
| KN | Saint Kitts and Nevis | 17.3333 | -62.75 | 743 | 400 | TGN | link |
| MT | Malta | 35.9167 | 14.4167 | 731 | 1603 | TGN | link |
| PY | Paraguay | -23 | -58 | 721 | 0 | TGN | link |
| PN | Pitcairn Islands | -25.0667 | -130.1 | 721 | 105 | TGN | link |
| EE | Estonia | 59 | 26 | 720 | 1147 | TGN | link |
| MC | Monaco | 43.7333 | 7.4167 | 713 | 4454 | TGN | link |
| JE | Jersey | 49.2167 | -2.1167 | 696 | 2053 | TGN | link |
| DZ | Algeria | 28 | 3 | 674 | 0 | TGN | link |
| NO | Norway | 62 | 10 | 620 | 535 | TGN | link |
| GR | Greece | 39 | 22 | 588 | 19 | TGN | link |
| RO | Romania | 46 | 25 | 586 | 136 | TGN | link |
| MN | Mongolia | 46 | 105 | 576 | 1 | TGN | link |
| SR | Suriname | 4 | -56 | 566 | 0 | TGN | link |
| DM | Dominica | 15.5 | -61.3333 | 558 | 3 | TGN | link |
| TW | Taiwan | 24 | 121 | 547 | 20602 | TGN | link |
| CV | Cape Verde | 16 | -24 | 542 | 0 | TGN | link |
| CM | Cameroon | 6 | 12 | 539 | 0 | TGN | link |
| VE | Venezuela | 8 | -66 | 523 | 0 | TGN | link |
| SK | Slovakia | 48.6667 | 19.5 | 507 | 3 | TGN | link |
| EG | Egypt | 27 | 30 | 499 | 0 | TGN | link |
| LI | Liechtenstein | 47.1667 | 9.5333 | 496 | 8749 | TGN | link |
| GI | Gibraltar | 36.1333 | -5.35 | 469 | 8711 | TGN | link |
| BL | Saint Barthélemy | 17.9 | -62.833 | 459 | 5430 | TGN | link |
| AG | Antigua and Barbuda | 17.05 | -61.8 | 444 | 138 | TGN | link |
| CI | Côte d'Ivoire | 8 | -5 | 423 | 0 | TGN | link |
| MP | Northern Mariana Islands | 15.213 | 145.755 | 422 | 1290 | TGN | link |
| SH | Saint Helena, Ascension and Tristan da Cunha | -15.95 | -5.7 | 395 | 82 | TGN | link |
| PT | Portugal | 39.5 | -8 | 391 | 2132 | TGN | link |
| YT | Mayotte | -12.8333 | 45.1667 | 386 | 309 | TGN | link |
| PL | Poland | 52 | 20 | 380 | 242 | TGN | link |
| MH | Marshall Islands | 10 | 167 | 380 | 0 | TGN | link |
| AM | Armenia | 40 | 45 | 373 | 9 | TGN | link |
| BO | Bolivia | -17 | -65 | 367 | 1 | TGN | link |
| AI | Anguilla | 18.2167 | -63.05 | 361 | 1668 | TGN | link |
| IL | Israel | 31.5 | 34.75 | 359 | 125 | TGN | link |
| SY | Syria | 35 | 38 | 359 | 0 | TGN | link |
| BW | Botswana | -22 | 24 | 348 | 0 | TGN | link |
| MW | Malawi | -13.5 | 34 | 346 | 0 | TGN | link |
| KR | South Korea | 37 | 127.5 | 341 | 90 | TGN | link |
| AR | Argentina | -34 | -64 | 336 | 10 | TGN | link |
| MZ | Mozambique | -18.25 | 35 | 330 | 0 | TGN | link |
| LC | Saint Lucia | 13.8833 | -60.9667 | 330 | 190 | TGN | link |
| MM | Myanmar | 22 | 98 | 326 | 0 | TGN | link |
| ZW | Zimbabwe | -19 | 29 | 288 | 0 | TGN | link |
| SL | Sierra Leone | 8.5 | -11.5 | 286 | 0 | TGN | link |
| VN | Viet Nam | 16 | 106 | 280 | 0 | TGN | link |
| LR | Liberia | 6.5 | -9.5 | 278 | 0 | TGN | link |
| GU | Guam | 13.4667 | 144.8333 | 268 | 385 | TGN | link |
| KM | Comoros | -12.1667 | 44.25 | 255 | 1 | TGN | link |
| UG | Uganda | 2 | 33 | 253 | 1 | TGN | link |
| KY | Cayman Islands | 19.5 | -80.6667 | 245 | 0 | TGN | link |
| VC | Saint Vincent and the Grenadines | 13.0833 | -61.2 | 243 | 10 | TGN | link |
| NR | Nauru | -0.5333 | 166.9167 | 243 | 1003 | TGN | link |
| IR | Iran | 32 | 53 | 240 | 53 | TGN | link |
| GA | Gabon | -1 | 11.75 | 225 | 2 | TGN | link |
| GE | Georgia | 42 | 43.5 | 223 | 32 | TGN | link |
| UY | Uruguay | -33 | -56 | 222 | 0 | TGN | link |
| NL | Netherlands | 52.5 | 5.75 | 219 | 1808 | TGN | link |
| MY | Malaysia | 2.5 | 112.5 | 217 | 0 | TGN | link |
| VU | Vanuatu | -16 | 167 | 212 | 1 | TGN | link |
| VG | British Virgin Islands | 18.5 | -64.5 | 212 | 1 | TGN | link |
| TR | Turkey | 39 | 35 | 210 | 7 | TGN | link |
| NG | Nigeria | 10 | 8 | 210 | 0 | TGN | link |
| NI | Nicaragua | 13 | -85 | 205 | 0 | TGN | link |
| PW | Palau | 6 | 134 | 200 | 13 | TGN | link |
| NP | Nepal | 28 | 84 | 198 | 0 | TGN | link |
| BJ | Benin | 9.5 | 2.25 | 197 | 33 | TGN | link |
| PK | Pakistan | 30 | 70 | 192 | 0 | TGN | link |
| BD | Bangladesh | 24 | 90 | 191 | 2 | TGN | link |
| HR | Croatia | 45.1667 | 15.5 | 191 | 4 | TGN | link |
| HN | Honduras | 15 | -86.5 | 188 | 0 | TGN | link |
| NA | Namibia | -22 | 17 | 187 | 7 | TGN | link |
| YE | Yemen | 15.5 | 47.5 | 174 | 0 | TGN | link |
| WF | Wallis and Futuna Islands | -13.3 | -176.2 | 168 | 84 | TGN | link |
| MA | Morocco | 32 | -5 | 167 | 8 | TGN | link |
| GH | Ghana | 8 | -2 | 163 | 6 | TGN | link |
| SA | Saudi Arabia | 25 | 45 | 162 | 0 | TGN | link |
| AO | Angola | -12.5 | 18.5 | 154 | 0 | TGN | link |
| TO | Tonga | -20 | -175 | 150 | 19 | TGN | link |
| CY | Cyprus | 35 | 33 | 133 | 124 | TGN | link |
| KH | Cambodia | 13 | 105 | 127 | 0 | TGN | link |
| SD | Sudan | 16 | 30 | 126 | 0 | TGN | link |
| ET | Ethiopia | 8 | 39 | 122 | 0 | TGN | link |
| VA | Holy See | 41.903 | 12.453 | 119 | 9301 | TGN | link |
| GF | French Guiana | 4 | -53 | 116 | 365 | TGN | link |
| IQ | Iraq | 33 | 44 | 115 | 0 | TGN | link |
| NU | Niue | -19.0333 | -169.8667 | 114 | 11 | TGN | link |
| MO | Macau | 22.1667 | 113.55 | 106 | 2072 | TGN | link |
| KP | North Korea | 40 | 127 | 105 | 0 | TGN | link |
| PF | French Polynesia | -15 | -140 | 105 | 0 | TGN | link |
| GM | The Gambia | 13.5 | -15.5 | 104 | 45 | TGN | link |
| OM | Oman | 21 | 57 | 102 | 0 | TGN | link |
| UA | Ukraine | 49 | 32 | 102 | 0 | TGN | link |
| IE | Ireland | 53 | -8 | 94 | 158 | TGN | link |
| GN | Guinea | 11 | -10 | 87 | 0 | TGN | link |
| ZM | Zambia | -15 | 30 | 85 | 0 | TGN | link |
| FK | Falkland Islands | -51.75 | -59 | 83 | 0 | TGN | link |
| MV | Maldives | 3.2 | 73 | 77 | 0 | TGN | link |
| BA | Bosnia and Herzegovina | 44.25 | 17.8333 | 73 | 0 | TGN | link |
| TC | Turks and Caicos Islands | 21.7333 | -71.5833 | 69 | 0 | TGN | link |
| KW | Kuwait | 29.5 | 47.75 | 65 | 0 | TGN | link |
| SZ | Swaziland | -26.5 | 31.5 | 65 | 2 | TGN | link |
| SV | El Salvador | 13.8333 | -88.9167 | 65 | 1 | TGN | link |
| SM | San Marino | 43.9333 | 12.4167 | 64 | 109 | TGN | link |
| BT | Bhutan | 27.5 | 90.5 | 64 | 4922 | TGN | link |
| LB | Lebanon | 33.8333 | 35.8333 | 60 | 3 | TGN | link |
| BV | Bouvet Island | -54.4333 | 3.4 | 57 | 17 | TGN | link |
| PS | Gaza Strip | 31.4167 | 34.3333 | 56 | 75 | TGN | link |
| AF | Afghanistan | 33 | 65 | 55 | 0 | TGN | link |
| WS | Samoa | -13.8 | -172.133333 | 53 | 35 | TGN | link |
| LV | Latvia | 57 | 25 | 53 | 6 | TGN | link |
| GW | Guinea-Bissau | 12 | -15 | 52 | 22 | TGN | link |
| LA | Laos | 18 | 105 | 49 | 0 | TGN | link |
| GQ | Equatorial Guinea | 2 | 10 | 49 | 0 | TGN | link |
| DJ | Djibouti | 11.5 | 42.5 | 42 | 0 | TGN | link |
| CF | Central African Republic | 7 | 21 | 41 | 0 | TGN | link |
| CG | Congo | -1 | 15 | 41 | 0 | TGN | link |
| KI | Kiribati | -5 | -170 | 40 | 0 | TGN | link |
| NE | Niger | 16 | 8 | 38 | 1 | TGN | link |
| ER | Eritrea | 15 | 39 | 38 | 0 | TGN | link |
| AL | Albania | 41 | 20 | 38 | 4 | TGN | link |
| PS | State of Palestine | 31.92157 | 35.20329 | 38 | 640 | TGN | link |
| SO | Somalia | 6 | 48 | 35 | 0 | TGN | link |
| PM | Saint Pierre and Miquelon | 46.8333 | -56.3333 | 31 | 2 | TGN | link |
| IM | Isle of Man | 54.25 | -4.5 | 29 | 3438 | TGN | link |
| UZ | Uzbekistan | 41 | 64 | 29 | 0 | TGN | link |
| TJ | Tajikistan | 39 | 71 | 28 | 0 | TGN | link |
| MK | North Macedonia | 41.666 | 21.75 | 26 | 61 | TGN | link |
| KZ | Kazakhstan | 48 | 68 | 24 | 0 | TGN | link |
| LY | Libya | 25 | 17 | 22 | 0 | TGN | link |
| RW | Rwanda | -2 | 30 | 21 | 26 | TGN | link |
| TM | Turkmenistan | 40 | 60 | 21 | 0 | TGN | link |
| ML | Mali | 17 | -4 | 20 | 1 | TGN | link |
| CC | Cocos Islands | -12 | 96.8333 | 17 | 4 | TGN | link |
| LT | Lithuania | 56 | 24 | 17 | 3 | TGN | link |
| KG | Kyrgyzstan | 41 | 75 | 17 | 0 | TGN | link |
| BF | Burkina Faso | 13 | -2 | 16 | 0 | TGN | link |
| SJ | Svalbard | 78 | 20 | 16 | 404 | TGN | link |
| ME | Montenegro | 42.5 | 19.3333 | 16 | 23 | TGN | link |
| ST | Sao Tome and Principe | 1 | 7 | 15 | 0 | TGN | link |
| TL | Timor-Leste | -8.5833 | 126 | 14 | 1 | TGN | link |
| AZ | Azerbaijan | 40.5 | 47.5 | 14 | 0 | TGN | link |
| SI | Slovenia | 46.083 | 15 | 14 | 15 | TGN | link |
| JO | Jordan | 31 | 36 | 13 | 1 | TGN | link |
| FM | Federated States of Micronesia | 5 | 152 | 12 | 0 | TGN | link |
| AS | American Samoa | -14.3167 | -170.5 | 12 | 0 | TGN | link |
| TG | Togo | 8 | 1.1667 | 12 | 0 | TGN | link |
| GS | South Georgia and South Sandwich Islands | -56 | -33 | 11 | 0 | TGN | link |
| TV | Tuvalu | -8 | 178 | 10 | 0 | TGN | link |
| LS | Lesotho | -29.5 | 28.25 | 10 | 0 | TGN | link |
| PS | West Bank | 32 | 35.25 | 10 | 118 | TGN | link |
| BI | Burundi | -3.5 | 30 | 9 | 0 | TGN | link |
| GB | United Kingdom | 54 | -4.5 | 9 | 162 | TGN | link |
| TF | French Southern and Antarctic Lands | -43 | 67 | 8 | 1 | TGN | link |
| TD | Chad | 15 | 19 | 7 | 0 | TGN | link |
| BG | Bulgaria | 42.666 | 25.25 | 6 | 37 | TGN | link |
| BH | Bahrain | 26 | 50.5 | 6 | 267 | TGN | link |
| AE | United Arab Emirates | 24 | 54 | 5 | 0 | TGN | link |
| TK | Tokelau | -9 | -171.75 | 4 | 0 | TGN | link |
| MD | Moldova | 47.25 | 28.583 | 4 | 18 | TGN | link |
| BY | Belarus | 53 | 28 | 4 | 0 | TGN | link |
| GG | Guernsey | 49.5833 | -2.333 | 3 | 0 | TGN | link |
| IO | British Indian Ocean Territory | -7 | 72.0167 | 3 | 0 | TGN | link |
| SS | South Sudan | 7.5 | 30 | 3 | 0 | TGN | link |
| QA | Qatar | 25.5 | 51.25 | 2 | 1 | TGN | link |
| MR | Mauritania | 20 | -12 | 2 | 1 | TGN | link |
| HM | Heard Island and McDonald Islands | -53 | 73 | 0 | 0 | TGN | link |
| IN | Bassas da India | -21.4167 | 39.7 | 0 | 0 | TGN | link |
| CK | Cook Islands | -16.083 | -161.583 | 0 | 0 | TGN | link |
| RS | Serbia | 44.166 | 20.833 | 0 | 0 | TGN | link |

TGN source

other centroid sources already imported into the geocoder

https://github.com/gbif/geocode/blob/e1609c922f840939d9ccecf0ce8b1ef9a473f019/database/geolocate_centroids.sql

https://github.com/gbif/geocode/blob/e1609c922f840939d9ccecf0ce8b1ef9a473f019/database/coordinatecleaner_centroids.sql

ArthurChapman commented 1 year ago

Further to my earlier post, and comments by @jhnwllr above - of the five centroids for Australia, there are two common ones that have been used for specimens and observations in the past.

The first is the Lambert Gravitational Centre (see reference at https://www.atlasobscura.com/places/lambert-centre-of-australia).

The second is Johnston's Geodetic Centre (see reference at https://www.xnatmap.org/adnm/docs/2013/1965%20JGS2.htm which also discusses how this and other "centres" were calculated). This latter paper shows the complexities in determining country/continental centroids.

For those interested - there is a paper here on the five Australian Centroids plus centroids for each of the Australian States and Territories (https://www.ga.gov.au/scientific-topics/national-location-information/dimensions/centre-of-australia-states-territories)

jhnwllr commented 1 year ago

I decided that collecting these centroids needed more organization, so I made a repo to aggregate different centroid sources into one source. https://github.com/jhnwllr/catalogue-of-centroids

@MattBlissett

ArthurChapman commented 1 year ago

Great job @jhnwllr. Wouldn't it be great to have the detailed methodology for each of the centroids? I know, from looking at Australia's, that detailed methodologies are very difficult to find. Looking at the Australian ones, it is interesting that you found 9. I guess some of the more southern ones include Tasmania, whereas many of the others are for mainland Australia. Good job.

MattBlissett commented 1 year ago

To implement this, I think we should have a table of all reasonable centroids (Lambert's or Johnston's or geolocate or TGN or anyone else's method) for countries and country-like things (Australia with and without Tasmania, the UK with and without Shetland, USA with and without Alaska and Hawaii etc). (Some are already on the debug map -- NB some layers may crash a browser, but the centroid layer is fine.)

During interpretation, either for all records, or records without an uncertainty, or specimen records without an uncertainty, we can calculate the distance to the nearest centroid in metres and store the number, at least if it's below some maximum distance. Is 5km a reasonable cut-off? (The cut-off has implications for interpretation speed.)

The API can then allow filtering for distanceFromCentroid > X, where X could be any value, but the portal UI can have preset values if we like.
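The calculation described above can be sketched as follows, assuming a spherical Earth and a haversine distance; the thread does not say which distance formula the interpretation pipeline uses. The function names and the two-entry centroid list are illustrative only. The coordinates are two Australian centroids quoted earlier in this issue.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def distance_from_centroid(lat, lon, centroids, cutoff_m=5000):
    """Distance to the nearest centroid, or None if beyond the cut-off."""
    nearest = min(haversine_m(lat, lon, c_lat, c_lon)
                  for c_lat, c_lon in centroids)
    return nearest if nearest <= cutoff_m else None


# Two Australian centroids mentioned in this thread.
CENTROIDS = [(-25.274, 133.775), (-25.0, 135.0)]

print(distance_from_centroid(-25.264, 133.775, CENTROIDS))  # about 1.1 km away
print(distance_from_centroid(-33.87, 151.21, CENTROIDS))    # far away: None
```

Storing `None` beyond the cut-off mirrors the idea of only keeping the number when it is below some maximum distance. A linear scan over all centroids is fine for a sketch, but a spatial index would be the more likely choice at interpretation scale.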

ArthurChapman commented 1 year ago

@MattBlissett - I agree such a table is a good idea. Thinking about why we want this - does anyone use a centroid for the "USA" for collections where the centroid used includes Hawaii? I am not sure that they do. Australia + Tasmania - maybe - but I don't think it is common - and then you also add distant outliers (Macquarie Island, Christmas Island, Norfolk Island, etc.). I don't think anyone uses a centroid for recording "Australia" that would include any of those. Putting them in a table might encourage people to use them, and I don't think that is a good idea. [It may be a fun exercise for a geographer - but that is not our motivation]. If we get too politically correct - what about France and all their Pacific Island "territories"? I think that trying to include outlying islands for all countries could be a minefield - both politically and otherwise (South China Sea). I would avoid those.

Perhaps the only way to determine what should and should not be included is to look at what people have used - for example, looking at all collections that say "Australia", "Nova Hollandia" etc., seeing what has been used and including those - ignoring the many other options that no one has actually used for biological collections. Note that there are countries whose boundaries, and thus extent and centroid, have changed over time, so the centroid will vary with the year of georeferencing.

I guess the second use is that we would want to encourage people who are retrospectively georeferencing, and wish to use a centroid, to use a consistent centroid - i.e. that we provide guidance - e.g. if we have several for Australia, one may be asterisked as the recommended centroid.

jhnwllr commented 1 year ago

@MattBlissett

For now I think we should use only centroids from "countries".

type == PCLI (places with an iso-code) in this file https://github.com/jhnwllr/catalogue-of-centroids/blob/master/centroids.tsv

I believe that 5km is a great cutoff.
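A small sketch of the country-only selection suggested above. The only thing taken from this comment is the `type == PCLI` condition; the other column names (`iso2`, `lat`, `lon`, `source`) and the in-memory sample rows are assumptions for illustration, and the real centroids.tsv may be laid out differently.

```python
import csv
import io

# Hypothetical sample in the assumed layout; the ADM1 row is made up.
SAMPLE_TSV = (
    "iso2\ttype\tlat\tlon\tsource\n"
    "AU\tPCLI\t-25\t135\tTGN\n"
    "MX\tPCLI\t23\t-102\tTGN\n"
    "AU\tADM1\t-31.0\t147.0\texample\n"
)


def country_centroids(tsv_text):
    """Keep only rows whose type is PCLI and return (iso2, lat, lon)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["iso2"], float(row["lat"]), float(row["lon"]))
        for row in reader
        if row["type"] == "PCLI"
    ]


print(country_centroids(SAMPLE_TSV))
```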

MattBlissett commented 1 year ago

Does distanceFromCentroidInMeters (in the GBIF term namespace) seem reasonable for this?

MattBlissett commented 1 year ago

@jhnwllr and others, would you recommend calculating this value for all occurrences, or a subset (e.g. exclude observations)?

jhnwllr commented 1 year ago

I would say that we have to calculate distanceFromCentroidInMeters for everything even if "true" centroids are usually PRESERVED_SPECIMEN.

ArthurChapman commented 1 year ago

I'd be careful about excluding all observations within a prescribed distanceFromCentroidInMeters of all centroids. I know that in Australia, only one or two of the 7 or so centroids have ever been used as a default for collections/observations. This would mean excluding many good records that have little or nothing to do with the centroid other than coincidence of location. Also, uncritically excluding records at centroids is a problem. I think you would have to ignore records that already have a smallish dwc:coordinateUncertaintyInMeters, because, in my own case for example, I have deliberately collected at the centroid locations and those records will have an uncertainty of less than 100 meters or so. Perhaps excluding records from centroids should also take into account the locality or verbatimLocality. If it says "Nova Hollandia" or "Australia" and nothing else, then the centroid will likely be an artificial location, but if it says "near Lambert's Centre of Australia" then the location is likely to be a good location and should not be excluded.
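The checks suggested above could be combined into a heuristic along these lines. It is a sketch only: the function name, the 100 m threshold, the use of a plain dict keyed by term names, and the short list of country-only locality strings are all assumptions drawn from the examples in this comment, not an agreed rule.

```python
# Locality strings that name only the country (illustrative, not exhaustive).
COUNTRY_ONLY_LOCALITIES = {"australia", "nova hollandia"}


def looks_like_default_centroid(record, small_uncertainty_m=100):
    """Guess whether a near-centroid record is a default placement."""
    if record.get("distanceFromCentroidInMeters") is None:
        return False  # not near any known centroid

    uncertainty = record.get("coordinateUncertaintyInMeters")
    if uncertainty is not None and uncertainty <= small_uncertainty_m:
        return False  # precise record deliberately collected at the spot

    locality = (record.get("locality") or "").strip().lower()
    # Only a bare country name suggests an artificial location.
    return locality in COUNTRY_ONLY_LOCALITIES


print(looks_like_default_centroid(
    {"distanceFromCentroidInMeters": 0.0, "locality": "Nova Hollandia"}))
print(looks_like_default_centroid(
    {"distanceFromCentroidInMeters": 12.0,
     "locality": "near Lambert's Centre of Australia"}))
```

Records with an empty locality are left unflagged here, which is a conservative choice; whether they should be treated as defaults is not settled in this discussion.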

jhnwllr commented 1 year ago

@ArthurChapman Right now the centroids are merely being tagged in our backend system, and there isn't a final decision on exclusion or inclusion of centroids in downloads, maps etc.

My preference would be for centroids to be treated as neutrally as possible as "interesting locations", rather than immediately assume there is a problem. The fact that a point lies near a centroid should be surfaced to users (and publishers) probably as a somewhat neutral data quality flag.

MattBlissett commented 1 year ago

@MortenHofft, the API is now deployed, so the new filter can be added to the portal.

https://api.gbif.org/v1/occurrence/search?distance_from_centroid_in_meters=*,1000&basis_of_record=PRESERVED_SPECIMEN (query for searching for data at centroids)

https://api.gbif.org/v1/occurrence/search?distance_from_centroid_in_meters=2000,*&basis_of_record=PRESERVED_SPECIMEN (more common for searching for data not at centroids)

An exact distance match on a floating point number probably isn't much use, so I suggest we default to an open range (a minimum) like the 2000,* above.
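For scripting, the two query URLs above can be assembled with the Python standard library. This is a small sketch that only builds the URL and does not call the API; the helper name `centroid_query` is made up, while the parameter names and the `min,max` range syntax with `*` for an open end are copied from the examples above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def centroid_query(min_m=None, max_m=None,
                   basis_of_record="PRESERVED_SPECIMEN"):
    """Build a search URL with an open or closed centroid-distance range."""
    lower = "*" if min_m is None else str(min_m)
    upper = "*" if max_m is None else str(max_m)
    params = {
        "distance_from_centroid_in_meters": f"{lower},{upper}",
        "basis_of_record": basis_of_record,
    }
    # Keep '*' and ',' unescaped so the range reads as in the examples.
    return f"{BASE_URL}?{urlencode(params, safe='*,')}"


print(centroid_query(min_m=2000))  # records not at centroids
print(centroid_query(max_m=1000))  # records at or near centroids
```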

peggynewman commented 1 year ago

This is a great discussion. I'm interested in whether we can provide guidance to data publishers on how to represent these kinds of geocoding so that it's quite obvious to users how the location has been determined, for example in the dataGeneralizations field - or does coordinateUncertaintyInMeters give us enough? I've had an issue pop up with iNaturalist lately where an obscured platypus record (random placement within a 0.2 x 0.2 degree grid cell) ended up in a dam that doesn't have platypus, and it took a bit of investigation to work out that it had been obscured. The record states 28km uncertainty, but the fact that it's a random placement distinguishes it from a centroid.

jhnwllr commented 5 months ago

@peggynewman I think coordinateUncertaintyInMeters and footprintWKT are probably the best fields for publishers to use. The fact that it is a centroid is sort of secondary to the uncertainty problem. I have seen publishers using various fields to indicate that a point is a centroid.

A free text search for centroid gives many examples: https://www.gbif.org/occurrence/search?q=centroid

For example, this record says that it is a "centroid of Sweden" but gives an uncertainty of 50m :( https://www.gbif.org/occurrence/3431110088

jhnwllr commented 5 months ago

BTW this is now implemented, so I am closing this issue. https://www.gbif.org/occurrence/map?advanced=1&distance_from_centroid_in_meters=0,0