Closed jhnwllr closed 9 months ago
@MortenHofft has raised the point that the data quality flag for publishers could be separated from the neutrally geocoded records. I agree with this perspective.
The data quality flag could have the following properties:
We probably want the action/menu to be "exclude within 2/5/10km of centroid".
@jhnwllr This is great work. However, I want to challenge a premise.
Users who want to filter for centroid locations are more interested in making sure no outliers make it into their models than in avoiding false positives. So most users would rather over-flag centroids.
Scientists who do not want inappropriate data in their models need to verify those data before using them. Without an explicit measure of uncertainty they can't do that, centroid or not, outlier or not. That doesn't mean there is anything wrong with your attempts to help highlight records that require review before use, but there is a MUCH simpler and all-encompassing test for that: does the location have either a geospatial footprint (an actual geometry in the data) or an uncertainty for the coordinates (as a distance)? Every record that gets used in modeling should have one of those. If the records were properly georeferenced, you wouldn't have to worry about geocoding or flagging, because you would already have it in coordinateUncertaintyInMeters. Another focus of effort could be to provide the coordinates and uncertainty for records that do not have coordinates but do have unambiguous administrative geography, to the highest specificity you can.
Independently, keep in mind that there are also many different kinds of centroids (e.g., Australia has at least five), and your proposal covers just one of those.
There are a number of issues with (country) centroids not discussed above. I agree largely with what @tucotuco has said.
However, take Australia. Many early records for Australia just say "Nova Hollandia" or "Australia", and a default has been placed at the "center" of Australia [but see below]. When these observations were made, the only European discovery was around the Australian coast, and definitely not in the desert in central Australia. A better representation would be a footprint derived by buffering the coast (or in many cases, just the east coast around Sydney/Botany Bay, etc.). Instead, one ends up with a centroid record in the Australian desert for an observation from the wet coast.
Just to illustrate this point, there is a record of a marine starfish on GBIF at -25.274, 133.775 (middle of Australia's desert). That point is itself an interesting centroid: I think it comes from the CIA database, and I have not worked out its scientific basis.
The determination of the centroid of a country can be carried out using a number of methods - none are wrong, just determined using different methods. In the case of Australia, this can result in differences of hundreds of kilometers.
"Nova Hollandia" or "Australia" on ALA
-25.274, 133.775
(24 records on ALA within 1 km)
Centre of Gravity Method
23° 07'S, 132° 08'E (-23.1166667, 132.1333333)
(0 records on ALA within 1 km)
Lambert Gravitational Centre
25° 36' 36.4"S, 134° 21' 17.3"E (-25.610111, 134.3548056)
(Lots of records within 1 km. It also comes up on the map on ALA about 300 m from where Geoscience Australia places it! Datum problem?)
Furthest Point from Coastline
23° 02'S, 132° 10'E (-23.033333, 132.1666667)
(0 records on ALA within 1 km)
Geodetic Median Point
23° 33' 09.89"S, 133° 23' 46.00"E (-23.5527472, 133.396111)
(8 records on ALA within 1 km)
Johnston Geodetic Station
25° 56' 49.3"S, 133° 12' 34.7"E (-25.9470278, 133.2096389)
(lots of records on ALA within 1 km)
Note that a lot of records from both Lambert's Centre and Johnston's Geodetic Centre are actual observations from those places (I have made recent observations there myself with a determined uncertainty), i.e. high-precision records. Many, however, are defaults for "Australia".
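To make the size of these differences concrete, here is a minimal sketch that converts two of the positions listed above from degrees/minutes/seconds to decimal degrees and measures their separation. The function names are hypothetical, and the distance uses a simple spherical (haversine) approximation, so it ignores datum and ellipsoid effects.

```python
import math


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in the usual decimal convention.
    return -value if hemisphere in ("S", "W") else value


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance on a sphere (an approximation, not a geodesic)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Lambert Gravitational Centre and Furthest Point from Coastline, as listed above.
lambert = (dms_to_decimal(25, 36, 36.4, "S"), dms_to_decimal(134, 21, 17.3, "E"))
furthest = (dms_to_decimal(23, 2, 0, "S"), dms_to_decimal(132, 10, 0, "E"))

separation_km = haversine_km(*lambert, *furthest)
print(f"Separation: {separation_km:.0f} km")
```

Two equally defensible "centres" of the same country land several hundred kilometres apart, so a single centroid per country will miss many defaulted records.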
So, in summary, there are several issues.
What is "Australia"?
How was the center (centroid) determined?
There are similar issues with the Australian States and determining the centroids there.
Scientists who do not want inappropriate data in their models need to verify those data before using them. Without the explicit measure of uncertainty they can't do that, centroid or not, outlier or not
Records without an uncertainty stated: 1,373,034,036 (out of 2,104,374,546 with coordinates)
A flag that the record is missing uncertainty might be useful in that case since ≈2/3 are missing it.
I think it would be useful, but also make sure not to flag it if there is a footprintWKT provided that is not just a POINT.
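A minimal sketch of that check, assuming each record is a plain dict keyed by Darwin Core term names. The function name `missing_uncertainty` is hypothetical, and the WKT test is a simple prefix check rather than a real geometry parse.

```python
def missing_uncertainty(record):
    """Return True if the record has neither a usable uncertainty nor a non-point footprint."""
    # A positive coordinateUncertaintyInMeters counts as a stated uncertainty.
    raw = record.get("coordinateUncertaintyInMeters")
    try:
        has_uncertainty = raw is not None and float(raw) > 0
    except (TypeError, ValueError):
        has_uncertainty = False

    # A footprint counts only if it is more than a bare POINT.
    wkt = (record.get("footprintWKT") or "").strip().upper()
    has_footprint = bool(wkt) and not wkt.startswith("POINT")

    return not (has_uncertainty or has_footprint)


records = [
    {"decimalLatitude": -25.0, "decimalLongitude": 135.0},
    {"coordinateUncertaintyInMeters": "50"},
    {"footprintWKT": "POINT (135 -25)"},
    {"footprintWKT": "POLYGON ((134 -26, 136 -26, 136 -24, 134 -24, 134 -26))"},
]
flags = [missing_uncertainty(r) for r in records]
print(flags)
```

The first and third records would be flagged; the second has an uncertainty and the fourth has a real footprint geometry.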
I have taken the time to extract centroids for iso2 places from the Getty Thesaurus of Geographic Names (TGN). I have pasted them all here for reference.
The TGN is a source of many centroids on GBIF.
Interestingly, we are still missing the centroids that @ArthurChapman points to for Australia. Extracting these centroids is already a bit dodgy, so source="Arthur Chapman" might be a good solution for AU centroids.
The point about dwc:coordinateUncertaintyInMeters that @tucotuco raises is totally correct. But I think we can still geocode centroids while also encouraging publishers to fill in this very important field.
iso2 | tgn_name | lat | lon | n_specimen | n_hobservation | source | centroid_2km |
---|---|---|---|---|---|---|---|
MX | Mexico | 23 | -102 | 18648 | 16 | TGN | link%20) |
PH | Philippines | 13 | 122 | 7604 | 6 | TGN | link%20) |
SG | Singapore | 1.3667 | 103.8 | 7026 | 35108 | TGN | link%20) |
MU | Mauritius | -20.3 | 57.5833 | 5947 | 28 | TGN | link%20) |
NF | Norfolk Island | -29.033333 | 167.95 | 5553 | 6872 | TGN | link%20) |
SE | Sweden | 62 | 15 | 5000 | 255 | TGN | link%20) |
CU | Cuba | 21.5 | -80 | 4748 | 0 | TGN | link%20) |
CH | Switzerland | 47 | 8 | 4743 | 23677 | TGN | link%20) |
FR | France | 46 | 2 | 4573 | 1984 | TGN | link%20) |
JP | Japan | 36 | 138 | 4313 | 81 | TGN | link%20) |
GY | Guyana | 5 | -59 | 4180 | 0 | TGN | link%20) |
DE | Germany | 51.5 | 10.5 | 3769 | 18 | TGN | link%20) |
FJ | Fiji | -18 | 178 | 3162 | 4 | TGN | link%20) |
NC | New Caledonia | -21.5 | 165.5 | 3101 | 9 | TGN | link%20) |
BR | Brazil | -10 | -55 | 2929 | 2 | TGN | link%20) |
AU | Australia | -25 | 135 | 2874 | 9 | TGN | link%20) |
BM | Bermuda | 32.3333 | -64.75 | 2676 | 9249 | TGN | link%20) |
CX | Christmas Island | -10.5 | 105.6667 | 2645 | 226 | TGN | link%20) |
DK | Denmark | 56 | 10 | 2633 | 4452 | TGN | link%20) |
CD | Democratic Republic of the Congo | -.0167 | 25 | 2585 | 0 | TGN | link%20) |
CR | Costa Rica | 10 | -84 | 2495 | 1154 | TGN | link%20) |
US | United States | 38 | -98 | 2370 | 3714 | TGN | link%20) |
BS | Bahamas | 24 | -76 | 2276 | 0 | TGN | link%20) |
JM | Jamaica | 18.25 | -77.5 | 2206 | 24 | TGN | link%20) |
BN | Brunei Darussalam | 4.5 | 114.6667 | 2130 | 0 | TGN | link%20) |
PA | Panama | 9 | -80 | 2037 | 0 | TGN | link%20) |
MG | Madagascar | -20 | 47 | 2025 | 0 | TGN | link%20) |
LU | Luxembourg | 49.75 | 6.1667 | 2019 | 4737 | TGN | link%20) |
IT | Italy | 42.8333 | 12.8333 | 2005 | 1 | TGN | link%20) |
CN | China | 35 | 105 | 1978 | 0 | TGN | link%20) |
FO | Faeroe Islands | 62 | -7 | 1939 | 81 | TGN | link%20) |
AW | Aruba | 12.5 | -69.9667 | 1837 | 3045 | TGN | link%20) |
CW | Curacao | 12.166 | -69 | 1799 | 11517 | TGN | link%20) |
NZ | New Zealand | -42 | 174 | 1685 | 910 | TGN | link%20) |
PG | Papua New Guinea | -6 | 147 | 1636 | 0 | TGN | link%20) |
HU | Hungary | 47 | 20 | 1623 | 123 | TGN | link%20) |
LK | Sri Lanka | 7 | 81 | 1620 | 7 | TGN | link%20) |
ZA | South Africa | -30 | 26 | 1579 | 0 | TGN | link%20) |
BB | Barbados | 13.1667 | -59.5333 | 1559 | 1598 | TGN | link%20) |
SB | Solomon Islands | -8 | 159 | 1500 | 0 | TGN | link%20) |
TN | Tunisia | 34 | 9 | 1397 | 4 | TGN | link%20) |
DO | Dominican Republic | 19 | -70.6667 | 1338 | 0 | TGN | link%20) |
CA | Canada | 60 | -96 | 1326 | 0 | TGN | link%20) |
AD | Andorra | 42.55 | 1.583 | 1308 | 3238 | TGN | link%20) |
GT | Guatemala | 15.5 | -90.25 | 1287 | 80 | TGN | link%20) |
MQ | Martinique | 14.6667 | -61 | 1284 | 103 | TGN | link%20) |
EC | Ecuador | -2 | -77.5 | 1256 | 0 | TGN | link%20) |
AT | Austria | 47.3333 | 13.3333 | 1206 | 324 | TGN | link%20) |
CZ | Czech Republic | 49.75 | 15 | 1206 | 38 | TGN | link%20) |
GP | Guadeloupe | 16.25 | -61.5833 | 1203 | 3547 | TGN | link%20) |
BZ | Belize | 17.25 | -88.75 | 1201 | 18302 | TGN | link%20) |
IN | India | 20 | 77 | 1190 | 2 | TGN | link%20) |
MS | Montserrat | 16.75 | -62.2 | 1141 | 3798 | TGN | link%20) |
TZ | Tanzania | -6 | 35 | 1138 | 663 | TGN | link%20) |
KE | Kenya | 1 | 38 | 1132 | 11 | TGN | link%20) |
FI | Finland | 64 | 26 | 1092 | 525 | TGN | link%20) |
RU | Russia | 60 | 47 | 1070 | 0 | TGN | link%20) |
HK | Hong Kong | 22.25 | 114.1667 | 1066 | 6839 | TGN | link%20) |
BE | Belgium | 50.8333 | 4 | 1009 | 12011 | TGN | link%20) |
SC | Seychelles | -4.5833 | 55.6667 | 977 | 4 | TGN | link%20) |
CO | Colombia | 4 | -72 | 963 | 2 | TGN | link%20) |
ID | Indonesia | -5 | 120 | 952 | 0 | TGN | link%20) |
RE | Reunion | -21.1 | 55.6 | 927 | 2401 | TGN | link%20) |
CL | Chile | -30 | -71 | 890 | 4852 | TGN | link%20) |
GL | Greenland | 72 | -40 | 865 | 0 | TGN | link%20) |
PE | Peru | -10 | -76 | 857 | 3 | TGN | link%20) |
TT | Trinidad and Tobago | 11 | -61 | 850 | 0 | TGN | link%20) |
SN | Senegal | 14 | -14 | 806 | 12 | TGN | link%20) |
IS | Iceland | 65 | -18 | 772 | 189 | TGN | link%20) |
HT | Haiti | 19 | -72.4167 | 761 | 0 | TGN | link%20) |
GD | Grenada | 12.1167 | -61.6667 | 757 | 701 | TGN | link%20) |
ES | Spain | 40 | -4 | 753 | 184 | TGN | link%20) |
TH | Thailand | 15 | 100 | 751 | 23 | TGN | link%20) |
KN | Saint Kitts and Nevis | 17.3333 | -62.75 | 743 | 400 | TGN | link%20) |
MT | Malta | 35.9167 | 14.4167 | 731 | 1603 | TGN | link%20) |
PY | Paraguay | -23 | -58 | 721 | 0 | TGN | link%20) |
PN | Pitcairn Islands | -25.0667 | -130.1 | 721 | 105 | TGN | link%20) |
EE | Estonia | 59 | 26 | 720 | 1147 | TGN | link%20) |
MC | Monaco | 43.7333 | 7.4167 | 713 | 4454 | TGN | link%20) |
JE | Jersey | 49.2167 | -2.1167 | 696 | 2053 | TGN | link%20) |
DZ | Algeria | 28 | 3 | 674 | 0 | TGN | link%20) |
NO | Norway | 62 | 10 | 620 | 535 | TGN | link%20) |
GR | Greece | 39 | 22 | 588 | 19 | TGN | link%20) |
RO | Romania | 46 | 25 | 586 | 136 | TGN | link%20) |
MN | Mongolia | 46 | 105 | 576 | 1 | TGN | link%20) |
SR | Suriname | 4 | -56 | 566 | 0 | TGN | link%20) |
DM | Dominica | 15.5 | -61.3333 | 558 | 3 | TGN | link%20) |
TW | Taiwan | 24 | 121 | 547 | 20602 | TGN | link%20) |
CV | Cape Verde | 16 | -24 | 542 | 0 | TGN | link%20) |
CM | Cameroon | 6 | 12 | 539 | 0 | TGN | link%20) |
VE | Venezuela | 8 | -66 | 523 | 0 | TGN | link%20) |
SK | Slovakia | 48.6667 | 19.5 | 507 | 3 | TGN | link%20) |
EG | Egypt | 27 | 30 | 499 | 0 | TGN | link%20) |
LI | Liechtenstein | 47.1667 | 9.5333 | 496 | 8749 | TGN | link%20) |
GI | Gibraltar | 36.1333 | -5.35 | 469 | 8711 | TGN | link%20) |
BL | Saint Barthélemy | 17.9 | -62.833 | 459 | 5430 | TGN | link%20) |
AG | Antigua and Barbuda | 17.05 | -61.8 | 444 | 138 | TGN | link%20) |
CI | Côte d'Ivoire | 8 | -5 | 423 | 0 | TGN | link%20) |
MP | Northern Mariana Islands | 15.213 | 145.755 | 422 | 1290 | TGN | link%20) |
SH | Saint Helena, Ascension and Tristan da Cunha | -15.95 | -5.7 | 395 | 82 | TGN | link%20) |
PT | Portugal | 39.5 | -8 | 391 | 2132 | TGN | link%20) |
YT | Mayotte | -12.8333 | 45.1667 | 386 | 309 | TGN | link%20) |
PL | Poland | 52 | 20 | 380 | 242 | TGN | link%20) |
MH | Marshall Islands | 10 | 167 | 380 | 0 | TGN | link%20) |
AM | Armenia | 40 | 45 | 373 | 9 | TGN | link%20) |
BO | Bolivia | -17 | -65 | 367 | 1 | TGN | link%20) |
AI | Anguilla | 18.2167 | -63.05 | 361 | 1668 | TGN | link%20) |
IL | Israel | 31.5 | 34.75 | 359 | 125 | TGN | link%20) |
SY | Syria | 35 | 38 | 359 | 0 | TGN | link%20) |
BW | Botswana | -22 | 24 | 348 | 0 | TGN | link%20) |
MW | Malawi | -13.5 | 34 | 346 | 0 | TGN | link%20) |
KR | South Korea | 37 | 127.5 | 341 | 90 | TGN | link%20) |
AR | Argentina | -34 | -64 | 336 | 10 | TGN | link%20) |
MZ | Mozambique | -18.25 | 35 | 330 | 0 | TGN | link%20) |
LC | Saint Lucia | 13.8833 | -60.9667 | 330 | 190 | TGN | link%20) |
MM | Myanmar | 22 | 98 | 326 | 0 | TGN | link%20) |
ZW | Zimbabwe | -19 | 29 | 288 | 0 | TGN | link%20) |
SL | Sierra Leone | 8.5 | -11.5 | 286 | 0 | TGN | link%20) |
VN | Viet Nam | 16 | 106 | 280 | 0 | TGN | link%20) |
LR | Liberia | 6.5 | -9.5 | 278 | 0 | TGN | link%20) |
GU | Guam | 13.4667 | 144.8333 | 268 | 385 | TGN | link%20) |
KM | Comoros | -12.1667 | 44.25 | 255 | 1 | TGN | link%20) |
UG | Uganda | 2 | 33 | 253 | 1 | TGN | link%20) |
KY | Cayman Islands | 19.5 | -80.6667 | 245 | 0 | TGN | link%20) |
VC | Saint Vincent and the Grenadines | 13.0833 | -61.2 | 243 | 10 | TGN | link%20) |
NR | Nauru | -.5333 | 166.9167 | 243 | 1003 | TGN | link%20) |
IR | Iran | 32 | 53 | 240 | 53 | TGN | link%20) |
GA | Gabon | -1 | 11.75 | 225 | 2 | TGN | link%20) |
GE | Georgia | 42 | 43.5 | 223 | 32 | TGN | link%20) |
UY | Uruguay | -33 | -56 | 222 | 0 | TGN | link%20) |
NL | Netherlands | 52.5 | 5.75 | 219 | 1808 | TGN | link%20) |
MY | Malaysia | 2.5 | 112.5 | 217 | 0 | TGN | link%20) |
VU | Vanuatu | -16 | 167 | 212 | 1 | TGN | link%20) |
VG | British Virgin Islands | 18.5 | -64.5 | 212 | 1 | TGN | link%20) |
TR | Turkey | 39 | 35 | 210 | 7 | TGN | link%20) |
NG | Nigeria | 10 | 8 | 210 | 0 | TGN | link%20) |
NI | Nicaragua | 13 | -85 | 205 | 0 | TGN | link%20) |
PW | Palau | 6 | 134 | 200 | 13 | TGN | link%20) |
NP | Nepal | 28 | 84 | 198 | 0 | TGN | link%20) |
BJ | Benin | 9.5 | 2.25 | 197 | 33 | TGN | link%20) |
PK | Pakistan | 30 | 70 | 192 | 0 | TGN | link%20) |
BD | Bangladesh | 24 | 90 | 191 | 2 | TGN | link%20) |
HR | Croatia | 45.1667 | 15.5 | 191 | 4 | TGN | link%20) |
HN | Honduras | 15 | -86.5 | 188 | 0 | TGN | link%20) |
NA | Namibia | -22 | 17 | 187 | 7 | TGN | link%20) |
YE | Yemen | 15.5 | 47.5 | 174 | 0 | TGN | link%20) |
WF | Wallis and Futuna Islands | -13.3 | -176.2 | 168 | 84 | TGN | link%20) |
MA | Morocco | 32 | -5 | 167 | 8 | TGN | link%20) |
GH | Ghana | 8 | -2 | 163 | 6 | TGN | link%20) |
SA | Saudi Arabia | 25 | 45 | 162 | 0 | TGN | link%20) |
AO | Angola | -12.5 | 18.5 | 154 | 0 | TGN | link%20) |
TO | Tonga | -20 | -175 | 150 | 19 | TGN | link%20) |
CY | Cyprus | 35 | 33 | 133 | 124 | TGN | link%20) |
KH | Cambodia | 13 | 105 | 127 | 0 | TGN | link%20) |
SD | Sudan | 16 | 30 | 126 | 0 | TGN | link%20) |
ET | Ethiopia | 8 | 39 | 122 | 0 | TGN | link%20) |
VA | Holy See | 41.903 | 12.453 | 119 | 9301 | TGN | link%20) |
GF | French Guiana | 4 | -53 | 116 | 365 | TGN | link%20) |
IQ | Iraq | 33 | 44 | 115 | 0 | TGN | link%20) |
NU | Niue | -19.0333 | -169.8667 | 114 | 11 | TGN | link%20) |
MO | Macau | 22.1667 | 113.55 | 106 | 2072 | TGN | link%20) |
KP | North Korea | 40 | 127 | 105 | 0 | TGN | link%20) |
PF | French Polynesia | -15 | -140 | 105 | 0 | TGN | link%20) |
GM | The Gambia | 13.5 | -15.5 | 104 | 45 | TGN | link%20) |
OM | Oman | 21 | 57 | 102 | 0 | TGN | link%20) |
UA | Ukraine | 49 | 32 | 102 | 0 | TGN | link%20) |
IE | Ireland | 53 | -8 | 94 | 158 | TGN | link%20) |
GN | Guinea | 11 | -10 | 87 | 0 | TGN | link%20) |
ZM | Zambia | -15 | 30 | 85 | 0 | TGN | link%20) |
FK | Falkland Islands | -51.75 | -59 | 83 | 0 | TGN | link%20) |
MV | Maldives | 3.2 | 73 | 77 | 0 | TGN | link%20) |
BA | Bosnia and Herzegovina | 44.25 | 17.8333 | 73 | 0 | TGN | link%20) |
TC | Turks and Caicos Islands | 21.7333 | -71.5833 | 69 | 0 | TGN | link%20) |
KW | Kuwait | 29.5 | 47.75 | 65 | 0 | TGN | link%20) |
SZ | Swaziland | -26.5 | 31.5 | 65 | 2 | TGN | link%20) |
SV | El Salvador | 13.8333 | -88.9167 | 65 | 1 | TGN | link%20) |
SM | San Marino | 43.9333 | 12.4167 | 64 | 109 | TGN | link%20) |
BT | Bhutan | 27.5 | 90.5 | 64 | 4922 | TGN | link%20) |
LB | Lebanon | 33.8333 | 35.8333 | 60 | 3 | TGN | link%20) |
BV | Bouvet Island | -54.4333 | 3.4 | 57 | 17 | TGN | link%20) |
PS | Gaza Strip | 31.4167 | 34.3333 | 56 | 75 | TGN | link%20) |
AF | Afghanistan | 33 | 65 | 55 | 0 | TGN | link%20) |
WS | Samoa | -13.8 | -172.133333 | 53 | 35 | TGN | link%20) |
LV | Latvia | 57 | 25 | 53 | 6 | TGN | link%20) |
GW | Guinea-Bissau | 12 | -15 | 52 | 22 | TGN | link%20) |
LA | Laos | 18 | 105 | 49 | 0 | TGN | link%20) |
GQ | Equatorial Guinea | 2 | 10 | 49 | 0 | TGN | link%20) |
DJ | Djibouti | 11.5 | 42.5 | 42 | 0 | TGN | link%20) |
CF | Central African Republic | 7 | 21 | 41 | 0 | TGN | link%20) |
CG | Congo | -1 | 15 | 41 | 0 | TGN | link%20) |
KI | Kiribati | -5 | -170 | 40 | 0 | TGN | link%20) |
NE | Niger | 16 | 8 | 38 | 1 | TGN | link%20) |
ER | Eritrea | 15 | 39 | 38 | 0 | TGN | link%20) |
AL | Albania | 41 | 20 | 38 | 4 | TGN | link%20) |
PS | State of Palestine | 31.92157 | 35.20329 | 38 | 640 | TGN | link%20) |
SO | Somalia | 6 | 48 | 35 | 0 | TGN | link%20) |
PM | Saint Pierre and Miquelon | 46.8333 | -56.3333 | 31 | 2 | TGN | link%20) |
IM | Isle of Man | 54.25 | -4.5 | 29 | 3438 | TGN | link%20) |
UZ | Uzbekistan | 41 | 64 | 29 | 0 | TGN | link%20) |
TJ | Tajikistan | 39 | 71 | 28 | 0 | TGN | link%20) |
MK | North Macedonia | 41.666 | 21.75 | 26 | 61 | TGN | link%20) |
KZ | Kazakhstan | 48 | 68 | 24 | 0 | TGN | link%20) |
LY | Libya | 25 | 17 | 22 | 0 | TGN | link%20) |
RW | Rwanda | -2 | 30 | 21 | 26 | TGN | link%20) |
TM | Turkmenistan | 40 | 60 | 21 | 0 | TGN | link%20) |
ML | Mali | 17 | -4 | 20 | 1 | TGN | link%20) |
CC | Cocos Islands | -12 | 96.8333 | 17 | 4 | TGN | link%20) |
LT | Lithuania | 56 | 24 | 17 | 3 | TGN | link%20) |
KG | Kyrgyzstan | 41 | 75 | 17 | 0 | TGN | link%20) |
BF | Burkina Faso | 13 | -2 | 16 | 0 | TGN | link%20) |
SJ | Svalbard | 78 | 20 | 16 | 404 | TGN | link%20) |
ME | Montenegro | 42.5 | 19.3333 | 16 | 23 | TGN | link%20) |
ST | Sao Tome and Principe | 1 | 7 | 15 | 0 | TGN | link%20) |
TL | Timor-Leste | -8.5833 | 126 | 14 | 1 | TGN | link%20) |
AZ | Azerbaijan | 40.5 | 47.5 | 14 | 0 | TGN | link%20) |
SI | Slovenia | 46.083 | 15 | 14 | 15 | TGN | link%20) |
JO | Jordan | 31 | 36 | 13 | 1 | TGN | link%20) |
FM | Federated States of Micronesia | 5 | 152 | 12 | 0 | TGN | link%20) |
AS | American Samoa | -14.3167 | -170.5 | 12 | 0 | TGN | link%20) |
TG | Togo | 8 | 1.1667 | 12 | 0 | TGN | link%20) |
GS | South Georgia and South Sandwich Islands | -56 | -33 | 11 | 0 | TGN | link%20) |
TV | Tuvalu | -8 | 178 | 10 | 0 | TGN | link%20) |
LS | Lesotho | -29.5 | 28.25 | 10 | 0 | TGN | link%20) |
PS | West Bank | 32 | 35.25 | 10 | 118 | TGN | link%20) |
BI | Burundi | -3.5 | 30 | 9 | 0 | TGN | link%20) |
GB | United Kingdom | 54 | -4.5 | 9 | 162 | TGN | link%20) |
TF | French Southern and Antarctic Lands | -43 | 67 | 8 | 1 | TGN | link%20) |
TD | Chad | 15 | 19 | 7 | 0 | TGN | link%20) |
BG | Bulgaria | 42.666 | 25.25 | 6 | 37 | TGN | link%20) |
BH | Bahrain | 26 | 50.5 | 6 | 267 | TGN | link%20) |
AE | United Arab Emirates | 24 | 54 | 5 | 0 | TGN | link%20) |
TK | Tokelau | -9 | -171.75 | 4 | 0 | TGN | link%20) |
MD | Moldova | 47.25 | 28.583 | 4 | 18 | TGN | link%20) |
BY | Belarus | 53 | 28 | 4 | 0 | TGN | link%20) |
GG | Guernsey | 49.5833 | -2.333 | 3 | 0 | TGN | link%20) |
IO | British Indian Ocean Territory | -7 | 72.0167 | 3 | 0 | TGN | link%20) |
SS | South Sudan | 7.5 | 30 | 3 | 0 | TGN | link%20) |
QA | Qatar | 25.5 | 51.25 | 2 | 1 | TGN | link%20) |
MR | Mauritania | 20 | -12 | 2 | 1 | TGN | link%20) |
HM | Heard Island and McDonald Islands | -53 | 73 | 0 | 0 | TGN | link%20) |
IN | Bassas da India | -21.4167 | 39.7 | 0 | 0 | TGN | link%20) |
CK | Cook Islands | -16.083 | -161.583 | 0 | 0 | TGN | link%20) |
RS | Serbia | 44.166 | 20.833 | 0 | 0 | TGN | link%20) |
https://github.com/gbif/geocode/blob/e1609c922f840939d9ccecf0ce8b1ef9a473f019/database/geolocate_centroids.sql https://github.com/gbif/geocode/blob/e1609c922f840939d9ccecf0ce8b1ef9a473f019/database/coordinatecleaner_centroids.sql
Further to my earlier post, and comments by @jhnwllr above - of the five centroids for Australia, there are two common ones that have been used for specimens and observations in the past.
The first is Lamberts Gravitational Centre (see reference at https://www.atlasobscura.com/places/lambert-centre-of-australia).
The second is Johnston's Geodetic Centre (see reference at https://www.xnatmap.org/adnm/docs/2013/1965%20JGS2.htm which also discusses how this and other "centres" were calculated). This latter paper shows the complexities in determining country/continental centroids.
For those interested - there is a paper here on the five Australian Centroids plus centroids for each of the Australian States and Territories (https://www.ga.gov.au/scientific-topics/national-location-information/dimensions/centre-of-australia-states-territories)
I decided that collecting these centroids needed more organization, so I made a repo to aggregate different centroid sources into one source. https://github.com/jhnwllr/catalogue-of-centroids
@MattBlissett
Great job @jhnwllr. It would be great to have the detailed methodology for each of the centroids. I know, from looking at Australia's, that detailed methodologies are very difficult to find. Looking at the Australian ones, it is interesting that you found 9. I guess some of the more southern ones include Tasmania, whereas many of the others are for mainland Australia. Good job.
To implement this, I think we should have a table of all reasonable centroids (Lambert's or Johnston's or geolocate or TGN or anyone else's method) for countries and country-like things (Australia with and without Tasmania, the UK with and without Shetland, USA with and without Alaska and Hawaii, etc.). (Some are already on the debug map. NB some layers may crash a browser, but the centroid layer is fine.)
During interpretation, either for all records, or records without an uncertainty, or specimen records without an uncertainty, we can calculate the distance to the nearest centroid in metres and store the number, at least if it's below some maximum distance. Is 5km a reasonable cut-off? (The cut-off has implications for interpretation speed.)
The API can then allow filtering for distanceFromCentroid > X, where X could be any value, but the portal UI can have preset values if we like.
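As a rough illustration of the interpretation step described above, here is a minimal sketch, not the actual GBIF pipeline code. The `CENTROIDS` table holds three TGN values from the table earlier in this thread, the function names are hypothetical, and distances use a spherical haversine approximation.

```python
import math

# A tiny illustrative centroid table (lat, lon), taken from the TGN table above.
CENTROIDS = {"AU": (-25.0, 135.0), "SE": (62.0, 15.0), "FR": (46.0, 2.0)}


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371008.8):
    """Great-circle distance in metres on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def distance_from_centroid_in_meters(lat, lon, centroids=CENTROIDS, max_distance_m=5000):
    """Distance to the nearest centroid, or None if it exceeds the cut-off."""
    nearest = min(haversine_m(lat, lon, clat, clon) for clat, clon in centroids.values())
    return nearest if nearest <= max_distance_m else None


on_centroid = distance_from_centroid_in_meters(62.0, 15.0)
near_centroid = distance_from_centroid_in_meters(62.01, 15.0)
far_away = distance_from_centroid_in_meters(55.0, 15.0)
print(on_centroid, near_centroid, far_away)
```

Storing nothing beyond the cut-off keeps the stored field small. A linear scan like this one is fine for a few hundred centroids; a spatial index would be the natural next step if the table grows.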
@MattBlissett - I agree such a table is a good idea. Thinking about why we want this: does anyone use a centroid for the "USA" for collections where the centroid used includes Hawaii? Not sure that they do. Australia + Tasmania, maybe, but I don't think it is common, and then you also add long outliers (Macquarie Island, Christmas Island, Norfolk Island, etc.). I don't think anyone uses a centroid for recording "Australia" that would include any of those. Putting them in a table might encourage people to use them, and I don't think that is a good idea. [It may be a fun exercise for a geographer, but that is not our motivation.] If we get too politically correct, what about France and all their Pacific Island "territories"? I think that trying to include outlying islands for all countries could be a minefield, both politically and otherwise (South China Sea). I would avoid those.
Perhaps the only way to determine what should and should not be included is to look at what people have used. For example, look at all collections that say "Australia", "Nova Hollandia", etc., see what has been used, and include those, ignoring the many other options that no one has actually used for biological collections. Note that there are countries whose boundaries, and thus extent and centroid, have changed over time, so the centroid will vary with the year of georeferencing.
I guess the second use is that we would want to encourage people who are retrospectively georeferencing and wish to use a centroid to use a consistent one, i.e. that we provide guidance. For example, if we have several for Australia, one may be asterisked as the recommended centroid.
@MattBlissett
For now I think we should use only centroids from "countries".
type == PCLI (places with an iso-code) in this file https://github.com/jhnwllr/catalogue-of-centroids/blob/master/centroids.tsv
I believe that 5km is a great cutoff.
Does distanceFromCentroidInMeters (in the GBIF term namespace) seem reasonable for this?
@jhnwllr and others, would you recommend calculating this value for all occurrences, or a subset (e.g. exclude observations)?
I would say that we have to calculate distanceFromCentroidInMeters for everything, even if "true" centroids are usually PRESERVED_SPECIMEN.
I'd be careful about excluding all observations within a prescribed distanceFromCentroidInMeters of all centroids. I know that in Australia, only one or two of the 7 or so centroids have ever been used as a default for collections/observations. This would mean excluding many good records that have little or nothing to do with the centroid other than coincidence of location. Uncritically excluding records at centroids is also a problem. I think you would have to ignore records that already have a dwc:coordinateUncertaintyInMeters that is a smallish number because, in my own case for example, I have deliberately collected at the centroid locations and those records will have an uncertainty of less than 100 meters or so. Perhaps excluding records at centroids should also take into account the locality or verbatimLocality. If it says "Nova Hollandia" or "Australia" and nothing else, then the centroid will likely be an artificial location, but if the locality says "near Lambert's Centre of Australia" then it is likely to be a good location and should not be excluded.
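A minimal sketch of the kind of rule described above, assuming records are dicts with Darwin Core style keys. The function name, the 100 m and 5 km thresholds, and the short list of bare country names are illustrative assumptions, not anything GBIF has implemented.

```python
# Localities that name only the country and nothing else (illustrative list).
BARE_COUNTRY_LOCALITIES = {"australia", "nova hollandia"}


def looks_like_default_centroid(record, max_distance_m=5000, small_uncertainty_m=100):
    """Heuristic: near a centroid, no small stated uncertainty, and a bare country locality."""
    distance = record.get("distanceFromCentroidInMeters")
    if distance is None or distance > max_distance_m:
        return False  # not near any known centroid

    uncertainty = record.get("coordinateUncertaintyInMeters")
    if uncertainty is not None and uncertainty <= small_uncertainty_m:
        return False  # a precise record that merely happens to sit on the centroid

    locality = (record.get("locality") or "").strip().lower()
    return locality in BARE_COUNTRY_LOCALITIES


defaulted = {"distanceFromCentroidInMeters": 12.0, "locality": "Nova Hollandia"}
field_visit = {
    "distanceFromCentroidInMeters": 40.0,
    "coordinateUncertaintyInMeters": 30,
    "locality": "near Lambert's Centre of Australia",
}
print(looks_like_default_centroid(defaulted), looks_like_default_centroid(field_visit))
```

The defaulted record is caught while the deliberate field visit is left alone, which is the distinction a pure distance filter cannot make.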
@ArthurChapman Right now the centroids are merely being tagged in our backend system, and there isn't a final decision on exclusion or inclusion of centroids in downloads, maps, etc.
My preference would be for centroids to be treated as neutrally as possible as "interesting locations", rather than immediately assume there is a problem. The fact that a point lies near a centroid should be surfaced to users (and publishers) probably as a somewhat neutral data quality flag.
@MortenHofft, the API is now deployed, so the new filter can be added to the portal.
https://api.gbif.org/v1/occurrence/search?distance_from_centroid_in_meters=*,1000&basis_of_record=PRESERVED_SPECIMEN (query for searching for data at centroids)
https://api.gbif.org/v1/occurrence/search?distance_from_centroid_in_meters=2000,*&basis_of_record=PRESERVED_SPECIMEN (more common for searching for data not at centroids)
An exact distance match on a floating point number probably isn't much use, so I suggest we default to an open range (a minimum) like the 2000,* above.
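For anyone scripting against the API, here is a small sketch of how such range queries could be assembled. It only builds the URL strings shown above and does not call the API; the helper names are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def centroid_range(min_m=None, max_m=None):
    """Format an open or closed range, using '*' for a missing bound."""
    lower = "*" if min_m is None else str(min_m)
    upper = "*" if max_m is None else str(max_m)
    return f"{lower},{upper}"


def centroid_search_url(min_m=None, max_m=None, basis_of_record=None):
    """Build an occurrence search URL filtered on distance from centroid."""
    params = {"distance_from_centroid_in_meters": centroid_range(min_m, max_m)}
    if basis_of_record:
        params["basis_of_record"] = basis_of_record
    # Keep '*' and ',' unescaped so the URL matches the examples above.
    return f"{BASE_URL}?{urlencode(params, safe='*,')}"


away_from_centroids = centroid_search_url(min_m=2000, basis_of_record="PRESERVED_SPECIMEN")
at_centroids = centroid_search_url(max_m=1000, basis_of_record="PRESERVED_SPECIMEN")
print(away_from_centroids)
print(at_centroids)
```

Preset portal options such as 2, 5, or 10 km would then just be different `min_m` values passed to the same helper.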
This is a great discussion. I'm interested in whether we can provide guidance to data publishers on how to represent these kinds of geocoding so that it's obvious to users how the location has been determined, for example in the dataGeneralizations field. Or does coordinateUncertainty give us enough? I've had an issue pop up with iNaturalist lately where an obscured platypus record (random placement in a 0.2 × 0.2 degree grid cell) ended up in a dam that doesn't have platypus, and it took a bit of investigation to work out that it had been obscured. The record states 28 km uncertainty, but the fact that it's a random placement distinguishes it from a centroid.
@peggynewman I think coordinateUncertainty and footprintWKT are probably the best fields for publishers to use. The fact that it is a centroid is sort of secondary to the uncertainty problem. I have seen publishers using various fields to indicate that a point is centroid.
A free text search for centroid gives many examples: https://www.gbif.org/occurrence/search?q=centroid
For example, this record says that it is a "centroid of Sweden" but gives an uncertainty of 50m :( https://www.gbif.org/occurrence/3431110088
BTW this is now implemented, so I am closing this issue. https://www.gbif.org/occurrence/map?advanced=1&distance_from_centroid_in_meters=0,0
background
Sometimes data publishers will not know the exact lat-lon location of a record and will enter the lat-lon center of the locality instead. This is a data issue because users might be unaware that an observation is pinned to a locality center and assume it is a precise location.
From previous work it is known that most centroids are coming from museum collections (basisOfRecord=PRESERVED_SPECIMEN).
false positives problem
@MattBlissett has pointed out that in some cases, if we were to flag records, we would end up flagging many "non-centroid" false positives.
The figure shows the UK centroid, where many non-centroid human observations are mixed with fewer "real centroids", likely from retrospectively geocoded records. Many museum records sit directly on the centroid, but as a user you are probably also concerned with the few museum records somewhat further away from it. (Rings are 2 km and 5 km buffers.)
Publishers would probably not like to have many records flagged that just happen to be near centroids.
Of course, for some centroids this isn't a problem at all.
users vs publishers
Users who want to filter for centroid locations are more interested in making sure no outliers make it into their models than in avoiding false positives. So most users would rather over-flag centroids.
Publishers would rather we be more judicious with flagging, so their datasets don't get littered with false positives.
This is why I recommend treating centroids more neutrally as a geocoded location rather than a data quality flag.
data quality flag vs location filter
Fake potential UI below.
In my view, thinking about centroids as useful, interesting neutral locations rather than as data quality problems/flags makes centroids easier to work with. Since we are never going to eliminate all false positives, it makes more sense to treat centroids as locations. This becomes even more apparent when we start talking about province and state centroids, which are also useful locations, but will produce even more "false positives".
There are some really small provinces but it would still be useful to geocode the centroids.
One disadvantage to treating centroids as simply geocoded locations would be that we might need to include an additional column in downloads to make it useful for users. Also it is difficult to filter out unwanted records with the current interface.
Below you can review 30 sampled centroids for iso2 places over 30K sq km. It is usually impossible to tell whether a point on a centroid is a "real" centroid, but if a preserved specimen is somewhat close to a known centroid, it is highly likely to be a "real centroid".
This brain dump reflects my current thoughts. Open to any divergent opinions or discussion.
@timrobertson100 @ahahn-gbif @MattBlissett