Closed JoannaMcCaffrey closed 8 years ago
@JoannaMcCaffrey Thank you for the reference. As an issue, is there an associated task to do?
I have had a look at those tables and they look like they were generated algorithmically and retain a worrisome percentage of systematic errors. Here are a few examples:
u"eurasia": "asia", u"neotropical / central america": "north america", u"indian & n pacific": "indian ocean", u"antartic-pacific": "pacific ocean", u"arctic": "arctic ocean", u"southern": "antarctic ocean",
u"burundi/zaire": "burundi", u"new grenada": "grenada", u"india/nepal": "nepal", u"czech republic/germany/poland": "czech republic", u"french guyana": "guyana",
u"alabama-mississippi state line": "alabama", u"california/colorado/oregon/washington/wyoming/british columbia": "california", u"w. t. [=wa]": "washington",
u"france": ["europe"], u"port of spain": "spain",
Those strings were algorithmically matched, but mostly hand verified. There are some known errors in the tables, and some oversights on my part where I didn't catch close matches (the guinea* and guyana* entries are particularly troublesome to match). Since these corrections are intended to increase the discoverability of records, we err on the side of providing a single mostly useful value over exact replication/rectification of a provider string. Yes, sometimes this means we end up picking one among many values.
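To illustrate the "single mostly useful value" approach, here is a minimal sketch of how such a translation table might be applied. The entries are copied from the examples quoted in this thread; the names `COUNTRY_FIXES` and `normalize_country`, and the trim-and-lowercase step, are assumptions for illustration and not the actual idb-backend code.

```python
# Hypothetical excerpt of a translation table; the entries are the ones
# quoted earlier in this thread, not the full table.
COUNTRY_FIXES = {
    u"burundi/zaire": "burundi",
    u"india/nepal": "nepal",
    u"czech republic/germany/poland": "czech republic",
}


def normalize_country(raw, table=COUNTRY_FIXES):
    """Return one searchable value for a provider-supplied country string.

    Assumption: keys are compared after trimming and lowercasing.
    Strings with no entry in the table are passed through unchanged
    (apart from that trimming and lowercasing).
    """
    key = raw.strip().lower()
    return table.get(key, key)
```

With this sketch, a multi-country string such as "India/Nepal" collapses to one of its members, which favours discoverability over exact replication of the provider string.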
Of the ones you linked, only
u"new grenada": "grenada", u"french guyana": "guyana",
need fixing according to my criteria.
"port of spain" is also wrong, but mostly because it is a city name in the country field. I might delete it from the translation table.
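The three entries acknowledged as wrong share a pattern: the replacement is just the last word of the original string. A rough audit along those lines could be sketched as follows. This is a hypothetical heuristic, not something the table's tooling is known to do, and it may also flag entries that are perfectly legitimate, so its output would still need hand review.

```python
def suspicious_suffix_matches(table):
    """List (key, value) pairs where the value is only the trailing word(s) of the key.

    Hypothetical heuristic: such pairs hint that a longer place name was
    matched to a different, shorter one.
    """
    flagged = []
    for key, value in table.items():
        # Skip non-string values such as the list-valued continent entries.
        if isinstance(value, str) and key.endswith(" " + value):
            flagged.append((key, value))
    return sorted(flagged)


# Sample entries copied from this thread.
SAMPLE = {
    u"new grenada": "grenada",
    u"french guyana": "guyana",
    u"port of spain": "spain",
    u"burundi/zaire": "burundi",
    u"india/nepal": "nepal",
    u"france": ["europe"],
}

for key, value in suspicious_suffix_matches(SAMPLE):
    print(key, "->", value)
```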
u"france": ["europe"],
is from a different type of dictionary than the others, and labels france as a country on the continent of europe, which I don't think is in dispute (territorial holdings aside).
As for action, perhaps Joanna intended to suggest that you merge those terms into your spreadsheets. I think the two datasets have entirely different targets, so I would advise against that course of action. You can probably close this issue.
a compilation of lists: https://github.com/iDigBio/idb-backend/blob/43130d8ad65867436d8f0ba7b7ddd6ee0fc4de44/idb/data_tables/locality_data.py
comprises:
- 2-3 char ISO countryCode mapping
- 'none' options
- continent
- country
- stateProvince
- major bodies of water
- country string to ISO code