VertNet / DwCVocabs

Real-world values for Darwin Core terms
GNU General Public License v2.0

iDigBio DQ flags - list(s) resource #9

JoannaMcCaffrey closed 8 years ago

JoannaMcCaffrey commented 8 years ago

a compilation of lists: https://github.com/iDigBio/idb-backend/blob/43130d8ad65867436d8f0ba7b7ddd6ee0fc4de44/idb/data_tables/locality_data.py

comprises: 2-3 char ISO countryCode mapping, 'none' options, continent, country, stateProvince, major bodies of water, country string to ISO code

tucotuco commented 8 years ago

@JoannaMcCaffrey Thank you for the reference. As an issue, is there an associated task to do?

I have had a look at those tables and they look like they were generated algorithmically and retain a worrisome percentage of systemic errors. Here are a few examples:

u"eurasia": "asia", u"neotropical / central america": "north america", u"indian & n pacific": "indian ocean", u"antartic-pacific": "pacific ocean", u"arctic": "arctic ocean", u"southern": "antarctic ocean",

u"burundi/zaire": "burundi", u"new grenada": "grenada", u"india/nepal": "nepal", u"czech republic/germany/poland": "czech republic", u"french guyana": "guyana",

u"alabama-mississippi state line": "alabama", u"california/colorado/oregon/washington/wyoming/british columbia": "california", u"w. t. [=wa]": "washington",

u"france": ["europe"], u"port of spain": "spain",

godfoder commented 8 years ago

Those strings were algorithmically matched, but mostly hand verified. There are some known errors in the tables, and some oversights on my part where I didn't catch close matches (the guinea* and guyana* entries are particularly troublesome to match). Since these corrections are intended to increase the discoverability of records, we err on the side of providing a single, mostly useful value over exact replication/rectification of a provider string. Yes, sometimes this means we end up picking one among many values.
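To make the "single mostly useful value" idea concrete, here is a minimal sketch of how such a lookup might behave. The names `COUNTRY_TRANSLATIONS` and `canonical_country` are hypothetical and are not taken from idb-backend; the entries are only the ones quoted in this thread, and the lowercase/strip normalization is an assumption.

```python
# Hypothetical excerpt of a translation table. The entries are the ones
# quoted in this thread, not a copy of the actual iDigBio data.
COUNTRY_TRANSLATIONS = {
    "burundi/zaire": "burundi",
    "india/nepal": "nepal",
    "czech republic/germany/poland": "czech republic",
}


def canonical_country(provider_value):
    """Return one searchable country value for a provider string.

    Returns None when the string is missing or has no entry, so the
    caller can decide whether to keep the original value.
    """
    if provider_value is None:
        return None
    # Assumed normalization: trim whitespace and compare case-insensitively.
    key = provider_value.strip().lower()
    return COUNTRY_TRANSLATIONS.get(key)


print(canonical_country("India/Nepal"))
print(canonical_country("Atlantis"))
```

A multi-country string such as "India/Nepal" collapses to one value here, which helps a search on that value find the record but loses the other candidates. That is the trade-off described above.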

Of the ones you linked, only

u"new grenada": "grenada", u"french guyana": "guyana",

need fixing according to my criteria.

"port of spain" is also wrong, but mostly because it is a city name in the country field. I might delete it from the translation table.

u"france": ["europe"],

is from a different type of dictionary than the others, and labels France as a country on the continent of Europe, which I don't think is in dispute (territorial holdings aside).
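The difference between the two dictionary shapes can be sketched as follows. The variable and function names are hypothetical, and the consistency check is only one plausible use of the list-valued table, not a description of what idb-backend does with it.

```python
# Translation table: provider string -> one corrected value.
# (Hypothetical name; the entry is quoted from this thread.)
country_translation = {"burundi/zaire": "burundi"}

# Membership table: country -> list of continents it is considered part of.
# (Hypothetical name; the entry is quoted from this thread.)
country_continents = {"france": ["europe"]}


def continent_is_consistent(country, continent):
    """Return True if the continent is listed for the given country."""
    return continent in country_continents.get(country, [])


print(continent_is_consistent("france", "europe"))
print(continent_is_consistent("france", "asia"))
```

The list value leaves room for a country to be listed under more than one continent, which a string-to-string translation entry cannot express.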

godfoder commented 8 years ago

As for action, perhaps Joanna intended to suggest that you merge those terms into your spreadsheets. I think the two datasets have entirely different targets, so I would advise against that course of action. You can probably close this issue.