cgendreau / common-library-proposal

Common biodiversity data library proposal

Country name parsing #3

Open cgendreau opened 12 years ago

cgendreau commented 12 years ago

From any string representing a country, we should be able to get the official name of the country. A dictionary-based approach, like the one GBIF uses in https://code.google.com/p/gbif-common-resources/source/browse/gbif-parsers, works very well. In fact, that library could be reused.
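To make the idea concrete, here is a minimal Python sketch of a dictionary-based lookup. It is not the GBIF parser: the variant table, the `normalize` rules and the function names are all assumptions made up for illustration, and the names returned are only examples of what an "official name" could be.

```python
import unicodedata
from typing import Optional

# Hypothetical sample data: canonical name -> known spellings and codes.
# A real dictionary would be far larger and maintained separately.
_RAW_VARIANTS = {
    "Canada": ["Canada", "CA", "CAN"],
    "United States of America": ["United States", "USA", "U.S.A.", "US"],
    "Côte d'Ivoire": ["Côte d'Ivoire", "Ivory Coast", "CI"],
}


def normalize(raw: str) -> str:
    """Strip accents, case, punctuation and extra whitespace."""
    decomposed = unicodedata.normalize("NFKD", raw)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.casefold())
    return " ".join(cleaned.split())


# Keys are normalized once, with the same function used for incoming values.
LOOKUP = {
    normalize(variant): name
    for name, variants in _RAW_VARIANTS.items()
    for variant in variants
}


def parse_country(raw: str) -> Optional[str]:
    """Return the canonical name, or None when the value is not in the dictionary."""
    return LOOKUP.get(normalize(raw))
```

With this sketch, `parse_country(" canada ")` and `parse_country("CÔTE D'IVOIRE")` both resolve, while an unknown value such as `"Borneo"` yields `None` instead of a guess.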

lfrancke commented 12 years ago

That approach does indeed work remarkably well for its simplicity, but we are running into problems.

For example, someone might provide "Borneo". We used to map this to Indonesia, but it could in fact be Malaysia, Brunei or Indonesia. There are also a whole bunch of old country codes that might map to two new ones. We are getting localized country names as well and are not very good at dealing with those, so there's definitely room for improvement.
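One way to picture the ambiguity is to let the dictionary return a set of candidates instead of a single country. The following Python sketch is only an illustration under that assumption; the table contents and the names `CANDIDATES`, `candidate_countries` and `resolve` are hypothetical and not taken from the GBIF code.

```python
from typing import FrozenSet, Optional

# Hypothetical table: normalized input -> ISO 3166-1 alpha-2 candidates.
CANDIDATES = {
    "indonesia": frozenset({"ID"}),
    "borneo": frozenset({"ID", "MY", "BN"}),       # island shared by three countries
    "czechoslovakia": frozenset({"CZ", "SK"}),     # former country, two successors
}


def candidate_countries(raw: str) -> FrozenSet[str]:
    """Return every country the value could refer to (empty set if unknown)."""
    return CANDIDATES.get(raw.strip().casefold(), frozenset())


def resolve(raw: str) -> Optional[str]:
    """Return a country code only when the value is unambiguous."""
    candidates = candidate_countries(raw)
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown or ambiguous: leave the decision to a later step
```

Here `resolve("Indonesia")` gives `"ID"`, whereas `resolve("Borneo")` gives `None` while `candidate_countries("Borneo")` still exposes the three possibilities for any later check.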

That's where the boundary between validation and verification gets blurry. We usually try to take other attributes of an occurrence into account (such as coordinates), but it's hard to get good shapefiles and so on.

cgendreau commented 12 years ago

I think that with only "Borneo" or "USSR" there is nothing more we can do. But like you said, it is still possible to find the country from the coordinates (if available) or perhaps eventually from the locality. For a first iteration, I would suggest that we not parse country names that we cannot map directly. Organizations could still do a second pass and use their own strategy. In a second iteration, it would be possible to do cross-validation, and then we could take the coordinates into account.
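The two iterations could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the record fields (`country`, `lat`, `lon`, `countryCode`), the `DIRECT` table, and especially `rough_box_strategy`, which uses a very rough, made-up bounding box rather than real boundary data.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]
Strategy = Callable[[Record], Optional[str]]

# Hypothetical direct mappings for the first iteration.
DIRECT = {"canada": "CA", "indonesia": "ID", "brunei": "BN"}


def first_pass(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Resolve only values that map directly; leave the rest untouched."""
    resolved, unresolved = [], []
    for record in records:
        key = str(record.get("country", "")).strip().casefold()
        code = DIRECT.get(key)
        if code is None:
            unresolved.append(record)
        else:
            resolved.append({**record, "countryCode": code})
    return resolved, unresolved


def second_pass(
    unresolved: List[Record], strategy: Strategy
) -> Tuple[List[Record], List[Record]]:
    """Let each organization plug in its own strategy for the leftovers."""
    resolved, still_unresolved = [], []
    for record in unresolved:
        code = strategy(record)
        if code is None:
            still_unresolved.append(record)
        else:
            resolved.append({**record, "countryCode": code})
    return resolved, still_unresolved


def rough_box_strategy(record: Record) -> Optional[str]:
    """Toy coordinate check; the box is illustrative, not a real boundary."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return None
    if 4.0 <= lat <= 5.1 and 114.0 <= lon <= 115.4:
        return "BN"
    return None
```

A record with country "Borneo" and coordinates inside the toy box is skipped by `first_pass` and picked up by `second_pass(unresolved, rough_box_strategy)`, while "USSR" without coordinates stays unresolved in both passes.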

peterdesmet commented 11 years ago

I assume some of this is now implemented in the Narwhal Processor? https://github.com/Canadensys/narwhal-processor/wiki/CountryProcessor Can this issue be closed?