Canadensys / narwhal-processor

Basic data processing library aiming to normalize similar values in a known format.
MIT License

Populate internal country database using semantic Web #14

Open cgendreau opened 11 years ago

cgendreau commented 11 years ago

It would be interesting to extend narwhal so that it can build an up-to-date, well-maintained knowledge base of country names, their alternative representations (possibly multilingual), and mappings from known misspellings, using linked open data (semantic Web).

This could be done starting from a semantic Web URI, something like: http://dbpedia.org/page/Category:Member_states_of_the_United_Nations

A country could then be identified with a URI such as http://dbpedia.org/resource/Canada. The name of a country in different languages could be populated using "owl:sameAs". The known misspellings could be handled using SKOS.
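To make the idea above concrete, here is a minimal sketch of what one entry could look like once pulled into memory. Everything in it is an assumption for illustration: `CountryConcept` and `build_index` are hypothetical names (not part of narwhal), the multilingual names are assumed to arrive as language-tagged labels, and the misspellings are modelled after SKOS's `skos:hiddenLabel`, which is the SKOS property intended for misspelled or otherwise hidden search terms. The sample misspellings are made up.

```python
from dataclasses import dataclass, field


@dataclass
class CountryConcept:
    """One country, keyed by a linked-data URI (hypothetical structure)."""
    uri: str
    labels: dict[str, str] = field(default_factory=dict)   # language tag -> name
    hidden_labels: set[str] = field(default_factory=set)   # known misspellings


def build_index(concepts):
    """Map every known name or misspelling (case-folded) to its country URI."""
    index = {}
    for concept in concepts:
        for name in list(concept.labels.values()) + sorted(concept.hidden_labels):
            index[name.casefold()] = concept.uri
    return index


canada = CountryConcept(
    uri="http://dbpedia.org/resource/Canada",
    labels={"en": "Canada", "fr": "Canada", "de": "Kanada"},
    hidden_labels={"Cananda", "Canda"},  # illustrative misspellings, not real data
)
index = build_index([canada])
print(index["kanada"])
```

Keeping the URI as the key means the preferred name, translations and misspellings all resolve to the same identifier, whatever source they came from.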

For performance reasons, we'd like this thesaurus to be embedded in the library, but with the capacity to be periodically refreshed with data pulled from external resources (as is currently the case through the gbif-parser).
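A rough sketch of the "embedded but refreshable" idea, under stated assumptions: `CountryThesaurus` is a hypothetical class, `fetch` stands in for whatever would pull data from an external resource, and the one-day default age is arbitrary. It does not reflect how narwhal or the gbif-parser actually do this.

```python
import time


class CountryThesaurus:
    """Serves lookups from an embedded snapshot and refreshes it on demand."""

    def __init__(self, embedded, fetch, max_age_seconds=86400.0, clock=time.monotonic):
        self._entries = dict(embedded)   # snapshot shipped with the library
        self._fetch = fetch              # hypothetical callable returning a fresh dict
        self._max_age = max_age_seconds
        self._clock = clock
        self._loaded_at = clock()

    def lookup(self, name):
        # Keys are expected to be case-folded already.
        return self._entries.get(name.casefold())

    def refresh_if_stale(self):
        """Replace the snapshot if it is too old; keep it if the fetch fails."""
        if self._clock() - self._loaded_at < self._max_age:
            return False
        try:
            fresh = self._fetch()
        except Exception:
            return False                 # external source unavailable: keep old data
        self._entries = dict(fresh)
        self._loaded_at = self._clock()
        return True


thesaurus = CountryThesaurus(
    embedded={"canada": "CA"},
    fetch=lambda: {"canada": "CA", "kanada": "CA"},
    max_age_seconds=0,   # treat the snapshot as immediately stale for the demo
)
refreshed = thesaurus.refresh_if_stale()
```

Lookups never touch the network, and a failed refresh leaves the embedded data in place, so the library keeps working offline.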

Benefits:

rdmpage commented 11 years ago

Have you had a look at GeoNames? Lots of Semantic Web goodness if that's your thing; see http://www.geonames.org/ontology/documentation.html

tucotuco commented 11 years ago

As sources for names and synonyms, there are also The Getty Thesaurus of Geographic Names (http://www.getty.edu/vow/TGNSearchPage.jsp), and GADM (http://www.gadm.org/).

For misspellings, I have accumulated nearly 5000 variants on values mapped to the Darwin Core term country and have provided the corresponding ISO 3166-2 country code for all of the ones for which that is possible. This list is growing as we pass additional data through validation for VertNet.
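For what it's worth, a variant list like this could be applied with a simple normalize-then-look-up step. The sketch below is only an assumption about how that might work: `normalize`, `country_code` and the `VARIANTS` table are hypothetical, the entries are invented for illustration (they are not taken from the VertNet list), and the codes shown are plain two-letter codes.

```python
import re
import unicodedata


def normalize(raw):
    """Reduce a raw country string to a comparable key."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation -> space
    return " ".join(text.casefold().split())    # collapse whitespace


# Illustrative entries only; not taken from the VertNet list.
VARIANTS = {
    normalize(v): code
    for v, code in [("Canada", "CA"), ("Cananda", "CA"), ("México", "MX"), ("U.S.A.", "US")]
}


def country_code(raw):
    """Return the mapped code, or None when the value is unknown."""
    return VARIANTS.get(normalize(raw))
```

Normalizing both the stored variants and the incoming value keeps the table small: differences in case, accents, punctuation and spacing do not each need their own entry, while genuine misspellings still do.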

peterdesmet commented 11 years ago

Just stumbled upon this tool: http://okfnlabs.org/blog/2013/05/16/nomenklatura-matching-service-reconciliation-made-easy.html Might be of help here.

cgendreau commented 11 years ago

I think this is worth mentioning: http://community.gbif.org/pg/file/read/34059/