datasets / country-codes

Comprehensive country code information, including ISO 3166 codes, ITU dialing codes, ISO 4217 currency codes, and many others
https://datahub.io/core/country-codes

Include column for Wikidata identifier, suggestion #53

Open ppKrauss opened 7 years ago

ppKrauss commented 7 years ago

Wikipedia has stable pages for all countries, and Wikidata supplies an ID for each of them. Today Wikidata IDs play an important role as "concept identifiers", for the Semantic Web in general and for open projects like OpenStreetMap.

Example: BR is https://www.wikidata.org/wiki/Q155 , so the wd_id column of the BR row is Q155. With the Wikidata API we can fill the wd_id column automatically.

ewheeler commented 7 years ago

Good suggestion @ppKrauss

Is there any listing of these stable country pages on Wikidata? I've not found a listing/category for these, or a way to crawl/fetch them all programmatically.

ppKrauss commented 7 years ago

Hi @ewheeler, thanks (!), I will check the best strategy next week. There are two ways:

  1. Use a list of countries at Wikipedia as the source, parsing it with a small adaptation of this wikitext2CSV script. Advantages for auditing: it is human-readable and reviewed by the English Wikipedia community.

  2. Use SPARQL and trust only Wikidata, looking for all instances of Q6256 (country)... Or use some trusted algorithm from DBpedia (as Wikidata curators) to get it.

Item 2 is the ideal solution and generates the CSV automatically.

ppKrauss commented 7 years ago

Testing the solution from item 2:

SELECT ?item ?itemLabel 
WHERE {
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Run this query here and download the result as CSV to check the JOIN.


Perhaps better: a CSV with only two columns, the Wikidata ID and the 2-letter country code (property P297):

SELECT * 
WHERE {
  ?item wdt:P297 ?code
} ORDER BY ?code

here.
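
A minimal sketch of how the downloaded CSV could be turned into a lookup table in Python. It assumes the export has the two columns named in the query, item and code, and that item holds a full entity URI such as http://www.wikidata.org/entity/Q155. The function name and the sample rows are illustrative only.

```python
import csv
import io


def load_wikidata_codes(csv_text):
    """Map each 2-letter code to the list of Wikidata IDs that claim it."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed URI form: http://www.wikidata.org/entity/Q155 -> "Q155"
        qid = row["item"].rsplit("/", 1)[-1]
        mapping.setdefault(row["code"], []).append(qid)
    return mapping


# Illustrative sample in the assumed export format (not real query output).
sample = (
    "item,code\n"
    "http://www.wikidata.org/entity/Q155,BR\n"
    "http://www.wikidata.org/entity/Q29999,NL\n"
    "http://www.wikidata.org/entity/Q55,NL\n"
)
codes = load_wikidata_codes(sample)
print(codes)
```

Keeping a list per code makes duplicated codes such as NL visible instead of silently overwriting one ID with another.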

ppKrauss commented 7 years ago

Migration problem

Hi @ewheeler, can you help to check the cause of the errors at https://github.com/ppKrauss/country-codes ? The dataset is good, but running goodtables datapackage.json in the terminal says it is not.

Wikidata minor problem

I am using SQL to check the data and do the JOIN. The JOIN is:

  SELECT  c.*, w.item as "wdId" 
  FROM dataset.vw_country_codes c LEFT JOIN wikidata_country w 
    ON w.code=c.iso3166_1_alpha_2 AND c.iso3166_1_alpha_2 IS NOT NULL 
    AND w.item NOT IN ('Q165783', 'Q2895', 'Q1249802', 'Q29999', 'Q407199', 'Q838261')

The only rows with a null wdId are Namibia and Sark.

item      code  action
Q165783   BQ    delete
Q27561    BQ    preserve
Q2895     BY    delete
Q184      BY    preserve
Q1249802  FK    delete
Q9648     FK    preserve
Q29999    NL    delete
Q55       NL    preserve
Q407199   PS    delete
Q219060   PS    preserve
Q838261   YU    delete
Q83286    YU    preserve

The duplicated pairs come from Wikidata records for "grouping nations", such as "Kingdom of the Netherlands" in the NL pair.
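
The same JOIN can be sketched in plain Python, which may help when checking the result outside a database. The row dictionaries, the name column and the function name below are assumptions for illustration; only the iso3166_1_alpha_2, item, code and wdId names and the excluded IDs come from the SQL above.

```python
# IDs excluded by the SQL above (the rows marked "delete" in the table).
EXCLUDED = {"Q165783", "Q2895", "Q1249802", "Q29999", "Q407199", "Q838261"}


def left_join_wd(countries, wikidata_rows):
    """Attach a wdId to every country row; unmatched rows get None."""
    by_code = {}
    for w in wikidata_rows:
        if w["item"] not in EXCLUDED:
            # Assumes one remaining ID per code once exclusions are applied.
            by_code[w["code"]] = w["item"]
    joined = []
    for c in countries:
        code = c.get("iso3166_1_alpha_2")
        # A missing code never matches, mirroring the IS NOT NULL condition.
        wd_id = by_code.get(code) if code else None
        joined.append({**c, "wdId": wd_id})
    return joined


# Hypothetical sample rows, not the real dataset.
countries = [
    {"name": "Netherlands", "iso3166_1_alpha_2": "NL"},
    {"name": "Sark", "iso3166_1_alpha_2": None},
]
wikidata_rows = [
    {"item": "Q29999", "code": "NL"},
    {"item": "Q55", "code": "NL"},
]
result = left_join_wd(countries, wikidata_rows)
```

As with the LEFT JOIN, every country row survives; rows without a match simply carry an empty wdId that can be reviewed by hand.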

ppKrauss commented 6 years ago

Hi @ewheeler, sorry for coming back so late ... The problems are now solved, and everything is automatic.

Submitting pull request 65 to add sh wd_countries.sh to your makefile.

If you prefer to adapt your Python scripts to do the join, it adds a new column, wd_id. You can join the tables on iso2_code = ISO3166-1-Alpha-2.

Only Sark is missing, because it has no iso2_code, but you can add it as Q3405693.

Wikidata has persistent IDs (it's safe!), so the rule of thumb is to preserve the older Wikidata ID (wd_id) of a country when somebody tries to duplicate it by editing Wikidata. For "future new nations" the rule is to check the Wikidata item of the stable English Wikipedia page. The "manual filter" is the grep line in wd_countries.sh, and it is cumulative.
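
One way to sketch the "preserve the older ID" rule in Python, assuming that a lower Q-number means an older item (treat this as a heuristic, not a guarantee). This is not the grep filter from wd_countries.sh; it is an illustrative alternative with hypothetical function names, and it happens to reproduce every "preserve" choice in the table of duplicates earlier in this thread.

```python
def qid_number(qid):
    """Numeric part of a Wikidata ID, e.g. "Q155" -> 155."""
    return int(qid[1:])


def keep_oldest(pairs):
    """For each code keep the ID with the lowest Q-number.

    Assumption: a lower Q-number means an older item.
    """
    kept = {}
    for qid, code in pairs:
        if code not in kept or qid_number(qid) < qid_number(kept[code]):
            kept[code] = qid
    return kept


# (item, code) pairs copied from the table of duplicates in this thread.
duplicates = [
    ("Q165783", "BQ"), ("Q27561", "BQ"),
    ("Q2895", "BY"), ("Q184", "BY"),
    ("Q1249802", "FK"), ("Q9648", "FK"),
    ("Q29999", "NL"), ("Q55", "NL"),
    ("Q407199", "PS"), ("Q219060", "PS"),
    ("Q838261", "YU"), ("Q83286", "YU"),
]
preserved = keep_oldest(duplicates)
```

A heuristic like this could flag new duplicates for review, while the final decision still follows the check against the English Wikipedia page described above.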

valerio-bozzolan commented 2 years ago

What is the blockage at the moment? Is any help needed on this? :) Thank you so much!

rufuspollock commented 2 years ago

@valerio-bozzolan PR is welcome to add this.

anuveyatsu commented 1 month ago

Cool stuff - I'm only seeing this now 👍🏼 we have this old PR that we should merge #65