ppKrauss opened this issue 7 years ago
Good suggestion @ppKrauss
Is there any listing of these stable country pages on wikidata? I've not found a listing/category for these or a way to crawl/fetch them all programmatically
Hi @ewheeler, thanks (!), I will check the best strategy next week. There are two ways:

1. Use a list of countries at Wikipedia as the source, parsing it with a small adaptation of this wikitext2CSV script. Audit advantage: it is human-readable and audited by the English-Wikipedia community.
2. Use SPARQL and trust only Wikidata, looking for all instances of Q6256... Or use some trusted DBpedia algorithm (as Wikidata curators do) to get it.

Option 2 is the ideal solution and generates the CSV automatically.
Testing the solution of option 2:
```sparql
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```
Run this query at the Wikidata Query Service and download the result as CSV to check the JOIN.
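The CSV can also be fetched programmatically instead of through the web UI. A minimal Python sketch, assuming the public WDQS endpoint and its CSV content negotiation (the helper name `build_request` is mine):

```python
# Sketch: build a GET request that downloads the SPARQL result as CSV
# from the Wikidata Query Service.
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

COUNTRY_QUERY = """\
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

def build_request(query: str):
    """Return (url, headers) for a GET that yields the result as CSV."""
    url = WDQS_ENDPOINT + "?" + urlencode({"query": query})
    headers = {"Accept": "text/csv"}  # WDQS content negotiation
    return url, headers

url, headers = build_request(COUNTRY_QUERY)
# Fetch with e.g. urllib.request.urlopen(urllib.request.Request(url, headers=headers))
```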
Perhaps better: a CSV with only the Wikidata-ID and 2-letter-country-code columns:
```sparql
SELECT *
WHERE {
  ?item wdt:P297 ?code
} ORDER BY ?code
```
Run it at the Wikidata Query Service and download as CSV.
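Once downloaded, that CSV is easy to turn into a lookup table. A small Python sketch (the function name and sample rows are mine; the Q155/BR and Q55/NL values match the ones discussed in this thread):

```python
# Turn the CSV returned by the P297 query into a {code: wd_id} mapping,
# stripping the entity-URI prefix from each item.
import csv
import io

def csv_to_code_map(csv_text: str) -> dict:
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["code"]: row["item"].rsplit("/", 1)[-1] for row in reader}

sample = (
    "item,code\n"
    "http://www.wikidata.org/entity/Q155,BR\n"
    "http://www.wikidata.org/entity/Q55,NL\n"
)
print(csv_to_code_map(sample))  # {'BR': 'Q155', 'NL': 'Q55'}
```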
Hi @ewheeler, can you help check the cause of the errors at https://github.com/ppKrauss/country-codes ?
The dataset is good, but `goodtables datapackage.json` in the terminal says otherwise.
I am using SQL to check and JOIN... The JOIN is:

```sql
SELECT c.*, w.item AS "wdId"
FROM dataset.vw_country_codes c
LEFT JOIN wikidata_country w
  ON w.code = c.iso3166_1_alpha_2
  AND c.iso3166_1_alpha_2 IS NOT NULL
  AND w.item NOT IN ('Q165783', 'Q2895', 'Q1249802', 'Q29999', 'Q407199', 'Q838261')
```
The `wdId` nulls are for Namibia and Sark only.
| item | code | action |
|---|---|---|
| Q165783 | BQ | delete |
| Q27561 | BQ | preserve |
| Q2895 | BY | delete |
| Q184 | BY | preserve |
| Q1249802 | FK | delete |
| Q9648 | FK | preserve |
| Q29999 | NL | delete |
| Q55 | NL | preserve |
| Q407199 | PS | delete |
| Q219060 | PS | preserve |
| Q838261 | YU | delete |
| Q83286 | YU | preserve |
The duplicated pairs come from Wikidata records for "grouping nations", such as the "Kingdom of the Netherlands" in the NL pair.
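The same `NOT IN (...)` filter can be expressed outside SQL. A Python sketch using the delete-list from the table above (the function name is mine):

```python
# Drop the Wikidata items marked "delete" above, leaving one item per
# ISO code; this mirrors the NOT IN (...) clause of the SQL join.
DELETE_IDS = {"Q165783", "Q2895", "Q1249802", "Q29999", "Q407199", "Q838261"}

def dedupe(pairs):
    """pairs: iterable of (wd_id, code) tuples -> {code: wd_id}."""
    return {code: wd_id for wd_id, code in pairs if wd_id not in DELETE_IDS}

pairs = [("Q165783", "BQ"), ("Q27561", "BQ"), ("Q2895", "BY"), ("Q184", "BY")]
print(dedupe(pairs))  # {'BQ': 'Q27561', 'BY': 'Q184'}
```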
Hi @ewheeler, sorry for coming back so late... Now the problems are solved; everything is automatic.
Submitting pull request #65 to add `sh wd_countries.sh` to your Makefile.
If you prefer to adapt your Python scripts for the join, it adds a new column `wd_id`. You can join the tables on `iso2_code = ISO3166-1-Alpha-2`.
Only Sark is not there, because it has no `iso2_code`, but you can add it as Q3405693.
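One possible shape for that Python-side join, as a sketch (the function name is mine, and the `official_name_en` column used for the Sark special case is an assumption about the dataset's schema):

```python
# Add a wd_id column to the country rows by matching the alpha-2 code
# against the Wikidata code map; Sark is patched in by hand because it
# has no ISO 3166-1 alpha-2 code.
def add_wd_id(rows, code_to_wd):
    out = []
    for row in rows:
        code = row.get("ISO3166-1-Alpha-2")
        new_row = dict(row, wd_id=code_to_wd.get(code, ""))
        if new_row.get("official_name_en") == "Sark":  # assumed column name
            new_row["wd_id"] = "Q3405693"
        out.append(new_row)
    return out

rows = [{"ISO3166-1-Alpha-2": "BR", "official_name_en": "Brazil"}]
print(add_wd_id(rows, {"BR": "Q155"}))
```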
Wikidata has persistent IDs (it's safe!), so the rule of thumb is to preserve the older Wikidata ID (`wd_id`) of a country when somebody tries to duplicate it by editing Wikidata. For "future new nations" the rule is to check the Wikidata item at the stable English Wikipedia page. The "manual filter" is the `grep` line in `wd_countries.sh`, and it is cumulative.
What is the blocker at the moment? Is any help needed on this? :) Thank you so much!
@valerio-bozzolan PR is welcome to add this.
Cool stuff - I'm only seeing this now 👍🏼 We have this old PR that we should merge: #65
Wikipedia has stable pages for all countries, and Wikidata supplies an ID for each. Today Wikidata IDs play an important role as "concept identifiers", for the Semantic Web in general and for open projects like OpenStreetMap, etc.
Example: BR is https://www.wikidata.org/wiki/Q155 , so the `wd_id` column of line `BR` is `Q155`. With the Wikidata API we can fill the `wd_id` column automatically.