EticaAI / lexicographi-sine-finibus

Lexicographī sine fīnibus

Mappings from UN P-Codes and Wikidata Q IDs #45

Open fititnt opened 2 years ago

fititnt commented 2 years ago

While the mapping at admin0 ("country level") is straightforward (since we can map the ISO 3166-1 codes used as P-Code prefixes to UN M49), things already get tricky at administrative boundary level 1. We know the UN P-Code patterns of at least some regions (such as the P-Codes for Brazil), where we are even lucky enough to have a mapping ready to use, like https://www.wikidata.org/wiki/Property:P1585. But we are not sure about the rest.

Why such mappings become relevant

Even if we only manage to make mappings for the best case of admin1, and only specific administrative regions end up very detailed, this alone would already allow getting more data from Wikidata, which is by far the place most used by different people and organizations. I personally think it is worth publishing such mappings (at least as soon as they get decent) as a dedicated public domain dataset, so ITOS or OCHA can at least use them, even if only for internal comparisons. However, this "soon" can take time, and for population statistics such as #43 it is likely that the data from such mappings would be less accurate than what OCHA has, especially for countries with an active crisis.

In any case, these mappings would start to let us derive many more mappings (including OpenStreetMap and UN/LOCODE). But by no means do I think this will be ready anytime soon (assuming it is something that could be ready at all, since regions can change over time).

Potential approaches

Tooling specialized to integrate intermediate controlled vocabularies

Note: by "intermediate controlled vocabularies" we mean anything that could be used to triangulate what could later be assumed to be an exact match with P-Codes.

This topic alone will require creating several scripts and strategies (even if the early ones become less necessary in the medium term) to start learning how to make the other relations. The ones we should pay more attention to are those worth running from time to time to discover new changes.

1. (Not sure, needs testing) maybe compare by matching geometries

At the moment we have not attempted to run tools that could do any type of matching by geometries, but while this would definitely need human intervention, maybe it could work.

To reach this point, we would not only need to create the scripts, but likely also let them run periodically (maybe weekly or monthly) to check the official COD-ABs against whatever geometries Wikidata has.
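As an illustration only (not something we have implemented), a minimal sketch of how such a geometry comparison could look, assuming a local COD-AB GeoJSON export and a list of Wikidata candidates with P3896 geoshape URLs; the file path, the ADM1_PCODE property name and the 0.9 threshold are assumptions, not project conventions.

# Sketch only: compare COD-AB geometries against Wikidata geoshapes (P3896) by
# intersection-over-union and propose candidate P-Code -> QID pairs for human review.
# Paths, the ADM1_PCODE field name and the threshold are hypothetical.
import json

import requests
from shapely.geometry import shape
from shapely.ops import unary_union


def geometry_from_geojson(geojson_obj):
    """Accept a bare geometry, a Feature or a FeatureCollection; return one geometry."""
    if geojson_obj.get("type") == "FeatureCollection":
        return unary_union([shape(f["geometry"]) for f in geojson_obj["features"]])
    if geojson_obj.get("type") == "Feature":
        return shape(geojson_obj["geometry"])
    return shape(geojson_obj)


def fetch_wikidata_geoshape(url):
    """P3896 resolves to GeoJSON hosted on Commons; map pages wrap it under a 'data' key."""
    payload = requests.get(url, timeout=30).json()
    return geometry_from_geojson(payload.get("data", payload))


def iou(geom_a, geom_b):
    """Intersection over union: 1.0 means the two shapes cover exactly the same area."""
    union_area = geom_a.union(geom_b).area
    return 0.0 if union_area == 0 else geom_a.intersection(geom_b).area / union_area


def propose_matches(codab_geojson_path, wikidata_candidates, threshold=0.9):
    """wikidata_candidates: iterable of (qid, geojson_url). Returns (pcode, qid, score)."""
    with open(codab_geojson_path, encoding="utf-8") as handle:
        features = json.load(handle)["features"]
    candidates = [(qid, fetch_wikidata_geoshape(url)) for qid, url in wikidata_candidates]
    proposals = []
    for feature in features:
        pcode = feature["properties"].get("ADM1_PCODE")
        geom = shape(feature["geometry"])
        for qid, candidate_geom in candidates:
            score = iou(geom, candidate_geom)
            if score >= threshold:
                proposals.append((pcode, qid, round(score, 3)))
    return proposals

Whatever the scoring, the output here is only a list of candidates; a human would still need to confirm each pair before anything is treated as an exact match.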

2. Trying to reverse engineer the numeric part of P-Codes (and hoping Wikidata properties with them already exist)

Since the documentation on how to design P-Codes has, for more than a decade, recommended trying to reuse existing country codes, it is likely that more regions have equivalences such as the IBGE code (P1585). The only thing we are sure of is that all P-Codes without the admin0 prefix are fully numeric (with few exceptions), so this already excludes a lot of potential existing code systems.

However, the new problem becomes whether other countries actually have such mappings as Wikidata properties (and whether those mappings are kept as up to date as P1585). Otherwise, even if we could learn, country by country, how the P-Codes were designed without trial and error (and whether they really are a 1:1 match with P-Codes, which, again, we cannot take for granted without human intervention), we could not use them.
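As a concrete, hedged example of this strategy using the one mapping we already know (P1585 for Brazil, cited above), a sketch could build candidate P-Codes by prefixing the IBGE code with "BR" and check which of them appear in a COD-AB export. The CSV path and the ADM2_PCODE column name are assumptions about the file layout, and any match would still need review.

# Sketch only: candidate P-Codes for Brazil as "BR" + IBGE code (P1585), compared
# against a hypothetical COD-AB CSV export with an ADM2_PCODE column.
import csv

import requests

WDQS = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?ibge WHERE {
  ?item wdt:P1585 ?ibge .   # IBGE code, the Brazilian mapping cited above
}
"""


def candidate_pcodes_from_wikidata():
    response = requests.get(WDQS, params={"query": QUERY, "format": "json"}, timeout=120)
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    return {"BR" + row["ibge"]["value"]: row["item"]["value"].rsplit("/", 1)[-1]
            for row in rows}


def match_against_codab(codab_csv_path):
    candidates = candidate_pcodes_from_wikidata()
    matched, unmatched = {}, []
    with open(codab_csv_path, encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            pcode = row["ADM2_PCODE"]
            if pcode in candidates:
                matched[pcode] = candidates[pcode]
            else:
                unmatched.append(pcode)
    return matched, unmatched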

In any case, whatever the strategy to map P-Codes to Wikidata Q IDs turns out to be, we would need to document it very well to allow revision.

3. Other inferences

There are several other codes on Wikidata, from OpenStreetMap (https://www.wikidata.org/wiki/Property:P402), UN/LOCODE (https://www.wikidata.org/wiki/Property:P1937) and HASC (https://www.wikidata.org/wiki/Property:P8119) to a popular one, GeoNames (https://www.wikidata.org/wiki/Property:P1566; for this one I am not sure why it sometimes has more than one code for the same place). They might allow some way to triangulate with P-Codes, but I am not sure at the moment.

fititnt commented 2 years ago

@TODO edit one by one and add missing UN m49 directly on Wikidata

Wikidata does have most of the ISO 3166-1 numeric codes (which, by definition, are equivalent to UN M49 codes). However, the UN M49 codes themselves are present on only 16 of the 177 items that have ISO 3166-1 numeric.

I am not sure whether to automate this or do it one by one, but even if it takes some time, it makes sense to add the missing numeric codes so we can simplify the queries.
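If we do try to semi-automate it, a small sketch like the one below could at least produce the work queue: it lists country items that already have ISO 3166-1 numeric (P299) but still lack UN M49 (P2082). This is only a sketch against the public query endpoint; nothing here writes to Wikidata.

# Sketch only: list country items with ISO 3166-1 numeric (P299) but no UN M49 (P2082),
# as a work queue for the manual additions.
import requests

WDQS = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?itemLabel ?iso3166p1n WHERE {
  ?item wdt:P31 wd:Q6256 ;
        wdt:P299 ?iso3166p1n .
  FILTER NOT EXISTS { ?item wdt:P2082 [] }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?iso3166p1n
"""


def items_missing_unm49():
    response = requests.get(WDQS, params={"query": QUERY, "format": "json"}, timeout=60)
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        yield (row["iso3166p1n"]["value"],
               row["item"]["value"].rsplit("/", 1)[-1],
               row["itemLabel"]["value"])


if __name__ == "__main__":
    for code, qid, label in items_missing_unm49():
        print(code, qid, label)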

Trivia

From the list of existing codes, it seems previous people started adding the codes from the letter A and then stopped. I'm not even surprised by this: in lexicography it is quite common for the first letters to be done with far more attention/time and then for the rest to be rushed.

Screenshot from 2022-07-13 23:16:59 (list of existing codes)

Current SPARQL query

This query is somewhat manual, but the idea would be to generate it based on which Wikidata properties are related to adm0.

In any case, this more manual query at least lets us know which properties have more data. Some, like the KML file (https://www.wikidata.org/wiki/Property:P3096), have very few entries, while the geoshape (https://www.wikidata.org/wiki/Property:P3896) seems to exist for every adm0 we're getting (it returns a link to GeoJSON, based on data published by Datahub; not as perfect as the COD-ABs and surely not as updated, but it at least allows quick use of some geometries for other checks).

SELECT
  (xsd:integer(?ix_iso3166p1n) AS ?item__conceptum__codicem)
  (STRAFTER(STR(?item), "entity/") AS ?item__rem__i_qcc__is_zxxx__ix_wikiq)
  #?item__rem__i_qcc__is_zxxx__ix_iso3166p1n
  (GROUP_CONCAT(DISTINCT ?ix_iso3166p1n; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_iso3166p1n)
  (GROUP_CONCAT(DISTINCT ?ix_unm49; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_unm49)
  (GROUP_CONCAT(DISTINCT ?ix_iso3166p1a2; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_iso3166p1a2)
  (GROUP_CONCAT(DISTINCT ?ix_iso3166p1a3; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_iso3166p1a3)
  (GROUP_CONCAT(DISTINCT ?ix_unescothes; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_unescothes)
  (GROUP_CONCAT(DISTINCT ?ix_unagrovoc; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_unagrovoc)
  (GROUP_CONCAT(DISTINCT ?ix_xzosmrel; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_xzosmrel)
  (GROUP_CONCAT(DISTINCT ?ix_xzgeonames; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_xzgeonames)
  (GROUP_CONCAT(DISTINCT ?ix_jpgeolod; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_jpgeolod)
  (GROUP_CONCAT(DISTINCT ?ix_usworldnet; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_usworldnet)
  (GROUP_CONCAT(DISTINCT ?ix_usfactbook; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_usfactbook)
  (GROUP_CONCAT(DISTINCT ?ix_xzgithubt; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_xzgithubt)
  (GROUP_CONCAT(DISTINCT ?ix_zzwgs84point; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_zzwgs84point)
  (GROUP_CONCAT(DISTINCT ?ix_zzgeojson; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_zzgeojson)

WHERE
{
  ?item wdt:P31 wd:Q6256 .
  OPTIONAL { ?item wdt:P2082 ?ix_unm49. }
  ?item wdt:P299 ?ix_iso3166p1n.
  OPTIONAL { ?item wdt:P297 ?ix_iso3166p1a2. }
  OPTIONAL { ?item wdt:P298 ?ix_iso3166p1a3. }
  OPTIONAL { ?item wdt:P3916 ?ix_unescothes. }
  OPTIONAL { ?item wdt:P8061 ?ix_unagrovoc. }
  OPTIONAL { ?item wdt:P402 ?ix_xzosmrel. }
  OPTIONAL { ?item wdt:P1566 ?ix_xzgeonames. }
  OPTIONAL { ?item wdt:P5400 ?ix_jpgeolod. }
  OPTIONAL { ?item wdt:P8814 ?ix_usworldnet. }
  OPTIONAL { ?item wdt:P9948 ?ix_usfactbook. }
  OPTIONAL { ?item wdt:P9100 ?ix_xzgithubt. }
  OPTIONAL { ?item wdt:P625 ?ix_zzwgs84point. }
  OPTIONAL { ?item wdt:P3896 ?ix_zzgeojson. }
}
GROUP BY ?item ?ix_iso3166p1n
ORDER BY ASC(?item__rem__i_qcc__is_zxxx__ix_iso3166p1n)

The result CSV

wikidata-admin0--2022-07-13.csv

fititnt commented 2 years ago

Done (at least the first batch) comparing both Wikidata and the page https://unstats.un.org/unsd/methodology/m49/. All existing ISO 3166-1 numeric codes match UN M49 (I actually used them to be sure the pasted number already existed on each page). Some corner cases do exist (UN M49 528: https://www.wikidata.org/wiki/Q29999 vs https://www.wikidata.org/wiki/Q55, where a historical country exists; I opted to only add UN M49 to non-historical administrative areas).

@TODO also add numeric codes even for items without ISO 3166-1 numeric on Wikidata

While Q29999 is repeated, the 177 lines of data on wikidata-admin0--2022-07-14.csv are still fewer than the 249 entries on the list at https://unstats.un.org/unsd/methodology/m49/. This could be done by manually comparing ISO 3166-1 alpha-3 codes and region names in natural language.
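A sketch of how this comparison could be semi-automated, assuming a local download of the UNSD M49 overview as CSV; the UNSD column names ("ISO-alpha3 Code", "M49 Code", "Country or Area") are assumptions about that export, while the Wikidata column name comes from the query above.

# Sketch only: report UNSD M49 entries whose ISO 3166-1 alpha-3 code does not appear
# in the wikidata-admin0 CSV. File names and UNSD column names are assumptions.
import csv


def alpha3_set(path, column):
    values = set()
    with open(path, encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for value in (row.get(column) or "").split("|"):
                if value.strip():
                    values.add(value.strip())
    return values


def missing_from_wikidata_export(unsd_csv, wikidata_csv):
    have = alpha3_set(wikidata_csv, "item__rem__i_qcc__is_zxxx__ix_iso3166p1a3")
    missing = []
    with open(unsd_csv, encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            alpha3 = (row.get("ISO-alpha3 Code") or "").strip()
            if alpha3 and alpha3 not in have:
                missing.append((row.get("M49 Code"), alpha3, row.get("Country or Area")))
    return missing


if __name__ == "__main__":
    for m49, alpha3, name in missing_from_wikidata_export(
            "unsd-m49.csv", "wikidata-admin0--2022-07-14.csv"):
        print(m49, alpha3, name)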

fititnt commented 2 years ago

Current draft of adm0

We shifted from the hacky way of using ix_iso3166p1n to ix_unm49 for the direct generation of the local numeric identifiers. From now on, it becomes necessary to keep Wikidata up to date.

Trivia: the way Wikidata validation rules work means that an adm0 without ix_iso3166p1a2 would raise warnings, so some countries which are not current (like the Kingdom of the Netherlands (Q29999)) would return results. There is some discussion, such as https://www.wikidata.org/wiki/Talk:Q29999#Is_country_tag_really_appropriate, for cases like this, but since no one is relying much on UN M49 as a key for other Wikidata validation rules, the ones we're using might be easier to maintain.

SELECT
  (xsd:integer(?ix_unm49) AS ?item__conceptum__codicem)
  (STRAFTER(STR(?item), "entity/") AS ?item__rem__i_qcc__is_zxxx__ix_wikiq)
  (GROUP_CONCAT(DISTINCT ?ix_iso3166p1n; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_iso3166p1n)
  (GROUP_CONCAT(DISTINCT ?ix_unm49; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_unm49)
  (GROUP_CONCAT(DISTINCT ?ix_iso3166p1a2; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_iso3166p1a2)
  (GROUP_CONCAT(DISTINCT ?ix_iso3166p1a3; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_iso3166p1a3)
  (GROUP_CONCAT(DISTINCT ?ix_iso3166p2; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_iso3166p2)
  (GROUP_CONCAT(DISTINCT ?ix_unescothes; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_unescothes)
  (GROUP_CONCAT(DISTINCT ?ix_unagrovoc; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_unagrovoc)
  (GROUP_CONCAT(DISTINCT ?ix_xzosmrel; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_xzosmrel)
  (GROUP_CONCAT(DISTINCT ?ix_xzgeonames; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_xzgeonames)
  (GROUP_CONCAT(DISTINCT ?ix_jpgeolod; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_jpgeolod)
  (GROUP_CONCAT(DISTINCT ?ix_usworldnet; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_usworldnet)
  (GROUP_CONCAT(DISTINCT ?ix_usfactbook; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_usfactbook)
  (GROUP_CONCAT(DISTINCT ?ix_xzgithubt; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_xzgithubt)
  (GROUP_CONCAT(DISTINCT ?ix_zzwgs84point; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_zzwgs84point)
  (GROUP_CONCAT(DISTINCT ?ix_zzgeojson; separator = "|") AS ?item__rem__i_qcc__is_zxxx__ix_zzgeojson)

WHERE
{
  ?item wdt:P31 wd:Q6256 ;
  wdt:P2082 ?ix_unm49 .
  OPTIONAL { ?item wdt:P299 ?ix_iso3166p1n . }
  OPTIONAL { ?item wdt:P297 ?ix_iso3166p1a2 . }
  OPTIONAL { ?item wdt:P298 ?ix_iso3166p1a3 . }
  OPTIONAL { ?item wdt:P300 ?ix_iso3166p2 . }
  OPTIONAL { ?item wdt:P3916 ?ix_unescothes . }
  OPTIONAL { ?item wdt:P8061 ?ix_unagrovoc . }
  OPTIONAL { ?item wdt:P402 ?ix_xzosmrel . }
  OPTIONAL { ?item wdt:P1566 ?ix_xzgeonames . }
  OPTIONAL { ?item wdt:P5400 ?ix_jpgeolod . }
  OPTIONAL { ?item wdt:P8814 ?ix_usworldnet . }
  OPTIONAL { ?item wdt:P9948 ?ix_usfactbook . }
  OPTIONAL { ?item wdt:P9100 ?ix_xzgithubt . }
  OPTIONAL { ?item wdt:P625 ?ix_zzwgs84point . }
  OPTIONAL { ?item wdt:P3896 ?ix_zzgeojson . }
}
GROUP BY ?item ?ix_unm49
ORDER BY ASC(?item__conceptum__codicem)

Result CSV

TODOs

Note: the manual addition of the other UN M49 codes on Wikidata is still not done (around 50-70 are missing). For now I am focusing on getting other things working.

Minimum viable product of administrative boundaries level 1

If the country (in the sense of sovereign state (Q3624078)) is already complicated, at adm1 (first subdivision) things get even more interesting. Even an adm0 without disputes can be arranged in ways that allow more than one strategy to reach it. And unless we know upfront every Wikidata Q item for the subregions, the way to get such regions becomes complex.

In practice, we may need to first bootstrap each P-Code to a Q ID, then reuse the Q IDs for everything else. From time to time Wikidata may change some aspects of how the data is queried, so this approach may be more fail-safe.
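A minimal sketch of what that bootstrap could look like: a small local table keyed by P-Code, where anything new enters as "needs-review" and only reviewed rows are reused by later tooling. The file name and columns are hypothetical.

# Sketch only: a local P-Code -> Wikidata QID table, bootstrapped once by whatever
# matching strategy and then reused so later queries go by QID. Names are hypothetical.
import csv
from pathlib import Path

TABLE = Path("pcode-to-wikidata-q.csv")  # columns: pcode, qid, review_status


def load_table():
    if not TABLE.exists():
        return {}
    with TABLE.open(encoding="utf-8") as handle:
        return {row["pcode"]: row for row in csv.DictReader(handle)}


def propose(table, pcode, qid):
    """Record a candidate match; new entries start as 'needs-review' for humans."""
    if pcode not in table:
        table[pcode] = {"pcode": pcode, "qid": qid, "review_status": "needs-review"}
    return table


def save_table(table):
    with TABLE.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["pcode", "qid", "review_status"])
        writer.writeheader()
        for pcode in sorted(table):
            writer.writerow(table[pcode])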

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Country_subdivision/Items does have a query for a quick look at the divisions:

SELECT ?country ?countryLabel ?item ?itemLabel ?level ?expected ?found ?samenumber
WITH {
  SELECT ?item ?expected ?country ?level (COUNT(DISTINCT ?place) AS ?found) {
    ?item wdt:P279* ?acs ; wdt:P17 ?country.
    FILTER NOT EXISTS { ?country wdt:P576 [] }
    ?acs p:P279 [ ps:P279 wd:Q1799794 ; pq:P1545 ?level ] .

    OPTIONAL { ?item wdt:P1114 ?expected }    
    OPTIONAL { 
      ?place p:P31 ?placeStatement .
      ?placeStatement ps:P31 ?item.
      FILTER NOT EXISTS { ?placeStatement wdt:P582 [] }
    }  
  } 
  GROUP BY ?item ?expected ?country ?level
} AS %subdivisions
WHERE {
  INCLUDE %subdivisions.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  BIND(IF(?expected = ?found, "✓", "✘") AS ?samenumber).
} 
ORDER BY ?countryLabel ?level DESC(?expected) ?itemLabel

This query gives a hint about ways to reach the subdivisions and tries to compare whether the numbers match. However, when the numbers do not match, it may be for valid reasons. In general the 'expected' number is lower than what the results find, so I think the simpler cases might be disputed territories or references to items (at the first subdivision) which are historical, like the Kingdom of the Netherlands; but there are so many items that there might not be a direct way to know whether they're historical (or, if there is, the SPARQL would be a huge, huge query).

fititnt commented 2 years ago

Okay. Done. The public repo https://github.com/MDCIII/1603_16_1 (more specifically this commit https://github.com/MDCIII/1603_16_1/commit/7f328dc1755e71866613f5b9f1312f7bb8cee316) is starting to automate administrative boundaries level 0 with far more data.

Notable to-dos

metadata for level 0

Since we're able to generate much more metadata at the "country/territory" level, MDCIII/1603_16_1 (1603_16_1_0) can be expanded by every other 1603_16_N_0 on MDCIII.

Compared to other tasks, this one is more trivial.

Less trivial to-dos

metadata for level 1

Already at level 1, the mappings are hard to scale up, because (if we want to link them) neither the CODs from OCHA nor Wikidata itself consistently uses other vocabularies, such as ISO 3166-2 (https://en.wikipedia.org/wiki/ISO_3166-2).

I am not saying it is not feasible, just that we would need to do several triangulations. The data most likely is already on Wikipedia, but might not be interlinked on Wikidata.

And even if we could extract tables from ISO 3166-2, these nodes would likely need to be matched against P-Codes, and (this is relevant) without requiring overly manual review. In practice, we might already need static checkers or validators to warn about inconsistencies, though from time to time things would still be wrong because of disputed territories.
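A minimal sketch of such a checker, assuming a hypothetical draft CSV with pcode, qid and iso3166p2 columns; it only prints warnings, since disputed or historical territories are expected to trip it from time to time.

# Sketch only: consistency warnings over a level-1 mapping draft. The file name and
# column names (pcode, qid, iso3166p2) are assumptions.
import csv
from collections import Counter


def check_level1_mapping(path):
    with open(path, encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    warnings = []

    pcode_counts = Counter(row["pcode"] for row in rows)
    qid_counts = Counter(row["qid"] for row in rows if row.get("qid"))

    for pcode, count in pcode_counts.items():
        if count > 1:
            warnings.append(f"P-Code {pcode} mapped {count} times")
    for qid, count in qid_counts.items():
        if count > 1:
            warnings.append(f"{qid} mapped to {count} different P-Codes")

    for row in rows:
        iso = row.get("iso3166p2", "")
        # ISO 3166-2 codes start with the alpha-2 prefix (e.g. "BR-SP"); P-Codes use
        # the same alpha-2 prefix, so the prefixes should agree.
        if iso and row["pcode"][:2] != iso.split("-")[0]:
            warnings.append(f"Prefix mismatch: {row['pcode']} vs {iso}")
    return warnings


if __name__ == "__main__":
    for message in check_level1_mapping("level1-mapping-draft.csv"):
        print("WARNING:", message)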

metadata for level 2 and beyond

For Brazil we already know how to do it (IBGE codes are already 1:1 with UN P-Codes), and level 1 of Brazil has few enough states to be worth some manual copy-pasting.

However, for other regions we would already have to start trying checks like geometry matching or something similar. So depending on the focus, we could use Brazil as a benchmark to check whether automation would work.

However, I'm already sure that for small countries this will always be hard. Not surprisingly, things can evolve over time.