
Taxonomic strategy to encode collective of humans (such as population statistics by total and thematic) by P-codes #43

Open fititnt opened 2 years ago

fititnt commented 2 years ago

After #39, we will eventually have to integrate at least population statistics. There are several sources for it, but we need at least to decide on the organization strategy and prepare the tooling.

fititnt commented 2 years ago

Known challenges

1. Primary: Ways to semantically encode with both tabular and graph format

The newer versions of No1 and No11 on CSV already allow generalized conversion to RDF and (this is important) are on the path of allowing related concepts (which are not the same thing). This may seem strange for a tabular database format, but it is quite hard to do both without overcomplicating things for the end user.

We could simply do a 1:1 Wikidata mapping (some attributes already differentiate, like male and female populations), but we should still leave room for more specialized variants.

While not fully self-testable (we still rely only on the frictionless validator and a simplified use of the Apache Jena riot validator), content already published on the fully automated organization @MDCIII already uses the stricter HXL Standard in ways which allow RDF mappings. That's why the way we encode metadata about collections of humans by theme becomes quite a big deal. It matters less for the data itself than for well documented, predictable schemas: these allow tooling integration for data which doesn't need to be public, and for the data which is public, they make it easier for others to convert their data to our taxonomy and reuse everything else.
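As a minimal sketch of this tabular-plus-graph duality (the hashtags, the URN namespace, and the property mapping below are illustrative assumptions, not the project's final schema; real HXL files also carry a human header row above the hashtag row, omitted here for brevity):

```python
# Sketch: mechanically convert one HXL-tagged CSV row into RDF triples,
# so the same file serves both "basic" tabular users and graph users.
import csv
import io

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

# Hypothetical mapping from HXL hashtag+attributes to Wikidata-like properties
HXL_TO_PROPERTY = {
    "#population+total": "P1082",
    "#population+f": "P1539",
    "#population+m": "P1540",
}

CSV_SAMPLE = """\
#adm1+code,#population+total,#population+f,#population+m
MZ01,1000000,510000,490000
"""

WDT = Namespace("http://www.wikidata.org/prop/direct/")

g = Graph()
for row in csv.DictReader(io.StringIO(CSV_SAMPLE)):
    # Hypothetical URN for the administrative boundary identified by P-code
    subject = URIRef("urn:example:pcode:" + row["#adm1+code"])
    for hashtag, prop in HXL_TO_PROPERTY.items():
        g.add((subject, WDT[prop], Literal(int(row[hashtag]), datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```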

2. Secondary: the crawlers

While we obviously can also fetch Wikidata population statistics (and are likely to do this even just for testing the schemas), it is still viable to get population data from other places.

However, this is already a part where I'm not sure it is worth the trouble to focus on already published data on, for example, the Humanitarian Data Exchange (HDX) rather than crawling the common APIs directly. Maybe it could be, but only for a small subset of countries.
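For the Wikidata route, a minimal fetch could look like the sketch below (the query shape and the chosen QIDs are illustrative assumptions, not project code):

```python
# Sketch: fetch total population (P1082) for two example countries
# from the public Wikidata SPARQL endpoint.
import requests

SPARQL = """
SELECT ?item ?itemLabel ?population WHERE {
  VALUES ?item { wd:Q1029 wd:Q155 }   # Mozambique, Brazil
  ?item wdt:P1082 ?population .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "lsf-population-sketch/0.1 (testing schemas)"},
    timeout=60,
)
for row in response.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row["population"]["value"])
```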

3. Comment: on population data for humanitarian use which is actually inference

In the notes about population statistics, at least from Mozambique, comments warn that humanitarians are requesting granular information (such as age by gender and disability) which simply is not available for real; so, apart from some extra human review, these numbers are already simulations.

By no means am I saying these data are bad or not worth it. In fact, when well done, they are cost-effective. But by having other metrics which could act as seeds, users can derive other inferences on demand or compare different sources. This is not something for the short term (maybe not even the mid term), but the level of detail at which we're going to taxonomize would allow it.

However, while it might seem strange, a major feature is the population statistics which already use P-codes. Unless we map the P-codes to Wikidata QIDs, we can't automate several features which we could otherwise get really fast. We're really aiming to get things very well integrated, not mere data hoarding.
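The bridge being argued for is conceptually just a crosswalk; a minimal sketch (the lookup table here is a hypothetical hand-curated example; in practice it would be generated from an authoritative source):

```python
# Sketch of the P-code -> Wikidata QID bridge. The table is hypothetical.
from typing import Optional

PCODE_TO_QID = {
    "MZ": "Q1029",    # Mozambique (country level)
    # Subnational levels (e.g. "MZ01") still need a curated crosswalk.
}

def qid_for_pcode(pcode: str) -> Optional[str]:
    """Return the Wikidata QID for a P-code, or None if not yet mapped."""
    return PCODE_TO_QID.get(pcode)

print(qid_for_pcode("MZ"))  # Q1029
```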

fititnt commented 2 years ago

The ./999999999/0/999999999_521850.py is the program used as the data scraper for some thematic data. Regardless of the challenge pointed out here https://github.com/EticaAI/lexicographi-sine-finibus/issues/45#issuecomment-1193552732 about eventually needing to map 1:1 with P-codes at subnational levels, the amount of thematic data for which we're already able to create tabular data is so high that it brings up the next issue:

To dos

Thematic datasets will need stricter graph mappings

Several sources can describe nearly the same content. HXL could work with plain natural language, but to allow automated documentation (and things like semantic reasoning without a centralized server) we really need very well documented mappings.

This is one of the reasons why the slowdown may be less about data scraping and getting the crawlers working, and more about... how to taxonomize the results. Things which are Wikidata P properties could even use properties which would make sense if ingested back into Wikidata, but we're likely to deal with things that still need to be encoded, yet will need to refer to individual Q items.

The decision about naming things (e.g. the infixes to use for thematic data)

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Regardless of how we use RDF or other mappings, since we don't rely on public URLs but on URNs with very well defined meaning, we need to think carefully about how to organize the structural numbering. From the taxonomy alone it is already possible to infer what the data is about, and it simplifies a lot when the data is ingested into SQL databases (the #37), but we still need to think about the entrypoints.
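To make the "infer what data is about from the numbering alone" point concrete, a sketch of parsing an identifier such as 1603_9966_1 into its numeric infixes (the meaning assigned to each part below is an assumption for illustration, not the project's documented semantics):

```python
# Sketch: split a dataset identifier into structural numeric infixes.
from typing import NamedTuple

class DatasetId(NamedTuple):
    base: int      # top-level namespace (e.g. 1603); assumed meaning
    theme: int     # thematic group (e.g. 9966); assumed meaning
    variant: int   # variant / sub-dataset (e.g. 1); assumed meaning

def parse_dataset_id(raw: str) -> DatasetId:
    base, theme, variant = (int(part) for part in raw.split("_"))
    return DatasetId(base, theme, variant)

print(parse_dataset_id("1603_9966_1"))
# DatasetId(base=1603, theme=9966, variant=1)
```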

BFO, RDF and how to package the datasets when in tabular format

For now, in addition to thinking about such numeric infixes, we're likely to think about what the final organization would look like in terms of Basic Formal Ontology. For the sake of RDF, and to maximize reuse for people who can't run centralized systems, this means designing in ways that work for both basic usage (likely just common tables as entrypoints) and advanced usage (triplestores, where the massive number of properties could easily blow past the column limits of PostgreSQL).

Likely one of the main features would be to offer one strongly suggested option of single inheritance which is still reasonable. Other hardcore ontologists could still reshape the data later, but it makes sense to have at least one way that works and deals with real data.
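The table-versus-triplestore packaging trade-off can be shown in a few lines (column and property names are illustrative assumptions):

```python
# Sketch: the same record kept two ways -- a wide row for "basic" table
# users, and entity/property/value triples for triplestore users, where
# the number of properties is unbounded instead of limited by columns.
wide_row = {"pcode": "MZ01", "P1082": 1000000, "P1539": 510000, "P1540": 490000}

triples = [
    (wide_row["pcode"], prop, value)
    for prop, value in wide_row.items()
    if prop != "pcode"
]
print(triples)
# [('MZ01', 'P1082', 1000000), ('MZ01', 'P1539', 510000), ('MZ01', 'P1540', 490000)]
```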

Non goals of to dos

However, note that while this point needs documentation, for things that would break reasoning even in graph format for the final user (most notably when ingesting data leads to the co-existence of contradictory facts, such as one fact stating that some administrative boundary is a sovereign country while this breaks other parts), we can simply not assume the user will ingest all data at the same time. Another point is when users ingest two or more models to explain the world (for example, different ways to express what biological sex is and what gender identity is) and this would break the reasoning because the models would try to re-classify facts in different ways.

I think this limitation of not trying to make the structural taxonomy deal with such advanced inconsistencies (justified by the need to leave users to decide what is not trivial to decide) can actually simplify the overall structure. And, this is important, end users (or other ontologists) will be less likely to complain about the structural taxonomy being less flexible, because the limitations would be directed at things already likely not to make sense to anyone who is already opinionated about the world. This approach would:

fititnt commented 2 years ago

Hum... the first early attempt, while it does store the statistical data as RDF, has this issue:

[Screenshot from 2022-07-30 12-22-45]
As expected, without any additional step compared to what we're doing, it will store the statistical data (great) but would lose the reference to... the years. For things like population (P1082) (meaning total population), female population (P1539), male population (P1540), urban, rural, and households, we can have shared verbs. However, for things which are very structured, like dates and maybe other variants, we need to try to optimize how to add sufficient metadata to keep the context.

Either the way https://sdmx.org/ does it, or (more likely what we would do) the way Wikidata does it with Qualifiers https://www.wikidata.org/wiki/Help:Qualifiers, could be our goal. However, even Wikidata tends not to have massive amounts of data already as well structured as what we use in tabular formats.
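A minimal sketch of the Wikidata-qualifier shape, assuming an illustrative subject URN: instead of one direct triple, a statement node carries the value (ps:) plus a "point in time" qualifier (pq:P585), so the year is not lost:

```python
# Sketch: attach a Wikidata-style qualifier to a population statement
# so the date context survives the conversion to RDF.
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

P = Namespace("http://www.wikidata.org/prop/")
PS = Namespace("http://www.wikidata.org/prop/statement/")
PQ = Namespace("http://www.wikidata.org/prop/qualifier/")

g = Graph()
place = URIRef("urn:example:pcode:MZ01")  # illustrative URN, placeholder value below

statement = BNode()
g.add((place, P["P1082"], statement))  # population (P1082) claim
g.add((statement, PS["P1082"], Literal(1000000, datatype=XSD.integer)))
g.add((statement, PQ["P585"], Literal("2022-01-01", datatype=XSD.date)))  # point in time

print(g.serialize(format="turtle"))
```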

To do

While it might need future review (likely aimed at making the data easier to query), in the short term we can just adopt any strategy that at least stores the reference to the date of the statistics, so that we can at least say that the data in .no1.tm.hxl.csv and .no1.owl.ttl are equivalent.

fititnt commented 2 years ago

Hummm. The drafted 1603_9966_1 (from the World Bank) is great: it has population by year and other classifications (such as gender-or-sex, urban/rural, etc.), but for most cases we also need something far simpler (only population statistics from some recent year).

Eventually a more focused version (without the series by year) should also be available. Maybe we place it with the other variables.
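Deriving that "far simpler" dataset from the per-year series could be as small as the sketch below (column names and values are illustrative placeholders, not the 1603_9966_1 schema):

```python
# Sketch: keep only the most recent year per place from a per-year series.
import pandas as pd

series = pd.DataFrame(
    {
        "pcode": ["MZ", "MZ", "BR", "BR"],
        "year": [2020, 2021, 2020, 2021],
        "population": [1000, 1010, 2000, 2020],  # placeholder values
    }
)

# After sorting by year ascending, the last row per group is the latest year.
latest = series.sort_values("year").groupby("pcode", as_index=False).last()
print(latest)
```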