EticaAI / lexicographi-sine-finibus

Lexicographī sine fīnibus

New data warehouse strategy (graph): RDF/SPARQL graph database populated with dictionaries data (experimental feature) #41

Open fititnt opened 2 years ago

fititnt commented 2 years ago

The way we organize the dictionaries' entry point has been very regular for some time already, and after #38, there's no reason not to start doing practical tests.

To-dos of this minimum viable product

Export a format that is easy to parse

While #38 could be used to import into some graph database, it's not optimized for speed. So it's better to export at least one format that is easier and more compact to parse than the alternatives intended to be edited by hand.

Do actual tests on one or more graph databases

While the SQLite from #37 is quite useful for quick debugging, we would need at least one or two tests actually importing into some graph database.

We also need to take into account ways to potentially allow validation/integrity tests of the entire library as soon as it is in a graph database. It would be easier to do them that way.

fititnt commented 2 years ago

We already have a proof of concept of converting specially crafted BCP47 language tags (similar to the HXL we've been using, to the point that there is a mapping between both), so this will eventually be equally usable or actually much, much more usable than the tabular alternative of #37.

@TODO assume BCP47 -r- extension actually mimics RDF-Star

At this moment, every -g- part is a pair. It is quite easy to split the parts by "-". However, sometimes we need to push even more information to know what to do. And this is starting to become the common case, not the exception.

I think the best approach would be to assume that the BCP47 tabular heading versions use 3 parts by default instead of 2, and then, if the last part is not necessary, we can use 0 to represent "self".
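
As a rough illustration of the splitting step only (a minimal Python sketch; how the resulting parts then group into pairs or triples, with 0 meaning "self", is exactly the open question above):

# Minimal sketch: isolate the extension that follows "-r-" in a header such as
# the ones in the example below, and cut it by "-". Grouping these parts into
# pairs or triples is the open design question, not something decided here.
def split_r_extension(header: str):
    _language, _, extension = header.partition('-r-')
    return extension.split('-')

print(split_r_extension('qcc-Zxxx-r-pSKOS-prefLabel-sS-s1'))
# ['pSKOS', 'prefLabel', 'sS', 's1']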

Current example (without mimicking RDF-Star)

This example does not implement all the correct semantics. Also, the "bags" are not really different at all (the idea of having different groupings is actually relevant only when tables contain different data at different levels, like the COD-ABs, which have multiple concepts, but whose endpoints would break them into different tables).

Also, these converters can get pretty, pretty complicated. In fact, to avoid creating full inference ourselves in Python, for every bag in a tabular format we export different triples, and then join them with something like Apache Jena (riot).
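
As a minimal Python sketch of that join step (the comment above mentions Apache Jena riot for this; rdflib and the file names here are placeholders used only to illustrate the idea):

# Minimal sketch of the "join" step: parse every partial export (one per bag)
# into a single graph and serialize it again. Apache Jena (riot) is what the
# comment above suggests; rdflib and the file names are placeholders.
from rdflib import Graph

graph = Graph()
for part in ('bag-1.ttl', 'bag-2.ttl'):
    graph.parse(part, format='turtle')
graph.serialize(destination='merged.ttl', format='turtle')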

unesco-thesaurus.bcp47g.tsv

qcc-Zxxx-r-sU2200-s1    qcc-Zxxx-r-sU2203-s2-yCSVWseparator-u007c-yPREFIX-unescothes    qcc-Zxxx-r-pSKOS-broader-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes  qcc-Zxxx-r-pSKOS-narrower-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes qcc-Zxxx-r-pSKOS-related-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes  rus-Cyrl-r-pSKOS-prefLabel-sS-s1    arb-Arab-r-pSKOS-prefLabel-sS-s1    spa-Latn-r-pSKOS-prefLabel-sS-s1    qcc-Zxxx-r-pDCT-modified-txsd-datetime-sS-s1
1603:999:9  concept9            concept10   Политика в области образования  سياسة تربوية    Política educacional    2019-12-15T22:36:40Z
1603:999:10 concept10       concept4938|concept7597 concept9    Право на образование    حق في التعليم   Derecho a la educación  2019-12-15T13:26:49Z
1603:999:4938   concept4938 concept10       concept10   Возможности получения образования   فرص تربوية  Oportunidades educacionales 2019-12-15T22:36:42Z

unesco-thesaurus.rdf.ttl

Not a good example; not because of the tools, but because of the input data.

@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix wdata: <http://www.wikidata.org/wiki/Special:EntityData/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix unescothes: <http://vocabularies.unesco.org/thesaurus/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

unescothes:concept10  skos:related  unescothes:concept9 ;
        skos:narrower  unescothes:concept7597 ;
        skos:narrower  unescothes:concept4938 ;
        rdf:type       rdfs:Class .

unescothes:concept9  skos:related  unescothes:concept10 ;
        rdf:type      rdfs:Class .

unescothes:concept4938
        skos:related  unescothes:concept10 ;
        skos:broader  unescothes:concept10 ;
        rdf:type      rdfs:Class .

<urn:1603:999:9>  skos:prefLabel  "سياسة تربوية"@arb-Arab ;
        skos:prefLabel  "Политика в области образования"@rus-Cyrl ;
        skos:prefLabel  "Política educacional"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T22:36:40Z" .

<urn:1603:999:10>  skos:prefLabel  "حق في التعليم"@arb-Arab ;
        skos:prefLabel  "Право на образование"@rus-Cyrl ;
        skos:prefLabel  "Derecho a la educación"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T13:26:49Z" .

<urn:1603:999:4938>  skos:prefLabel  "فرص تربوية"@arb-Arab ;
        skos:prefLabel  "Возможности получения образования"@rus-Cyrl ;
        skos:prefLabel  "Oportunidades educacionales"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T22:36:42Z" .
fititnt commented 2 years ago

Great. Really great 😍. A rudimentary version of BCP47 (and HXL) to OWL 2 works.

However, whoever would use OWL very likely has different goals than whoever would ask for SKOS (#38), which is where we're merging up to ~200 term translations. While it is possible to work at a low level with CLIs and code, the friendly user interfaces for OWL also require a lot of RAM, so if we do not organize the files better we could scare away the best user audience (which might not have powerful computers at all).

To-dos

In addition to reviewing how to organize data warehousing with the tabular format on #37 (which, by the way, will very likely become more intuitive to organize, as the graph format will let us see the big picture all the time), we need to divide the graph-format data dumps into at least two formats. For the sake of simplicity, we might still always save as Turtle, but the content would actually be different.

1. SKOS (linguistic content; and "all content" if no structured OWL version)

We obviously can continue generating the SKOS version, as we're doing (but the generator may be refactored). The way Numerordĭnātĭo works already allows an implicit, very rudimentary taxonomy.
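
As a rough illustration of that implicit hierarchy (a Python sketch; treating "parent = code minus its last segment" as the rule is an assumption made here for illustration, not a documented Numerordĭnātĭo convention):

# Minimal sketch: read an implicit "broader" code out of a Numerordinatio-style
# identifier by dropping its last segment. Treating the numeric path this way
# is an assumption for illustration only.
def implicit_broader(code: str):
    parts = code.split(':')
    return ':'.join(parts[:-1]) if len(parts) > 1 else None

print(implicit_broader('1603:999:10'))  # '1603:999'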

Also, compared to OWL, SKOS becomes so simple that there is no reason not to create it even for datasets for which we would have no formal ontological organization at all (which might become rare cases).

2. OWL (codes, assertions, things relevant to computation)

It makes sense for the OWL exports to focus 100% (even without complex assertions) on interlingual codes to link the "things" (the individuals/particulars). If we need to export some labels, we can, but keep them to a minimum. For example, for the subset of concepts related to places, we're likely to add to OWL only the name of the place in the natural language used in that place.

It would always be possible to query by codes, but this also makes it friendly for whoever is actually working on their own region (so, likely to actually know the natural language).

TODO: this requires changing the CLIs to exclude generated RDF triples by prefix. The tabular format would still contain references to every possible export, but we still need to restrict it on the CLI that outputs the formats.
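
A rough sketch of that exclusion step (assuming the CLI holds the full graph in memory with rdflib; using SKOS as the namespace to strip is only an example of excluding "by prefix"):

# Minimal sketch: drop every triple whose predicate falls under a given
# namespace (here SKOS, only as an example) before serializing the
# computation-focused OWL dump.
from rdflib import Graph
from rdflib.namespace import SKOS

def exclude_by_prefix(graph: Graph, prefix: str = str(SKOS)) -> Graph:
    for triple in list(graph):
        if str(triple[1]).startswith(prefix):
            graph.remove(triple)
    return graph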

3. Organize/test the use case of users "merging" the OWL and SKOS

Very likely the users (even if at first just out of curiosity) may want to "load everything", and if we do compile things on a regular basis, they will. So, it makes sense to at least plan ahead and allow this to happen in a way that, if users do it, they at least do not break the reasoning.

Something we might not document at all is the fact that while user interfaces allow checking the translations, tools that use OWL consider this linguistic content mere metadata: in other words, the user might mix the OWL and the SKOS version (with a huge number of translations) and will eventually discover on their own that there is nothing to infer from this metadata at all. Loading everything could help with nice screenshots (and maybe with testing things), but for something more serious, I think users focused on performance would already want a version with relevant computable data only.

Why this approach? As soon as users learn something, it is better that they use their RAM to load more data in more domains.

4. OWL "properties" and terms that organize the rest of the data must be optimized by default for several languages

One exception to the idea of dividing computable data vs. labels is... the properties and information used to label the rest of the entire library of everything. This is different from the data itself (even if data could become RDF predicates to "document other things") because the set of things we use to document other things grows much, much more slowly.

For example, even Wikidata still has around 10,000 properties (https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all) despite having at the moment over 98,001,177 items (https://www.wikidata.org/wiki/Wikidata:Statistics). Even if we make some shortcut to a namespace that could represent Wikidata Ps (though we would be unlikely to load them by default), the set of RDF predicates used to document other data would be far, far smaller.

So, for things that are so frequent in the data (even if the user would not load that data at first), it would make sense both to have them on some OWL entry point and to ship them already with all possible translations. The additional RAM usage would not really matter much.

fititnt commented 2 years ago

Okay. We managed to automate how users can load the tabular data from CSVs into SQL databases (description here https://github.com/EticaAI/lexicographi-sine-finibus/issues/37#issuecomment-1170856326).

We already export RDF triples in a more linguistics-focused version and in others optimized for computational use. However, for cases where a user is planning to work with a massive amount of data, graphical interfaces such as Protégé may not scale. This obviously needs more testing.

Why SQL storage might be relevant here

It turns out that it is feasible to use R2RML (https://www.w3.org/TR/r2rml/) or similar compatible tools such as Ontop (https://ontop-vkg.org/) to map the SQL data to RDF.

To allow such a feature, what would be necessary is generating files such as the R2RML mappings and making sure users have data that perfectly matches the configuration. The beauty here is that we can keep user documentation to a bare minimum while allowing state-of-the-art uses, and (this is important) interfaces such as the one in Protégé can be in the user's language.
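
A hedged sketch of what one generated mapping could look like, built here with rdflib (the table name, column name and IRI template are hypothetical placeholders, not the actual schema produced for #37):

# Minimal sketch: emit one R2RML triples map with rdflib. Table name, column
# name and IRI template are hypothetical placeholders.
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import SKOS

RR = Namespace('http://www.w3.org/ns/r2rml#')

g = Graph()
g.bind('rr', RR)

tmap = URIRef('urn:example:TriplesMap1')
logical_table, subject_map, pom, object_map = BNode(), BNode(), BNode(), BNode()

g.add((tmap, RR.logicalTable, logical_table))
g.add((logical_table, RR.tableName, Literal('concepts')))          # placeholder table
g.add((tmap, RR.subjectMap, subject_map))
g.add((subject_map, RR.template, Literal('urn:example:{code}')))   # placeholder template
g.add((tmap, RR.predicateObjectMap, pom))
g.add((pom, RR.predicate, SKOS.prefLabel))
g.add((pom, RR.objectMap, object_map))
g.add((object_map, RR.column, Literal('label_spa_latn')))          # placeholder column

print(g.serialize(format='turtle'))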

Most of the underlying details on how to eventually reach this point are very, very boring to explain even for advanced users who aren't familiar with ontology engineering, which brings us to...

Basic Formal Ontology path as foundational ontology

In any case, for future readers: we're going down the Basic Formal Ontology path as our upper ontology, which is the one already most used and by far the most referenced in the sciences. At the same time, the decision making is not trivial: it doesn't tolerate abstractions/vagueness such as "concept" or "agent". However, the end result is culturally less likely to have divergences at all; BFO is very, very realist.

But what about references to Wikidata (and maybe others not on the OBO Foundry)?

This might change later, but at least for non-structural content (for example, we use Wikidata P297, https://www.wikidata.org/wiki/Special:EntityData/P297.ttl , https://www.wikidata.org/wiki/Property:P297 , to mention that something is what HXL calls +v_iso2) we may use other namespaces.

However, the default choices to organize "the skeleton" would still be BFO, which means data integration would be less painful, as we really avoid patterns which would break reasoning. Again, the details about this would be boring, but in practice this means that the way every introductory course on "how to use Protégé" categorizes things (such as instances of other classes) is what we cannot do in ontologies designed to be used in production by groups which would disagree with each other.
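
For the non-structural references mentioned above, the exported triples could look roughly like this (a minimal Python sketch; the subject urn and the "BR" literal are placeholders, while wdt:P297 is the real Wikidata "ISO 3166-1 alpha-2 code" property):

# Minimal sketch: reference the Wikidata property P297 (ISO 3166-1 alpha-2
# code, the equivalent of HXL +v_iso2) from an exported graph. The subject
# urn and the "BR" value are placeholders for illustration only.
from rdflib import Graph, Literal, Namespace, URIRef

WDT = Namespace('http://www.wikidata.org/prop/direct/')

g = Graph()
g.add((URIRef('urn:example:some-place'), WDT.P297, Literal('BR')))
print(g.serialize(format='turtle'))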

fititnt commented 2 years ago

(Screenshot from 2022-08-02 03-43-25)

fititnt commented 2 years ago

Naming things is so hard.

Anyway, most of these groups will have some temporary number, starting with 99. This already allows drafting other tests. The "16039966" (`@1603{SPOP}()`) becomes a prefix for what Basic Formal Ontology calls an "object aggregate" (http://purl.obolibrary.org/obo/BFO_0000027).
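
In RDF terms, that could be wired up roughly as below (a minimal Python sketch; the urn for the group is a made-up placeholder, only the OBO IRI for BFO_0000027 is real, and whether the group should end up as a subclass or an instance of it is a modelling decision not settled here):

# Minimal sketch: attach a drafted group to the BFO "object aggregate" class
# (BFO_0000027). The urn is a made-up placeholder; subclass vs. instance is
# left open here.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDFS

OBO = Namespace('http://purl.obolibrary.org/obo/')

g = Graph()
group = URIRef('urn:example:16039966')  # placeholder for the @1603{SPOP}() group
g.add((group, RDFS.subClassOf, OBO.BFO_0000027))
print(g.serialize(format='turtle'))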

While the idea is to be minimalist, at least things that are fully (such as population statistics without further divisions, but allowing for year and place) need to have an entire base @1603_{POP}().

Draft of organization

Why this organization

At the moment, I am thinking in terms of ontology (as per BFO). The lower number of categories also makes it less likely that people will put data in other places.

Note that databases could have some sort of suffixes (likely to cope with more than 1000 columns), but the way it would work would force the user to put related things together.

Question: but then HOW to categorize further?

By properties.

For example, @1603_{SPOP}() as of now is explained by properties and qualifiers. It's similar, but not identical, to https://www.wikidata.org/wiki/Help:Qualifiers.

Medium-term impact on tooling / automation

There are other groups of information to organize (and @1603_{QLTY}() is a big one). But if these are defaults most people would have no reason to move away from and, because of the nature of how BFO works, the end result could also allow automated testing or inferences even for data which is not strictly well documented (simply because it is organized in some way, without even needing to be fully explained), then this allows a lot of time saving.

It is likely to be easier to explain this by going further and doing it also for organizations (which would also need attributes to explain what they are), after the qualities like @1603_{QLTY}().