EticaAI / lexicographi-sine-finibus

Lexicographī sine fīnibus

New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature) #37

Open fititnt opened 2 years ago

fititnt commented 2 years ago

Currently known context at the moment

  • The #36, if followed strictly, would allow creating a package importable into some database. But if we did that, it would require duplicating more CSVs on each focused base of dictionaries
  • The #35, from frictionless, has an experimental feature (just did a quick test, and it somewhat works) which allows writing a populated SQLite database from a datapackage.json (see the sketch just after this list)
  • The entire 1603 is already designed to be friendly to users having everything as a local copy
    • Unlike the generic datasets most data portals ingest, the dictionaries we produce are very structured
    • Because we use 1603 as a global prefix, if the dictionaries are already in a database, users could use other global prefixes to ingest actual data and then use SQL to manipulate/transform real-world data (an alternative to working with the CSVs directly)
  • Given the way we already structured the dictionaries, some from [1603:1] are already required to generate each Cōdex. They already somewhat have an implicit schema, but the CLIs can work with plain text (the CSVs)
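
As a rough illustration of that direction, here is a minimal sketch (not the project's actual implementation; the file names and the table-naming rule are assumptions) of ingesting the CSV resources declared in a datapackage.json into a local SQLite file, using only the Python standard library:

#!/usr/bin/env python3
# Minimal sketch: load every CSV resource declared in a datapackage.json
# into tables of a local SQLite file. File names and the table-naming rule
# below are assumptions, not the project's convention.
import csv
import json
import sqlite3

with open('apothecae.datapackage.json') as file:
    package = json.load(file)

connection = sqlite3.connect('apothecae.sqlite')
for resource in package.get('resources', []):
    table = resource['name'].replace('.', '_').replace('-', '_')
    with open(resource['path'], newline='') as source:
        rows = csv.reader(source)
        header = next(rows)
        columns = ', '.join('"{0}"'.format(column) for column in header)
        placeholders = ', '.join(['?'] * len(header))
        connection.execute('CREATE TABLE IF NOT EXISTS "{0}" ({1})'.format(table, columns))
        connection.executemany(
            'INSERT INTO "{0}" VALUES ({1})'.format(table, placeholders), rows)
connection.commit()
connection.close()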

Idea of this issue

TODO: Experimental CLI feature to bootstrap a database from selected dictionaries (...somewhat equivalent to bootstrapping a data warehouse)

It does not make sense to pre-generate binary databases for end users; that would be somewhat of a waste of space. Also, users could be more interested in some dictionaries than in others, so even a near-single global database would be too big, would potentially be in an inconsistent state from time to time, and would obviously make compilation times absurdly long.

However, sooner or later people (or at least we, for our internal use) may want to ingest everything of interest into some relational database. In fact, this would be a side effect of better data formats for explaining the datasets, such as frictionless or W3C Tabular Data.

However, we can cut a lot of time (and quite a lot of pain, like commands to re-ingest dictionaries one by one) by simply allowing, even if via the experimental features of frictionless data, an already optimized way to create the full database with pre-selected groups of dictionaries. This would also be more aligned with the philosophy of automating what would otherwise take more documentation AND could help give a better overview of the datasets without going through them one by one.

Other comments

The common use case here assumes that the data related to dictionaries can be re-bootstrapped and that, when finished, no more writes would occur (at least not on the reference tables). So SQLite would be a perfect fit (even for production use and huge databases, as long as no concurrent writes are necessary). However, PostgreSQL (or whatever the user wants to convert the SQLite file into) would be another alternative.
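
A small sketch of that usage pattern (the file name is an assumption): once bootstrapped, the SQLite file can be opened read-only, so accidental writes to the reference tables are impossible:

#!/usr/bin/env python3
# Sketch: after the bootstrap finishes, open the SQLite file read-only
# (URI syntax), so no writes can happen to the reference tables.
# The file name is an assumption.
import sqlite3

connection = sqlite3.connect('file:apothecae.sqlite?mode=ro', uri=True)
for (name,) in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"):
    print(name)
connection.close()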

Open room for conventions to store Common Operational Datasets (at least COD-ABs)

While the dictionaries we're building have their index handcrafted (even if the terminology translations are compiled with software), the perfect first candidates to optimize so users can ingest them in a predictable way would be the CODs.

Note: in case we fetch data from other sources (such as @digital-guard), the actual use case here would be focused on live data, not archived data.

Before going to CODs, we need to optimize the dictionaries that explain them

To have a sane way to ingest data, we would first need the dictionaries from [1603:??] Geographia (create base numerospace) #31 to be done.

Our dictionaries can reuse other dictionaries (so things get better over time), and at least for concepts related to places, the number used to access the dictionary can actually encode the country.

fititnt commented 2 years ago

hummmmmmmmmmm

Screenshot from 2022-04-29 23-45-23

fititnt commented 2 years ago

The CLI we use to query the main 1603_1_1.py (and to glue other operations together, including exporting the Cōdex) is, not surprisingly, getting a lot of options. For the sake of future reference, I will leave here what it looks like at this moment.

Since some options like --codex-de are shared, I think it is better to create options that state explicitly which operation is in active use instead of implicitly assuming it from the most common uses.
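
A sketch of that direction with argparse (the --methodus flag matches what later commands in this thread use; the list of choices shown here is just an assumption for illustration):

#!/usr/bin/env python3
# Sketch: an explicit operation selector instead of guessing the operation
# from shared flags such as --codex-de. The flag name --methodus appears in
# later commands on this thread; the choices listed are assumptions.
import argparse

parser = argparse.ArgumentParser(prog='1603_1')
parser.add_argument(
    '--methodus',
    choices=['de-archivum', 'data-apothecae', 'codex', 'status-quo'],
    help='explicit operation to run, instead of inferring it from other options')
parser.add_argument('--codex-de', dest='codex_de')

arguments = parser.parse_args(['--methodus', 'data-apothecae'])
print(arguments.methodus)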

The full example:

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ ./999999999/0/1603_1.py --help
usage: 1603_1 [-h] [--punctum-separato-de-resultatum [RESULTATUM_SEPARATO]]
              [--punctum-separato-de-fontem [FONTEM_SEPARATO]] [--de-archivum]
              [--data-apothecae-ad [DATA_APOTHECAE_AD]] [--data-apothecae-ex DATA_APOTHECAE_EX]
              [--data-apothecae-ex-archivo DATA_APOTHECAE_EX_ARCHIVO]
              [--data-apothecae-formato [{datapackage,sqlite}]] [--dictionaria-numerordinatio]
              [--codex-de [CODEX_DE]] [--objectivum-linguam [OBJECTIVUM_LINGUAM]]
              [--auxilium-linguam AUXILIUM_LINGUAM] [--codex-copertae] [--codex-in-tabulam-json]
              [--status-quo] [--status-in-markdown] [--status-in-datapackage]
              [--ex-librario [EX_LIBRARIO]] [--ex-opere-temporibus [EX_OPERE_TEMPORIBUS]]
              [--quaero-ix_n1603ia [QUAERO_IX_N1603IA]]
              [--quaero-numerordinatio [QUAERO_NUMERORDINATIO]] [--in-limitem [IN_LIMITEM]]
              [--in-ordinem [{numerordinatio,chaos}]] [--objectivum-formatum-asciidoctor]
              [infile]

Explain the dictionaries

positional arguments:
  infile                HXL file to read (if omitted, use standard input).

optional arguments:
  -h, --help            show this help message and exit
  --punctum-separato-de-resultatum [RESULTATUM_SEPARATO]
                        Character(s) used as separator for generate output. Used only for tabular
                        results. Defaults to tab " "
  --punctum-separato-de-fontem [FONTEM_SEPARATO]
                        Character(s) used as separator from input file Used only for tabular
                        results. Defaults to comma ","

Archivum:
  (DEFAULT USE) Use archive as source (directory not ready yet)

  --de-archivum         Parse single archive

Data apothēcae:
  data apothēcae. (One) Warehouse of datasets. Compile selected dictionaries to a single place (likely single database entry point)

  --data-apothecae-ad [DATA_APOTHECAE_AD]
                        Path to file (or reference to database) to store result
  --data-apothecae-ex DATA_APOTHECAE_EX
                        Comma-separated list of dictionaries to initialize
  --data-apothecae-ex-archivo DATA_APOTHECAE_EX_ARCHIVO
                        Path to file with list (one item per line) of dictionaries to initialize
  --data-apothecae-formato [{datapackage,sqlite}]
                        Output format. Default will try make a guess from --data-apothecae-ad
                        pattern.

Dictionaria:
  Generate dictionaries. No input required (uses disk 1603 and 999999999/1603 data files)

  --dictionaria-numerordinatio
                        Dictionary of all possible values on stricter Numerordĭnātĭo (HXLStantad
                        container)
  --objectivum-formatum-asciidoctor
                        (Default) Output Asciidoctor format

Codex:
  Book/manual creation

  --codex-de [CODEX_DE]
                        Generate documentation of dictionaries
  --objectivum-linguam [OBJECTIVUM_LINGUAM]
                        Target natural language (use if not auto-detected). Must be like {ISO
                        639-3}-{ISO 15924}. Example: arb-Arab. Default: mul-Zyyy
  --auxilium-linguam AUXILIUM_LINGUAM
                        Define auxiliary languages Must be like {ISO 639-3}-{ISO 15924}. Example:
                        "ina-Latn,ile-Latn" Accepts multiple values.
  --codex-copertae      Pre-calculate the codex, but only generate Codex cover (SVG)
  --codex-in-tabulam-json
                        Pre-calculate the codex, but only generate Tabular Data (MediaWiki syntax 1)
                        (JSON). See https://www.mediawiki.org/wiki/Help:Tabular_Data

Status quō:
  Calculate current situation. Used to take other actions. Requires --codex-de 1603_NN_NN (focused Codex). Works with --quaero-ix_n1603ia.

  --status-quo          Compute the status quo, using a codex as initial reference
  --status-in-markdown  Return status in Markdown (instead of YAML)
  --status-in-datapackage
                        Return status in frictionless datapackage.json. With --ex-librario returns
                        profile data-package-catalog. (low level of details)
  --ex-librario [EX_LIBRARIO]
                        Status novō. New state. Persist changes if necessary

Opus temporibus:
  Crontab/cronjob information 

  --ex-opere-temporibus [EX_OPERE_TEMPORIBUS]
                        ex opere temporibus. Out of work times (crontab)
  --quaero-ix_n1603ia [QUAERO_IX_N1603IA]
                        Query ix_n1603ia. Rudimentar && (AND) and || (OR). Use var<1 to test 0 or
                        undefined. Query ix_n1603ia. Filter. Ex. "{publicum}>10 && {internale}<1"
  --quaero-numerordinatio [QUAERO_NUMERORDINATIO]
                        Query Numerordĭnātĭo. Additional filter list for focused base of
                        dictionaries. Ideal to check if some groups meet other filters. Example: if
                        result return empty and other queries are to check if need to fetch again
                        from Wikidata Q, then you assume no new fetch is necessary
  --in-limitem [IN_LIMITEM]
                        /Against the limit of/. Limit maximum number of cron jobs to show.
  --in-ordinem [{numerordinatio,chaos}]
                        /Against arrangement (ordering) of/. Sort result list to this rule. Options:
                        numerordinatio=natural order; chaos=random order

Exemplōrum gratiā:
    printf "#item+conceptum+codicem,#item+rem+i_qcc+is_zxxx+ix_wikiq" | ./999999999/0/1603_1.py --de-archivum
    cat 1603/1/1/1603_1_1.no1.tm.hxl.csv | ./999999999/0/1603_1.py --de-archivum
    ./999999999/0/1603_1.py --de-archivum 1603/1/1/1603_1_1.no1.tm.hxl.csv

    ./999999999/0/1603_1.py --dictionaria-numerordinatio

    ./999999999/0/1603_1.py --codex-de 1603_63_101

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --codex-copertae

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --codex-in-tabulam-json

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo --ex-librario="cdn"

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo --ex-librario="locale" --status-in-markdown

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --ex-opere-temporibus='cdn'

    ./999999999/0/1603_1.py --ex-opere-temporibus='cdn' --quaero-ix_n1603ia='({publicum}>=9)&&({victionarium_q}>9)'

    ./999999999/0/1603_1.py --data-apothecae-ex='1603_45_1,1603_45_31' --data-apothecae-ad='apothecae.datapackage.json'

    ./999999999/0/1603_1.py --data-apothecae-ex='1603_45_1,1603_45_31' --data-apothecae-ad='apothecae.sqlite'

Edited: text formatting.

fititnt commented 2 years ago

Oh f**k me.

Except for SQLite (which is quite flexible), PostgreSQL (and this likely happens with other databases too) is okay with # at the start and with +'s (e.g. it allows entire HXL hashtags), but the columns we're adding that allow RDF mappings without any additional metadata are already reaching over 100 characters (like #item+rem+i_qcc+is_zxxx+rdf_a_obo_bfo29+rdf_p_obo_bfo124_s5002+rdf_p_obo_bfo171_s5000+rdf_s_u2200_s5001), and we could need even more than this.

TODO

In addition to RDF with HXL and RDF with BCP 47 language tags, we will need... something which could still be compatible, but if a user were to ingest it into SQL databases, the new identifier should be predictable and reversible.

For the database format, we may need some additional way to store the relations of the pivots without this needing to become part of the header alone (as we do with the tabular format on CSV).
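
To make the length problem concrete, a small sketch (PostgreSQL's default identifier limit of 63 bytes is a documented fact; the positional short names and the lookup table are only an illustration, not a decided convention):

#!/usr/bin/env python3
# Sketch: PostgreSQL silently truncates identifiers longer than 63 bytes
# (default NAMEDATALEN - 1). One illustrative workaround (not a decided
# convention): keep the full HXL+RDF hashtag in a lookup table and give the
# real column a short, predictable, reversible name.
POSTGRESQL_MAX_IDENTIFIER_BYTES = 63

full_names = [
    '#item+rem+i_qcc+is_zxxx+rdf_a_obo_bfo29+rdf_p_obo_bfo124_s5002'
    '+rdf_p_obo_bfo171_s5000+rdf_s_u2200_s5001',
]

column_lookup = {}
for position, full_name in enumerate(full_names, start=1):
    print(len(full_name.encode('utf-8')), 'bytes:', full_name)
    short_name = 'c{0}'.format(position)  # positional, so it stays predictable
    assert len(short_name.encode('utf-8')) <= POSTGRESQL_MAX_IDENTIFIER_BYTES
    column_lookup[short_name] = full_name  # reversible via this side table

print(column_lookup)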

fititnt commented 2 years ago

By the way, the naming strategy for the tables is already very, very compact and predictable. The names may look weird to humans, but since most users would use better interfaces, the use of numbers is a non-issue.

However, the way the columns are encoded (at least outside SQLite or plain CSV) would need more fine-tuning. Anyway, we can still take more time before making the imports to PostgreSQL/MySQL (traditional databases) work.

Anyway, we're already looking into R2RML (https://www.w3.org/TR/r2rml/) or, more generally, Ontology-Based Data Access (OBDA), so deciding how to name the columns in the relational database format would naturally already take some time.

In most strategies, the R2RML is created by humans (even if with graphical interfaces), but we cannot do it that way. So the way we name the columns might also be done in a way that simplifies R2RML.
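
Since any R2RML here would have to be generated by software, a sketch of the idea (the table name, key column, and IRI template are placeholders, not the project's real convention), emitting a minimal rr:TriplesMap from a table/column naming rule:

#!/usr/bin/env python3
# Sketch: machine-generated R2RML (https://www.w3.org/TR/r2rml/) from a
# table/column naming convention. The table, columns and IRI template below
# are placeholders, not the project's real convention.
R2RML_TEMPLATE = """@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#TriplesMap_{table}>
    rr:logicalTable [ rr:tableName "{table}" ] ;
    rr:subjectMap [
        rr:template "urn:numerordinatio:{table}:{{{key}}}"
    ] ;
    rr:predicateObjectMap [
        rr:predicate <http://www.w3.org/2000/01/rdf-schema#label> ;
        rr:objectMap [ rr:column "{label}" ]
    ] .
"""

print(R2RML_TEMPLATE.format(table='1603_45_16', key='codicem', label='rem'))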

fititnt commented 2 years ago

First test done. The FKs are not optimized, but it is now viable to also load into some more standard SQL databases.

TL;DR:

1. Generate the datapackage (this step allows selecting what should be imported from the CSVs to other storage backends)


DATA_APOTHECAE_MINIMIS=1 ./999999999/0/1603_1.py --methodus='data-apothecae' --data-apothecae-ad-stdout --data-apothecae-formato='datapackage' --data-apothecae-ex-suffixis='no1.bcp47.csv' --data-apothecae-ex-praefixis='1603_45_16' > ./apothecae~1603_45_16.datapackage.json

1.1 SQLite (was already working)

SQLite is quite permissive in what it accepts. It was already allowing column names with over 60 characters (PostgreSQL and others don't).

./999999999/0/frictionless_to_sqlite.py --datapackage='apothecae~1603_45_16.datapackage.json' --sqlite='999999/0/apothecae~1603_45_16.sqlite'

Screenshot from 2022-06-30 03-34-30

1.2 PostgreSQL

Note: this requires PostgreSQL to already be installed (SQLite can work with file-based access).

./999999999/0/frictionless_to_postgresql.py --datapackage='apothecae~1603_45_16.datapackage.json' --postgresql='fititnt:pass@localhost/mdciii'

Screenshot from 2022-06-30 03-33-06
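
A quick way to confirm what landed in PostgreSQL (a sketch; the connection string just reuses the illustrative credentials from the command above, and SQLAlchemy is assumed to be available):

#!/usr/bin/env python3
# Sketch: list the tables created in PostgreSQL after the import above.
# The connection string reuses the illustrative credentials from the command.
from sqlalchemy import create_engine, inspect

engine = create_engine('postgresql://fititnt:pass@localhost/mdciii')
for table_name in sorted(inspect(engine).get_table_names()):
    print(table_name)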

1.3 Microsoft Excel (frictionless actually also accepts LibreOffice, but this is not implemented yet)

It turns out that Excel can load over 470 sheets just fine. The file on disk is 17.3 MB, but I think it uses over 470 MB of RAM via LibreOffice.

However, at least if the final intent is to use it from the command line afterwards, I would really recommend exporting every sheet to CSV or some other storage, because opening and closing it sheet by sheet via the command line requires a very, very significant startup time (the format on disk is a zip of XMLs, not as optimized as CSVs or SQLite would be).

time ./999999999/0/frictionless_to_excel.py --datapackage='apothecae~1603_45_16.datapackage.json' --excel='999999/0/apothecae~1603_45_16.xlsx'

# real  1m56,411s
# user  1m55,135s
# sys   0m1,042s

Screenshot from 2022-06-30 03-46-11
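
Along those lines, a sketch of exporting every sheet of the generated workbook to its own CSV once (pandas-based; the output directory is an assumption), so later command-line passes skip the zip-of-XMLs startup cost:

#!/usr/bin/env python3
# Sketch: export every sheet of the generated .xlsx to its own CSV once,
# so later command-line runs avoid re-parsing the zip-of-XMLs workbook.
# The output directory is an assumption.
import os
import pandas

sheets = pandas.read_excel(
    '999999/0/apothecae~1603_45_16.xlsx', sheet_name=None)  # dict: name -> DataFrame

os.makedirs('999999/0/csv', exist_ok=True)
for name, frame in sheets.items():
    frame.to_csv(os.path.join('999999/0/csv', name + '.csv'), index=False)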