EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

[meta issue] HXL and data directly from and to SQL databases #10

Open fititnt opened 3 years ago

fititnt commented 3 years ago

This issue is a draft. Some extra information may be edited or added later.

fititnt commented 3 years ago

TL;DR of this post: maybe the HXLMeta (Usable Class) #9 could also have a local database to use as a helper.

Maybe most databases accept HXL Hashtags without changes

Good thing: PostgreSQL actually accepts # as the first character of column names. (Also tested with MariaDB, so maybe others do too.)

Until last week I thought that only SQLite accepted almost anything as a column name, so I was concerned about what to use to replace # and + on almost every other database engine. But since current databases may actually allow both # and +, this simplifies things a lot!
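As a quick sanity check of the claim above, here is a minimal sketch using Python's built-in sqlite3 (the table name, column names, and data are made up for illustration) showing that double-quoting the identifiers keeps # and + intact:

```python
import sqlite3

# Hypothetical HXL-style hashtags used directly as column names.
# Double-quoting the identifiers (the SQL-standard quoting style)
# lets the "#" and "+" characters survive as-is.
columns = ['#adm1+code', '#adm1+name', '#population+total']

con = sqlite3.connect(":memory:")
col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
con.execute(f'CREATE TABLE hxl_example ({col_defs})')
con.execute(
    'INSERT INTO hxl_example ("#adm1+code", "#adm1+name", "#population+total") '
    "VALUES (?, ?, ?)",
    ("BR-SP", "São Paulo", "44000000"),
)
row = con.execute('SELECT "#adm1+code" FROM hxl_example').fetchone()
print(row[0])  # → BR-SP
```

PostgreSQL and MariaDB use the same double-quote syntax for identifiers (MariaDB with `ANSI_QUOTES` enabled, or backticks by default), so the same approach should carry over.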

Tests with csvsql

csvsql uses SQLAlchemy. The file is actually not that complex, so in the worst-case scenario we could just implement the same thing ourselves. (But as a reference: as expected, csvsql exporting from generic CSV files may be OK, but I have not tested yet whether the types would be more generic.)

BUT since on the HXLMeta issue we're already mapping more exact StorageTypes (and this is likely to take much more time to get right), I think that for exporting from HXLated datasets to the most common SQL databases we may not need something like SQLAlchemy at all.
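A minimal sketch of that SQLAlchemy-free direction (the function name is made up, and the TEXT-only typing is a placeholder: a real version would map the StorageTypes from HXLMeta to proper SQL types):

```python
import csv
import io

def hxl_csv_to_ddl(csv_text, table_name):
    """Build a CREATE TABLE statement straight from an HXLated CSV.

    An HXLated CSV has a human-readable header row followed by the
    HXL hashtag row; the hashtags become the column names.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)               # skip the human-readable header row
    hashtags = next(reader)    # e.g. ['#adm1+code', '#population+total']
    # TEXT is a placeholder type: a real implementation would pick the
    # SQL type from the StorageTypes inferred by HXLMeta.
    cols = ", ".join(f'"{tag}" TEXT' for tag in hashtags)
    return f'CREATE TABLE "{table_name}" ({cols});'

# Hypothetical two-row HXLated CSV for illustration.
sample = "Admin 1,Population\n#adm1+code,#population+total\nBR-SP,44000000\n"
ddl = hxl_csv_to_ddl(sample, "dataset")
print(ddl)
# → CREATE TABLE "dataset" ("#adm1+code" TEXT, "#population+total" TEXT);
```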

But for importing into HXL tools, some abstraction like SQLAlchemy (at least for Python HXL tools) is definitely worth looking at.

SQLite as a potential alternative to hold a local collection of taxonomies

In addition to country/territory codes (and without resorting to loading the entire P-Codes for local usage), there are some taxonomies (at least the one for language codes) that I think would eventually be useful to have close to the computer running complex inferences. On @HXL-CPLP we're already drafting taxonomies like the words used to represent true/false in different languages, so maybe some taxonomies could become important enough that the user could build their own cache. One good initial candidate could be booleans (using 2-letter ISO codes as namespace, something like +v_un_bool for the 6 UN languages and +v_eu_bool for a draft of 20+ European ones), plus some way for a person to "merge" more than one external source of reference.

I'm not fully sure this would really be necessary (and, in fact, for a few tables even a folder with plain HXLated CSVs would work). But for cases like the booleans, a single canonical table would not be ideal (if not because of the user, then because it would make on-the-fly implementation harder).

But anyway, in both cases (local SQLite or CSVs), something that could "build" a local database that persists across executions (and also works offline, so that even with heavy use in the worst-case scenario not even Google Drive could get rate limited or blocked) seems like a total win-win.