datasets-br / sql-unifier

Try all datasets in a single SQL table, plug-and-play!
3 stars 0 forks source link

Abandon use of SQL-rows representation, expand datasets to rows by demand #13

Closed ppKrauss closed 6 years ago

ppKrauss commented 6 years ago

(this issue was for v0.1.0, that change from v0.2+, so, use benchmarking of v0.1 as reference)

The benchmark, see Wiki, shows that

By other hand, there are some demand to add into dataset.big "any other free JSON" dataset.

Conclusion: the best is "1 dataset per row" at dataset.bit, a free-JSONb dataset with "some JSON-schema" controled by dataset.meta. We denominated it as "JSON TYPE DEFINITION", JTD, to be a set of labels indication standard structures. Only some JTDs are tabular: other need a rule to join/split into rows... The best is to stay all in "1 dataset per row".


Reference Benchmark of v0.1.0

15 datasets, load as "many rows per dataset". In terms of disk-usage, all the dataset-schema sums table_size 6080 kB. If all datasets of dataset.big was translated to usual SQL-tables, the custo will be less (!), ~75% or less.

id urn pkey lang n_cols n_rows
15 datasets-br:br_city_codes state/lexLabel pt 9 5570
14 datasets-br:br_city_synonyms state/lexLabel/synonym pt 5 26
... ... (see more at Wiki) ... ... ... ... ...
nspname n_tables total_bytes table_bytes table_size
dataset 4 8552448 6225920 6080 kB

Comparing JSON and JSONb

A test with less rows (before "datasets:world_cities"), but good results, for dataset.big total usage:

So no expressive advantage to use a JSON, with only ~2% of gain. Better stay with JSONb, a "first-class citizen".

Disk-usage reduction when all rows in one JSONb-array

Checking table_usage with "datasets-br:br_city_codes":

Conclusion: "all rows in one" is ~10 times better!

Checking performance

No loss of performance by PostgreSQL's EXPLAIN ANALYZE: as showed at Wiki, the function jsonb_array_elements() is very fast.

ppKrauss commented 6 years ago

Now, with v0.2+ it is changed. Becnhmarking same 15 datasets, but now as "tab-aoa" JTD,

id urn pkey jtd n_cols n_rows
1 (2)lexml:autoridade id tab-aoa 9 601
4 (2)lexml:evento id tab-aoa 9 14
... ... (see more at Wiki) ... ... ... ... ...

Disk-usage:

nspname n_tables total_bytes table_bytes table_size
dataset 4 1474560 73728 72 kB

Total disk-usage reduction, from 6080 kB to 72 kB, 8440%! (6080=72*84.4). No disk cost for datasets!