dathere / datapusher-plus

A standalone web service that pushes data into the CKAN Datastore fast & reliably. It pushes real good!

Optimized data type mapping to PostgreSQL data types (for speed/reduced storage/efficiency) #17

Closed jqnatividad closed 2 years ago

jqnatividad commented 2 years ago

Currently, all numeric fields (int and float) are mapped to PostgreSQL's numeric type.

https://www.postgresql.org/docs/current/datatype-numeric.html

This is very inefficient in terms of both storage and performance.

In DP+, we're inferring data types using qsv's stats function. We can guarantee correct data type inferences because we scan the whole file. Whilst scanning, we also compile descriptive statistics, e.g. from the qsv whirlwind tour:

```
$ qsv stats wcp.csv --everything | qsv table
field       type     sum                min           max         min_length  max_length  mean                stddev              variance            lower_fence         q1          q2_median   q3          iqr                upper_fence         skew                  mode         cardinality  nullcount
Country     String                      ad            zw          2           2                                                                                                                                                                                            ru           231          0
City        String                       al lusayli   Þykkvibaer  1           87                                                                                                                                                                                           san jose     2008182      0
AccentCity  String                       Al Lusayli   özlüce      1           87                                                                                                                                                                                           San Antonio  2031214      0
Region      String                      00            Z4          0           2                                                                       -29.5               5           11          28          23                 62.5                1.3036035769599401    04           392          4
Population  Integer  2290536125         7             31480498    0           8           48730.66387966977   308414.0418510231   95119221210.88461   -33018              3730.5      10879       28229.5     24499              64978               0.36819008290764255                28460        2652350
Latitude    Float    76585211.19776328  -54.9333333   82.483333   1           12          28.371681223643343  21.938373536960917  481.292233447227    -35.9076389         12.9552778  33.8666667  45.5305556  32.5752778         94.3934723          -0.7514210842155992   50.8         255133       0
Longitude   Float    75976506.66429423  -179.9833333  180         1           14          28.14618114715278   62.472858625866486  3902.8580648875004  -98.49166745000002  2.383333    26.8802778  69.6333333  67.25000030000001  170.50833375000002  0.060789759344963286  23.1         407568       0
```

With min/max, we can see whether int and float columns will fit into PostgreSQL's smallint/integer/bigint and real/double precision types respectively, and only fall back to numeric/decimal when a column would overflow the more efficient types.
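A minimal sketch of such a mapping in Python (the function name and fallback rules are illustrative assumptions, not existing DP+ code; the integer ranges are PostgreSQL's documented limits):

```python
# Hypothetical sketch: pick the narrowest PostgreSQL numeric type that
# can hold every value observed by qsv stats. PostgreSQL's documented
# ranges: smallint ±32767, integer ±2147483647, bigint ±9223372036854775807.
def postgres_numeric_type(qsv_type: str, min_val: float, max_val: float) -> str:
    if qsv_type == "Integer":
        if -32768 <= min_val and max_val <= 32767:
            return "smallint"
        if -2147483648 <= min_val and max_val <= 2147483647:
            return "integer"
        if -9223372036854775808 <= min_val and max_val <= 9223372036854775807:
            return "bigint"
        return "numeric"  # arbitrary precision, but slower and larger
    if qsv_type == "Float":
        # real (4 bytes) keeps ~6 significant digits; double precision
        # (8 bytes) ~15. Without a precision guarantee from stats,
        # double precision is the safe default.
        return "double precision"
    return "numeric"
```

Applied to the stats above, the Population column (min 7, max 31480498) would map to integer rather than numeric.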

Having min/max also allows us to explore PostgreSQL's range types.

With min length/max length, we can probably use character varying as well (though, per the Postgres documentation, there is really no performance benefit to avoiding text; but for short strings with a low max_length, character varying could enforce the inferred schema constraints at the DB level).
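A sketch of what that string mapping could look like (the varchar_threshold knob is an assumed tuning parameter, not an existing setting):

```python
# Hypothetical sketch: map string columns using max_length from qsv stats.
# Per the PostgreSQL docs there is no performance advantage to varchar(n)
# over text; the length limit here acts purely as a schema-level constraint.
def postgres_text_type(max_length: int, varchar_threshold: int = 255) -> str:
    if 0 < max_length <= varchar_threshold:
        return f"character varying({max_length})"
    return "text"
```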

With cardinality, we can even automatically index certain columns based on simple rules - e.g., cardinality = rowcount implies a unique index; cardinality below a certain threshold warrants creating an index to speed up datastore_search_sql queries (see #30).
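A hedged sketch of such rules, emitting index DDL from the stats (the function name and threshold are assumptions for illustration):

```python
# Hypothetical sketch: derive index DDL from a column's cardinality.
# low_card_threshold is an assumed tuning knob, not an existing DP+ setting.
def index_ddl(table: str, column: str, cardinality: int, rowcount: int,
              low_card_threshold: int = 1000) -> str | None:
    if cardinality == rowcount:
        # every value is distinct -> candidate for a unique index
        return f'CREATE UNIQUE INDEX ON "{table}" ("{column}");'
    if cardinality <= low_card_threshold:
        # low-cardinality column -> plain btree index to speed up
        # datastore_search_sql filters
        return f'CREATE INDEX ON "{table}" ("{column}");'
    return None  # high, non-unique cardinality: leave unindexed
```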

With qsv frequency, we can also compile frequency tables and perhaps even explore exploiting PostgreSQL's enumerated types when a column's frequency table holds only a few values under a certain threshold.
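For example, a sketch of deriving an enumerated type from a frequency table (the threshold and type-naming scheme are assumptions):

```python
# Hypothetical sketch: if a column's frequency table has only a handful of
# distinct values, emit a CREATE TYPE ... AS ENUM statement instead of
# storing the column as text.
def enum_ddl(table: str, column: str, freq_values: list[str],
             enum_threshold: int = 10) -> str | None:
    if 0 < len(freq_values) <= enum_threshold:
        # escape embedded single quotes per SQL string-literal rules
        labels = ", ".join("'" + v.replace("'", "''") + "'" for v in freq_values)
        return f"CREATE TYPE {table}_{column}_enum AS ENUM ({labels});"
    return None  # too many distinct values: not a good enum candidate
```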

Doing more efficient data type mapping will make the datastore more performant and space efficient, allowing it to support faster datastore_search_sql queries. It would take the Datastore beyond what is currently, IMHO, a de facto "FTS-enabled tabular blob store", and be a big step in taking CKAN beyond metadata catalog use cases to being an enterprise datastore with "data lake"-like capabilities.