Document the database design iterations

asdofindia commented 4 years ago

From internal discussions, a database design has been made that is at a high level like this:

Postgres tables for everything.

1 table for indicators. 1 table for datasets metadata. 1 table for entities. N table for N datasets that represent data spreadsheets in their row form.

This will need iterations

This supersedes issue #9

The idea is to build elasticsearch indexes on top of this multi-table data to enable efficient querying.

asdofindia commented 4 years ago

A few issues that have come up.

Problem: Postgres has limits on column numbers, size, etc and the AHS spreadsheet actually hits this.

Mitigation: Backup CSVs as files on the file system when postgres cannot hold them.

Many more tables have been created.

Problem: Hierarchical nature of geographies (districts inside state, taluqs inside state, etc) need an efficient way of storage.

Mitigation 1: Store the hierarchy in Postgres. But use embedded graph databases when making hierarchical queries.

Mitigation 2: Do not allow arbitrary hierarchy. Limit them to defined hierarchy which is recorded in multiple extra columns.

asdofindia commented 4 years ago

The technical name for the database design we have arrived at now is "star schema". It is a common schema design used in data warehousing.

It is not a snowflake schema because we're storing the dimensions like "Gender", "Settlement (urban/rural)", and "time period" denormalized in the fact table itself.

Further reading:

Oracle database data warehousing guide

asdofindia commented 4 years ago

Another set of issues have come up when ingesting hospital capacity data at page 270 of National Health Profile 2019.

The question is about extensibility of dimensions when new datasets like this come up. Imagine a value called "Female" "OP visits" in "department of orthopaedics" in "Andaman Hospital" in "2018-March".

We already have dimensions for gender and time which covers Female and 2018-March. Andaman Hospital can be considered an entity. Now, the two remaining fields are "OP visits" and "Department of Orthopaedics".

We could make it an indicator like "OP visits (Orthopaedics)", "IP visits (Orthopaedics)", "OP visits (Medicine)", "IP visits (Medicine)" and so on.

But if we have OP/IP as a dimension and department as a dimension, we will be able to just use "visits" as the indicator.

Therefore, two new dimensions will have to be made - "OP/IP" and "Department".

This is just an example. There could be other datasets where the need for dimensions is more obvious.

In such cases, creating a dimension dynamically is difficult in postgresql/RDBMS because dimensions are put in columns and dynamically creating columns in RDBMS is hard and non-idiomatic.¹

asdofindia commented 4 years ago

This document has some more insights on how table data can be represented in document format.

asdofindia commented 4 years ago

Issue #14 contains initial discussions on elasticsearch mapping, etc. Should pull details from there to this issue for making this issue complete.

Metastring / HealthHeatMap

Document the database design iterations #13