Metastring / HealthHeatMap

Supporting Upload Of Arbitrary Data #23

Open asdofindia opened 4 years ago

asdofindia commented 4 years ago

A lot of thought is required about how we should allow arbitrary datasets to be merged into well-curated datasets.

Curated data is better for the following reasons:

  1. We know what data we are working with. This allows:
    • Detecting entities that already exist in the data and merging their existing data points with the new ones, which increases the data available for each entity.
    • Validating values.
    • Supplying missing values with a meaningful substitute.
  2. Some columns may be better expressed as a dimension and others as an indicator. The more complicated the data is, the more this matters for crafting useful queries.
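
As a rough sketch of the three benefits under point 1 (matching entities, validating values, and filling missing values), assume a hypothetical in-memory store of curated values keyed by entity and indicator. Every name here (`curated`, `normalize_entity`, `is_valid`, `merge_upload`) and the 0 to 1000 valid range are illustrative assumptions, not part of HealthHeatMap. Using the median as the substitute is just one possible choice.

```python
from statistics import median


def normalize_entity(name):
    """Collapse whitespace and case so spelling variants map to one key."""
    return " ".join(name.split()).title()


def is_valid(value, low=0.0, high=1000.0):
    """Accept only non-missing values inside an assumed plausible range."""
    return value is not None and low <= value <= high


def merge_upload(curated, rows):
    """Merge (entity, indicator, value) rows into curated; return rejected rows."""
    rejected = []
    for entity, indicator, value in rows:
        key = (normalize_entity(entity), indicator)
        if value is None and curated.get(key):
            # Substitute a missing value with the median of known values.
            value = median(curated[key])
        if not is_valid(value):
            rejected.append((entity, indicator, value))
            continue
        # Same entity as an existing one: extend its data points.
        curated.setdefault(key, []).append(value)
    return rejected


# Hypothetical data, for illustration only.
curated = {("Karnataka", "imr"): [25.0, 24.0]}
upload = [
    ("  karnataka ", "imr", 23.0),  # same entity, different spelling
    ("KARNATAKA", "imr", None),     # missing value
    ("Kerala", "imr", -5.0),        # out of range
]
rejected = merge_upload(curated, upload)
```

None of this works unless we know in advance which column identifies the entity and what range is plausible for the indicator, which is exactly what curation provides.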

Arbitrary data is useful for the following reasons:

  1. Quick ingestion of a new dataset.
  2. Flexibility.

The roadblocks to allowing arbitrary data are:

  1. Data that introduces new dimensions will need new columns to be created in PostgreSQL. This is not idiomatic in the SQL/RDBMS world. One mitigation strategy may be to switch to NoSQL.
  2. Arbitrariness brings in unpredictability and makes everything complicated.
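
To picture both roadblocks, here is a minimal sketch of a schemaless, document-style representation in which each data point carries its dimensions as a free-form mapping rather than as fixed columns. `DataPoint`, `query`, and `known_dimensions` are hypothetical names invented for this illustration; this is not how HealthHeatMap stores data, and it is not a specific NoSQL product's API.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    indicator: str
    value: float
    # Free-form dimensions: a new dimension needs no schema change.
    dimensions: dict = field(default_factory=dict)


def query(points, indicator, **dims):
    """Return points for an indicator whose dimensions match every filter."""
    return [
        p for p in points
        if p.indicator == indicator
        and all(p.dimensions.get(k) == v for k, v in dims.items())
    ]


def known_dimensions(points):
    """Collect every dimension name that appears on at least one point."""
    names = set()
    for p in points:
        names.update(p.dimensions)
    return names


# Hypothetical data: the second upload introduces a new "gender" dimension.
points = [
    DataPoint("imr", 25.0, {"state": "Karnataka", "year": 2015}),
    DataPoint("imr", 30.0, {"state": "Kerala", "year": 2015, "gender": "female"}),
]
by_year = query(points, "imr", year=2015)
by_gender = query(points, "imr", gender="female")
```

The new dimension is accepted without altering any table, but the query on `gender` silently skips every point that never recorded it. That is the kind of unpredictability described in point 2.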

asdofindia commented 4 years ago

An example of this kind of arbitrary upload can be seen on data.gov.in.

Similarly, there is rawgraphs.io.

But neither of these connects arbitrary uploaded data to data already in the system. Being able to make that connection is valuable.