asdofindia opened 4 years ago
Furthering #10, here are some lower-level details of the kind of queries that the Elasticsearch index will have to answer.
The Elasticsearch index that holds the data is currently populated through Hibernate Search persistence.
One document in that index looks like this:
```json
{
  "_index": "dataelement-000001",
  "_type": "_doc",
  "_id": "7",
  "_version": 21,
  "_score": 0,
  "_source": {
    "geography": 5,
    "indicator": 6,
    "report": 2,
    "settlement": "ANY",
    "source": 1,
    "start_time": "2015-01-01",
    "end_time": "2016-12-31",
    "upload": 3,
    "value": "1008",
    "_entity_type": "DataElement"
  }
}
```
As can be seen, the document `_source` is a DataElement. Here is the mapping at the moment:
```json
{
  "mapping": {
    "_doc": {
      "dynamic": "strict",
      "properties": {
        "_entity_type": { "type": "keyword", "index": false },
        "end_time": { "type": "date", "doc_values": false, "format": "uuuu-MM-dd" },
        "geography": { "type": "long", "doc_values": false },
        "indicator": { "type": "long", "doc_values": false },
        "report": { "type": "long", "doc_values": false },
        "settlement": { "type": "keyword", "doc_values": false },
        "source": { "type": "long", "doc_values": false },
        "start_time": { "type": "date", "doc_values": false, "format": "uuuu-MM-dd" },
        "upload": { "type": "long", "doc_values": false },
        "value": { "type": "keyword", "doc_values": false }
      }
    }
  }
}
```
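To make the query requirements from #10 concrete, here is a sketch (in Python, building the request body as a plain dict; the ID and date values are hypothetical) of the kind of filter this index has to answer: fetch data elements for one indicator and geography within a time window.

```python
# Sketch of a query body against the dataelement index above.
# The ID values used below (indicator=6, geography=5) are hypothetical.
def data_element_query(indicator_id, geography_id, start, end):
    """Build an Elasticsearch bool-filter query for data elements."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"indicator": indicator_id}},
                    {"term": {"geography": geography_id}},
                    # dates use the uuuu-MM-dd format declared in the mapping
                    {"range": {"start_time": {"gte": start}}},
                    {"range": {"end_time": {"lte": end}}},
                ]
            }
        }
    }

body = data_element_query(6, 5, "2015-01-01", "2016-12-31")
```

Note that because `doc_values` is disabled on every field in this mapping, the index can filter like this but cannot sort or aggregate on those fields.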
The respective PostgreSQL table schemas can be found in the table creation script. Relevant parts:
```sql
CREATE TABLE public.geographies (
    id bigint NOT NULL,
    name character varying(255),
    ceased date,
    established date,
    type character varying(255),
    short_code character varying(255),
    belongs_to bigint
);

CREATE TABLE public.indicators (
    id bigint NOT NULL,
    name character varying(255),
    derivation character varying(255),
    short_code character varying(255)
);

CREATE TABLE public.reports (
    id bigint NOT NULL,
    uri character varying(255)
);

CREATE TABLE public.sources (
    id bigint NOT NULL,
    name character varying(255),
    end_time date,
    start_time date,
    belongs_to bigint
);

CREATE TABLE public.uploads (
    id bigint NOT NULL,
    report_id bigint,
    source_id bigint
);
```
I think it will be better to have the text of the entity, indicator, etc. directly in Elasticsearch, so that search and filter can be more direct. That is what we have done with the Observation index.
Storing only the ID means one more call is needed to resolve the actual value.
```json
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ],
      "properties": {
        "Aggregation Method": { "type": "keyword" },
        "Comments": { "type": "keyword" },
        "Data source": { "type": "keyword" },
        "License": { "type": "keyword" },
        "License URL": { "type": "keyword" },
        "Limitations and Expectations": { "type": "keyword" },
        "Links": { "type": "keyword" },
        "Source": { "type": "keyword" },
        "duration": {
          "properties": {
            "type": { "type": "keyword" }
          }
        },
        "entity": {
          "properties": {
            "Dist2011": { "type": "keyword" },
            "DistCode": { "type": "keyword" },
            "district": { "type": "keyword" },
            "district_map": { "type": "keyword" },
            "state": { "type": "keyword" },
            "stateutmap": { "type": "keyword" },
            "type": { "type": "keyword" }
          }
        },
        "indicator": { "type": "keyword" },
        "indicator_category": { "type": "keyword" },
        "indicator_definition": { "type": "keyword" },
        "indicator_denominator": { "type": "keyword" },
        "indicator_expectedFrequency": { "type": "keyword" },
        "indicator_id": { "type": "keyword" },
        "indicator_methodOfEstimation": { "type": "keyword" },
        "indicator_multiplier": { "type": "keyword" },
        "indicator_normalized": { "type": "keyword" },
        "indicator_numerator": { "type": "keyword" },
        "indicator_positive_negative": { "type": "keyword" },
        "indicator_rationale": { "type": "keyword" },
        "indicator_short_name": { "type": "keyword" },
        "indicator_subcategory": { "type": "keyword" },
        "indicator_suggestedLevelOfUse": { "type": "keyword" },
        "indicator_topic": { "type": "keyword" },
        "indicator_type": { "type": "keyword" },
        "indicator_unitofmeasure": { "type": "keyword" },
        "indicator_universal_name": { "type": "keyword" },
        "settlement": { "type": "keyword" },
        "source": { "type": "keyword" },
        "value": { "type": "keyword" }
      }
    }
  }
}
```
This is how the mapping looks now. Please note that almost all of these fields are created dynamically from the mapping sheets.
With a dual approach where Postgres is the source of truth and Elasticsearch is dispensable, we need a strategy to keep both databases up to date.
There are two ways we could proceed:
1) Using Logstash, wherein we maintain an update timestamp in our RDBMS and Logstash is configured to pick up records updated after the last sync and overwrite them.
2) Syncing through the application layer, wherein we insert, update, and delete simultaneously in Elasticsearch and Postgres.
The advantage of the latter is that complicated data mutations can be done in Java rather than through Logstash transformation plugins. The disadvantage is that we have to manage the synchronization issues ourselves.
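A minimal sketch of the application-layer approach, with Postgres as the mandatory write and Elasticsearch as a best-effort write that is queued for retry on failure (the helper functions `save_to_postgres`, `index_in_elastic`, and `queue_for_reindex` are hypothetical; the real code would live in the Java service layer):

```python
# Sketch of application-layer dual writes: Postgres stays the source of
# truth; Elasticsearch writes that fail are queued so the index can be
# brought back up to date later (it is dispensable, not authoritative).
def save_data_element(element, save_to_postgres, index_in_elastic, queue_for_reindex):
    save_to_postgres(element)       # must succeed; source of truth
    try:
        index_in_elastic(element)   # best effort
    except Exception:
        queue_for_reindex(element)  # retry later so ES catches up
    return element
```

The same pattern applies to updates and deletes; the open question is how to handle the retry queue and how to detect drift between the two stores.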