Metastring / HealthHeatMap

0 stars 0 forks source link

Design architecture for elasticsearch population and synchronization #14

Open asdofindia opened 4 years ago

asdofindia commented 4 years ago

With a dual approach of postgres being source of truth and elasticsearch being dispensable, we need a strategy to keep both databases up-to-date.

There are two ways we could proceed:

1) Using logstash wherein we maintain an update timestamp in our RDBMS and logstash is configured to search for records updated after the last sync and overwrite them.

2) Another approach is to sync data using application layer wherein we insert, update, delete, etc simultaneously in elastic and postgres.

The advantage of the latter is complicated mutation of data can be done in java rather than through logstash plugins for transformation. The disadvantage is that we will have to manage synchronization issues ourselves.

asdofindia commented 4 years ago

Furthering #10, here are some lower level details of the kind of queries that elastic index will have to answer.

Entities

Indicators

Data

asdofindia commented 4 years ago

Elasticsearch index that holds data is currently populated using hibernate search persistence.

One document in that index looks like this:

{
  "_index": "dataelement-000001",
  "_type": "_doc",
  "_id": "7",
  "_version": 21,
  "_score": 0,
  "_source": {
    "geography": 5,
    "indicator": 6,
    "report": 2,
    "settlement": "ANY",
    "source": 1,
    "start_time": "2015-01-01",
    "end_time": "2016-12-31",
    "upload": 3,
    "value": "1008",
    "_entity_type": "DataElement"
  }
}

As can be seen the document source is a dataelement which has the following fields:

Here is the mapping at the moment:

{
  "mapping": {
    "_doc": {
      "dynamic": "strict",
      "properties": {
        "_entity_type": {
          "type": "keyword",
          "index": false
        },
        "end_time": {
          "type": "date",
          "doc_values": false,
          "format": "uuuu-MM-dd"
        },
        "geography": {
          "type": "long",
          "doc_values": false
        },
        "indicator": {
          "type": "long",
          "doc_values": false
        },
        "report": {
          "type": "long",
          "doc_values": false
        },
        "settlement": {
          "type": "keyword",
          "doc_values": false
        },
        "source": {
          "type": "long",
          "doc_values": false
        },
        "start_time": {
          "type": "date",
          "doc_values": false,
          "format": "uuuu-MM-dd"
        },
        "upload": {
          "type": "long",
          "doc_values": false
        },
        "value": {
          "type": "keyword",
          "doc_values": false
        }
      }
    }
  }
}

The respective postgresql table schemas can be found in the table creation script. Relevant parts:

CREATE TABLE public.geographies (
    id bigint NOT NULL,
    name character varying(255),
    ceased date,
    established date,
    type character varying(255),
    short_code character varying(255),
    belongs_to bigint
);

CREATE TABLE public.indicators (
    id bigint NOT NULL,
    name character varying(255),
    derivation character varying(255),
    short_code character varying(255)
);

CREATE TABLE public.reports (
    id bigint NOT NULL,
    uri character varying(255)
);

CREATE TABLE public.sources (
    id bigint NOT NULL,
    name character varying(255),
    end_time date,
    start_time date,
    belongs_to bigint
);

CREATE TABLE public.uploads (
    id bigint NOT NULL,
    report_id bigint,
    source_id bigint
);
rsprabha commented 4 years ago
  1. Entities inside a particular bounding box of latitude and longitude pairs
    • This will go through a map canvas? and entities will be provided to the ES?
  2. On the ES Index: "geography": 5, "indicator": 6, "report": 2, "settlement": "ANY", "source": 1,

I think it will be better to actually have the text if the entity, indicator, etc.. actually in the ES. SO that search and filter can be more direct. That is what we have one with the Observation Index.

Putting index number will require one more call to identify the value.

asdofindia commented 4 years ago
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ],
      "properties": {
        "Aggregation Method": {
          "type": "keyword"
        },
        "Comments": {
          "type": "keyword"
        },
        "Data source": {
          "type": "keyword"
        },
        "License": {
          "type": "keyword"
        },
        "License URL": {
          "type": "keyword"
        },
        "Limitations and Expectations": {
          "type": "keyword"
        },
        "Links": {
          "type": "keyword"
        },
        "Source": {
          "type": "keyword"
        },
        "duration": {
          "properties": {
            "type": {
              "type": "keyword"
            }
          }
        },
        "entity": {
          "properties": {
            "Dist2011": {
              "type": "keyword"
            },
            "DistCode": {
              "type": "keyword"
            },
            "district": {
              "type": "keyword"
            },
            "district_map": {
              "type": "keyword"
            },
            "state": {
              "type": "keyword"
            },
            "stateutmap": {
              "type": "keyword"
            },
            "type": {
              "type": "keyword"
            }
          }
        },
        "indicator": {
          "type": "keyword"
        },
        "indicator_category": {
          "type": "keyword"
        },
        "indicator_definition": {
          "type": "keyword"
        },
        "indicator_denominator": {
          "type": "keyword"
        },
        "indicator_expectedFrequency": {
          "type": "keyword"
        },
        "indicator_id": {
          "type": "keyword"
        },
        "indicator_methodOfEstimation": {
          "type": "keyword"
        },
        "indicator_multiplier": {
          "type": "keyword"
        },
        "indicator_normalized": {
          "type": "keyword"
        },
        "indicator_numerator": {
          "type": "keyword"
        },
        "indicator_positive_negative": {
          "type": "keyword"
        },
        "indicator_rationale": {
          "type": "keyword"
        },
        "indicator_short_name": {
          "type": "keyword"
        },
        "indicator_subcategory": {
          "type": "keyword"
        },
        "indicator_suggestedLevelOfUse": {
          "type": "keyword"
        },
        "indicator_topic": {
          "type": "keyword"
        },
        "indicator_type": {
          "type": "keyword"
        },
        "indicator_unitofmeasure": {
          "type": "keyword"
        },
        "indicator_universal_name": {
          "type": "keyword"
        },
        "settlement": {
          "type": "keyword"
        },
        "source": {
          "type": "keyword"
        },
        "value": {
          "type": "keyword"
        }
      }
    }
  }
}

This is how the mapping looks like now. Please note that almost all of these fields are dynamically created from the mappings sheets.