asdofindia opened 4 years ago
Furthering #10, here are some lower-level details of the kind of queries that the Elasticsearch index will have to answer.
The Elasticsearch index that holds the data is currently populated through Hibernate Search persistence.
One document in that index looks like this:
```json
{
  "_index": "dataelement-000001",
  "_type": "_doc",
  "_id": "7",
  "_version": 21,
  "_score": 0,
  "_source": {
    "geography": 5,
    "indicator": 6,
    "report": 2,
    "settlement": "ANY",
    "source": 1,
    "start_time": "2015-01-01",
    "end_time": "2016-12-31",
    "upload": 3,
    "value": "1008",
    "_entity_type": "DataElement"
  }
}
```
As can be seen, the document `_source` is a DataElement. Here is the mapping at the moment:
```json
{
  "mapping": {
    "_doc": {
      "dynamic": "strict",
      "properties": {
        "_entity_type": { "type": "keyword", "index": false },
        "end_time": { "type": "date", "doc_values": false, "format": "uuuu-MM-dd" },
        "geography": { "type": "long", "doc_values": false },
        "indicator": { "type": "long", "doc_values": false },
        "report": { "type": "long", "doc_values": false },
        "settlement": { "type": "keyword", "doc_values": false },
        "source": { "type": "long", "doc_values": false },
        "start_time": { "type": "date", "doc_values": false, "format": "uuuu-MM-dd" },
        "upload": { "type": "long", "doc_values": false },
        "value": { "type": "keyword", "doc_values": false }
      }
    }
  }
}
```
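To make the query requirements from #10 concrete, here is a sketch (in Python, building the request body as a plain dict; the ID and date values are hypothetical) of the kind of filter this index has to answer: fetch data elements for one indicator and geography within a time window.

```python
# Sketch of a query body against the dataelement index above.
# The ID values used below (indicator=6, geography=5) are hypothetical.
def data_element_query(indicator_id, geography_id, start, end):
    """Build an Elasticsearch bool-filter query for data elements."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"indicator": indicator_id}},
                    {"term": {"geography": geography_id}},
                    # dates use the uuuu-MM-dd format declared in the mapping
                    {"range": {"start_time": {"gte": start}}},
                    {"range": {"end_time": {"lte": end}}},
                ]
            }
        }
    }

body = data_element_query(6, 5, "2015-01-01", "2016-12-31")
```

Note that because `doc_values` is disabled on every field in this mapping, the index can filter like this but cannot sort or aggregate on those fields.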
The respective PostgreSQL table schemas can be found in the table creation script. Relevant parts:
```sql
CREATE TABLE public.geographies (
    id bigint NOT NULL,
    name character varying(255),
    ceased date,
    established date,
    type character varying(255),
    short_code character varying(255),
    belongs_to bigint
);

CREATE TABLE public.indicators (
    id bigint NOT NULL,
    name character varying(255),
    derivation character varying(255),
    short_code character varying(255)
);

CREATE TABLE public.reports (
    id bigint NOT NULL,
    uri character varying(255)
);

CREATE TABLE public.sources (
    id bigint NOT NULL,
    name character varying(255),
    end_time date,
    start_time date,
    belongs_to bigint
);

CREATE TABLE public.uploads (
    id bigint NOT NULL,
    report_id bigint,
    source_id bigint
);
```
I think it will be better to have the text of the entity, indicator, etc. directly in Elasticsearch, so that search and filter can be more direct. That is what we have done with the Observation index.
Storing only the ID means one more call is needed to resolve the actual value.
```json
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ],
      "properties": {
        "Aggregation Method": { "type": "keyword" },
        "Comments": { "type": "keyword" },
        "Data source": { "type": "keyword" },
        "License": { "type": "keyword" },
        "License URL": { "type": "keyword" },
        "Limitations and Expectations": { "type": "keyword" },
        "Links": { "type": "keyword" },
        "Source": { "type": "keyword" },
        "duration": {
          "properties": {
            "type": { "type": "keyword" }
          }
        },
        "entity": {
          "properties": {
            "Dist2011": { "type": "keyword" },
            "DistCode": { "type": "keyword" },
            "district": { "type": "keyword" },
            "district_map": { "type": "keyword" },
            "state": { "type": "keyword" },
            "stateutmap": { "type": "keyword" },
            "type": { "type": "keyword" }
          }
        },
        "indicator": { "type": "keyword" },
        "indicator_category": { "type": "keyword" },
        "indicator_definition": { "type": "keyword" },
        "indicator_denominator": { "type": "keyword" },
        "indicator_expectedFrequency": { "type": "keyword" },
        "indicator_id": { "type": "keyword" },
        "indicator_methodOfEstimation": { "type": "keyword" },
        "indicator_multiplier": { "type": "keyword" },
        "indicator_normalized": { "type": "keyword" },
        "indicator_numerator": { "type": "keyword" },
        "indicator_positive_negative": { "type": "keyword" },
        "indicator_rationale": { "type": "keyword" },
        "indicator_short_name": { "type": "keyword" },
        "indicator_subcategory": { "type": "keyword" },
        "indicator_suggestedLevelOfUse": { "type": "keyword" },
        "indicator_topic": { "type": "keyword" },
        "indicator_type": { "type": "keyword" },
        "indicator_unitofmeasure": { "type": "keyword" },
        "indicator_universal_name": { "type": "keyword" },
        "settlement": { "type": "keyword" },
        "source": { "type": "keyword" },
        "value": { "type": "keyword" }
      }
    }
  }
}
```
This is how the mapping looks now. Please note that almost all of these fields are created dynamically from the mapping sheets.
With a dual approach where Postgres is the source of truth and Elasticsearch is dispensable, we need a strategy to keep both databases up to date.
There are two ways we could proceed:
1) Using Logstash, wherein we maintain an update timestamp in our RDBMS and Logstash is configured to pick up records updated after the last sync and overwrite them.
2) Syncing through the application layer, wherein we insert, update, and delete simultaneously in Elasticsearch and Postgres.
The advantage of the latter is that complicated data mutations can be done in Java rather than through Logstash transformation plugins. The disadvantage is that we have to manage the synchronization issues ourselves.
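A minimal sketch of the application-layer approach, with Postgres as the mandatory write and Elasticsearch as a best-effort write that is queued for retry on failure (the helper functions `save_to_postgres`, `index_in_elastic`, and `queue_for_reindex` are hypothetical; the real code would live in the Java service layer):

```python
# Sketch of application-layer dual writes: Postgres stays the source of
# truth; Elasticsearch writes that fail are queued so the index can be
# brought back up to date later (it is dispensable, not authoritative).
def save_data_element(element, save_to_postgres, index_in_elastic, queue_for_reindex):
    save_to_postgres(element)       # must succeed; source of truth
    try:
        index_in_elastic(element)   # best effort
    except Exception:
        queue_for_reindex(element)  # retry later so ES catches up
    return element
```

The same pattern applies to updates and deletes; the open question is how to handle the retry queue and how to detect drift between the two stores.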