Storing Metadata on Indicators

asdofindia commented 4 years ago

Imagine an indicator like "Maternal Mortality Rate"

There are several attributes of this particular indicator that are useful. Some examples:

A unique short code that universally refers to this particular indicator.
A definition in free text on what it means.
Details of how the indicator is calculated. For example, the maternal mortality rate is calculated as the number of maternal deaths in a time period divided by the number of women in the reproductive age group (in 1000s).
Information on whether it is a positive indicator (higher value is better health) or a negative indicator (lower value is better health), or something that's optimum in certain ranges.
Information on whether the indicator is aggregated in an additive or a non-additive way. For example:
- The maternal mortality rate above is a rate, so you cannot add the rates of all the districts in a state to get the value for the state. Interestingly, you also cannot take a simple average of the district rates. To get the correct state-level maternal mortality rate from the district values, one has to take a weighted average, where each district's weight is its denominator divided by the sum of denominators (that is, the number of women in the reproductive age group in the district divided by the total number of women in the reproductive age group in the state).
- Population, on the other hand, is something that can simply be added up.
A list of closely related indicators (eg, maternal mortality ratio, perinatal mortality rate, etc)
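The aggregation caveat for rates in the list above can be checked with a small worked example. This is a minimal sketch: the district names and figures are invented purely for illustration, and `rate_per_1000` is a hypothetical helper, not part of any existing code.

```python
# Hypothetical district figures, invented for illustration only.
districts = {
    "District A": {"deaths": 10, "women": 50_000},
    "District B": {"deaths": 90, "women": 150_000},
}


def rate_per_1000(deaths, women):
    """Maternal deaths per 1000 women in the reproductive age group."""
    return 1000 * deaths / women


rates = {name: rate_per_1000(d["deaths"], d["women"]) for name, d in districts.items()}

total_deaths = sum(d["deaths"] for d in districts.values())
total_women = sum(d["women"] for d in districts.values())

# The true state-level rate, computed from the pooled counts.
state_rate = rate_per_1000(total_deaths, total_women)

# A simple average of district rates ignores district size.
naive_mean = sum(rates.values()) / len(rates)

# Weighting each district rate by its share of the denominator
# recovers the state-level rate.
weighted_mean = sum(
    rates[name] * d["women"] / total_women for name, d in districts.items()
)

print(state_rate, naive_mean, weighted_mean)
```

With these made-up numbers the simple average differs from the pooled state rate, while the denominator-weighted average matches it.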

Now, there are certain other attributes that arise when an indicator is taken in relation with a source of data. These include:

How is the value calculated (eg: estimation, survey, taken from a registry, etc)
What is the name used by the source to refer to this indicator (e.g., it could be different from "Maternal Mortality Rate"; the source could have used a term like "maternal death rate")

There are certain other attributes which could be considered like

Tags or categories under which this indicator can be grouped. The difference between a tag and a category is that one indicator can have multiple tags, while a category is usually used in a stricter way, where one indicator can belong to only one category.
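To make the attributes discussed so far concrete, here is one possible shape for a metadata record. It is only a sketch: every class, field, and code name below (`IndicatorMetadata`, `SourceIndicator`, `maternal_mortality_rate`, and so on) is hypothetical, and the free-text definition is left as a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Direction(Enum):
    POSITIVE = "positive"            # higher value is better health
    NEGATIVE = "negative"            # lower value is better health
    OPTIMUM_RANGE = "optimum_range"  # best within a certain range


@dataclass
class IndicatorMetadata:
    """Attributes that belong to the indicator itself (hypothetical schema)."""
    code: str                     # unique short code
    name: str
    definition: str               # free-text definition
    calculation: str              # how the indicator is calculated
    direction: Direction
    additive: bool                # can values simply be summed?
    related: List[str] = field(default_factory=list)  # codes of related indicators
    category: Optional[str] = None                    # at most one category
    tags: List[str] = field(default_factory=list)     # any number of tags


@dataclass
class SourceIndicator:
    """Attributes that arise only in relation to a data source."""
    indicator_code: str
    source: str
    method: str        # e.g. estimation, survey, registry
    name_in_source: str


mmr = IndicatorMetadata(
    code="maternal_mortality_rate",
    name="Maternal Mortality Rate",
    definition="(free-text definition goes here)",
    calculation="maternal deaths in a time period / women in reproductive age group (in 1000s)",
    direction=Direction.NEGATIVE,
    additive=False,
    related=["maternal_mortality_ratio", "perinatal_mortality_rate"],
    tags=["maternal health", "mortality"],
)
```

Keeping `category` as a single optional value and `tags` as a list mirrors the tag/category distinction described above.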

Examples of indicators

Previous work

Related reading:

HEALTH INDICATORS: Conceptual and operational considerations - a document that barely scratches the surface of the issues described here, but has some tangential ideas
Multidimensional Databases and Data Warehousing - a book that describes terminologies like "dimension", "fact", "measurement", etc.
Multidimensional Data Modeling for Complex Data - an old paper that describes the elements like additive data, non-additive data, etc.

rsprabha commented 4 years ago

Thanks for this discussion document. WHO seems to have the idea of Core Health Indicators in the document that you sent. It may be worthwhile to look at this as well and see how many are captured in our system. It may also be worthwhile to have other metadata on the indicators, like the range of values for the indicator and the characteristics of its distribution.

asdofindia commented 4 years ago

Indicators (and other dimensions) in real life exist independent of data.

For example, there could be the indicator "Happiness quotient". This might not be available in any of the datasets that we have.

Capturing such indicators requires us to think about whether we need a separate index just to store all the possible and potential values of all the dimensions, regardless of whether there actually is data corresponding to them.

Another example is a composite indicator that is calculated on the fly. For example, imagine we create an indicator called "maternal health index" which is a sum of three other indicators that are present in the dataset. Now, either we can precompute this value for every possible combination of dimensions and store it or we can compute it on demand. If we are doing the latter, where would the maternal health index be stored?

asdofindia commented 4 years ago

Some preliminary thoughts on tackling some of the fields (numerator-denominator, etc)

Continuing from the comment on aggregation methods.

Essentially, things like "numerator", "denominator", "multiplying factor" are all part of how the value of an indicator is calculated. For rates, there will always be a numerator and denominator. The multiplying factor comes in when we want to describe something in "percentage" or "per thousand" where we multiply the rate by 100, 1000, etc respectively.
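As a rough illustration of how numerator, denominator, and multiplying factor fit together, a rate could be described by a small record like the one below. The class name `RateDefinition`, its fields, and the counts are all hypothetical and only serve the example.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RateDefinition:
    """Hypothetical description of a rate-type indicator."""
    numerator: str        # key of the numerator count
    denominator: str      # key of the denominator count
    multiplier: int = 1   # 100 for percentage, 1000 for "per thousand"

    def value(self, counts: Mapping[str, float]) -> float:
        return self.multiplier * counts[self.numerator] / counts[self.denominator]


# Invented counts, for illustration only.
counts = {"maternal_deaths": 5, "women_reproductive_age": 10_000}

per_thousand = RateDefinition("maternal_deaths", "women_reproductive_age", 1000)
as_percentage = RateDefinition("maternal_deaths", "women_reproductive_age", 100)

print(per_thousand.value(counts), as_percentage.value(counts))
```

The same numerator and denominator yield different reported values depending only on the multiplying factor.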

Now, we could imagine an indicator which is not strictly a rate. For example, imagine the maternal health index in the comment above. The maternal health index score could be defined as the result of the calculation 1 / ("maternal mortality rate" + "perinatal mortality rate" + "proportion of women receiving Iron Folic Acid"). This can be expressed as a numerator and a denominator too, but the denominator here is not simple: it is the sum of three things.

One approach to consider, therefore, is to capture a "formula" for calculating the value of an indicator.

For example, we could capture the formula of maternal mortality rate as

{
  "multiply": [
    { "divide": ["number of maternal deaths", "number of women in reproductive age group"] },
    1000
  ]
}

We could similarly capture the formula for maternal health index as

{"divide": [1, {"sum": ["maternal mortality rate", "perinatal mortality rate", "proportion of women receiving Iron Folic Acid"]}]}

Similarly, I think we can capture any metadata that may be required for the calculation of indicators.

Now, of course, the entities in these formulae have to be unique references to indicators (and not plain strings as shown above).
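A nested formula of this kind can be evaluated with a short recursive function. The sketch below assumes three operators (`sum`, `multiply`, `divide`), treats strings as references looked up in a dictionary of known values, and uses made-up short codes and numbers; none of this is an existing implementation.

```python
from functools import reduce
from operator import mul


def evaluate(formula, values):
    """Recursively evaluate a nested formula.

    formula: a number, a reference (str), or a one-key dict {operator: [args]}.
    values:  mapping from reference to its numeric value.
    """
    if isinstance(formula, (int, float)):
        return formula
    if isinstance(formula, str):
        return values[formula]  # KeyError if the reference is unknown
    if len(formula) != 1:
        raise ValueError("each formula node must have exactly one operator")
    (op, args), = formula.items()
    operands = [evaluate(arg, values) for arg in args]
    if op == "sum":
        return sum(operands)
    if op == "multiply":
        return reduce(mul, operands, 1)
    if op == "divide":
        numerator, denominator = operands
        return numerator / denominator
    raise ValueError(f"unknown operator: {op}")


# Hypothetical short codes and invented values, for illustration only.
values = {"maternal_deaths": 5, "women_reproductive_age": 10_000,
          "mmr": 0.5, "pmr": 0.25, "ifa_proportion": 0.25}

# Multiplier chosen for illustration.
rate_formula = {"multiply": [
    {"divide": ["maternal_deaths", "women_reproductive_age"]}, 1000]}
index_formula = {"divide": [1, {"sum": ["mmr", "pmr", "ifa_proportion"]}]}

print(evaluate(rate_formula, values), evaluate(index_formula, values))
```

Because references are resolved through `values`, a composite indicator can point at other indicators by their unique codes rather than by display names.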

How would we then calculate these values (for indicators whose value is not present in any of our sources, but the components which appear in their formula do have values in our dataset)?

We can use a scripted metric aggregation inside a bucket aggregation to calculate these values inside each bucket.
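The terms "scripted metric aggregation" and "bucket aggregation" suggest Elasticsearch, so the sketch below assumes that backend. It only builds the request body as a Python dictionary; the field names (`state_name`, `maternal_deaths`, `women_reproductive_age`), the aggregation names, and the per-1000 multiplier are assumptions, and every document is assumed to carry both numeric fields.

```python
import json

# Request body for a search: one bucket per state (terms aggregation),
# with a scripted_metric inside each bucket that pools numerator and
# denominator before dividing. All field names are hypothetical.
search_body = {
    "size": 0,
    "aggs": {
        "by_state": {
            "terms": {"field": "state_name"},
            "aggs": {
                "maternal_mortality_rate": {
                    "scripted_metric": {
                        # Runs once per shard before any document is seen.
                        "init_script": "state.num = 0.0; state.den = 0.0",
                        # Runs once per document in the bucket.
                        "map_script": (
                            "state.num += doc['maternal_deaths'].value; "
                            "state.den += doc['women_reproductive_age'].value"
                        ),
                        # Hands the per-shard totals to the reduce phase.
                        "combine_script": "return state",
                        # Pools the shard totals and computes the rate.
                        "reduce_script": (
                            "double num = 0; double den = 0; "
                            "for (s in states) { num += s.num; den += s.den } "
                            "if (den == 0) { return null } "
                            "return 1000.0 * num / den"
                        ),
                    }
                }
            },
        }
    },
}

print(json.dumps(search_body, indent=2))
```

Summing numerator and denominator separately and dividing only in the reduce step is what makes the result a correctly weighted rate for each bucket, rather than an average of per-document rates.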

asdofindia commented 4 years ago

The other complicated thing to capture on an indicator is how it should get aggregated.

There are only so many ways values can get aggregated.

SUM (which means values just get added)
AVERAGE (which means values just get averaged; there is a question of whether or not to drop NA values when averaging)
WEIGHTED AVERAGE (where values get averaged according to a weight that is given by another variable; again, there is a question of whether to drop NA values)
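The three methods and the NA question can be sketched as a single function. This is an illustration under stated assumptions: NA is represented as `None`, the names `Aggregation` and `aggregate` are hypothetical, and when `drop_na` is false a single NA makes the whole result NA.

```python
from enum import Enum


class Aggregation(Enum):
    SUM = "sum"
    AVERAGE = "average"
    WEIGHTED_AVERAGE = "weighted_average"


def aggregate(method, values, weights=None, drop_na=True):
    """Aggregate values with the given method; NA is represented as None."""
    if method is Aggregation.WEIGHTED_AVERAGE and weights is None:
        raise ValueError("weighted average needs weights")
    if weights is None:
        weights = [1] * len(values)
    pairs = list(zip(values, weights))
    if drop_na:
        pairs = [(v, w) for v, w in pairs if v is not None and w is not None]
    elif any(v is None or w is None for v, w in pairs):
        return None  # NA propagates when it is not dropped
    if not pairs:
        return None
    if method is Aggregation.SUM:
        return sum(v for v, _ in pairs)
    if method is Aggregation.AVERAGE:
        return sum(v for v, _ in pairs) / len(pairs)
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in pairs) / total_weight


print(aggregate(Aggregation.SUM, [1, 2, None]))
print(aggregate(Aggregation.AVERAGE, [1, None, 3], drop_na=False))
print(aggregate(Aggregation.WEIGHTED_AVERAGE, [0.2, 0.6], weights=[50_000, 150_000]))
```

Storing the method (and, for weighted averages, a reference to the weight variable) on the indicator would let a generic routine like this pick the right behaviour.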

Metastring / HealthHeatMap

Storing Metadata on Indicators #24