Discussion: How should we present the data

jm-rivera commented 1 year ago

There are a variety of ways to group/organise and present the data.

Should our visualisations be:

At country level
Based on income, regional, or other groupings
A combination (how/which)

What would be most helpful?

If grouped, are we happy to use group median values?

If grouped, we would need to 'stabilise' groups somehow (in case the time series have gaps). This could be done by:

establishing a minimum threshold for inclusion (data available for x% of years).
In order to minimise lost data, we could interpolate if data is missing for certain years ('inner' so only creating linear interpolations between two known values).
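To make the two options above concrete, here is a minimal sketch in plain Python. The function names, the 70% default threshold, and the sample values are all assumptions for illustration, not something we have agreed on. It assumes one value per year with evenly spaced years, and `None` for a missing year.

```python
def coverage(values):
    """Share of years with a known (non-None) value."""
    return sum(v is not None for v in values) / len(values)


def interpolate_inner(values):
    """Linearly fill gaps that sit between two known values.

    Leading and trailing gaps are left as None ('inner' interpolation only).
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def stabilise(values, min_coverage=0.7):
    """Return the inner-interpolated series, or None if it is too sparse."""
    if coverage(values) < min_coverage:
        return None
    return interpolate_inner(values)


# Hypothetical yearly values for one country (None = missing year)
series = [None, 4.0, None, None, 7.0, 8.0, None]
print(stabilise(series, min_coverage=0.4))
# [None, 4.0, 5.0, 6.0, 7.0, 8.0, None]
```

Only the two years between the known values 4.0 and 7.0 get filled; the first and last years stay missing because there is nothing on one side to interpolate from.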

nupur-parikh commented 1 year ago

Is it possible to do a combo/all three of the following groups:

Africa (total)
country level for all African countries
Income group

I think some version of all three would be most helpful, but I'm not sure what the data allows for. Are there enough countries with available data for this?

nupur-parikh commented 1 year ago

As for the grouping, can you explain to me again what the drawbacks of using median values are? And what is best practice for an analysis like this, if there is any? I'm not sure I fully understand the pros/cons of doing something like this.

jm-rivera commented 1 year ago

Is it possible to do a combo/all three of the following groups:

Africa (total)

country level for all African countries

Income group

I think some version of all three would be most helpful, but I'm not sure what the data allows for. Are there enough countries with available data for this?

Yes, I think that should be possible, though see my next comment about some of the groups.

jm-rivera commented 1 year ago

As for the grouping, can you explain to me again what the drawbacks of using median values are? And what is best practice for an analysis like this, if there is any? I'm not sure I fully understand the pros/cons of doing something like this.

The main issue with groups is one of missing data.

When looking at time series data, you want to make sure that the changes you observe over time are indeed true in the underlying data, and not created by things like missing data.

When looking at an individual country, it's easier to deal with things like that: either a data point is missing or it isn't. So you either show it or you don't (or you impute it somehow).

However, it is trickier when looking at groups. For example, your group may have 20 countries in it, and data for all 20 countries may be available in several years. But in other years data may only be available for, say, 15 countries, and the 5 missing countries aren't the same every year. If we were to just add up the 'available data' for the group and present it as the group total for each year, the danger would be that the amounts could be higher or lower simply because we sometimes have data for more or fewer countries.
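A toy illustration of that danger, with entirely made-up numbers: four hypothetical countries each spend a constant 10 every year, but not all of them report every year.

```python
# Hypothetical group: every country spends a constant 10 each year,
# but some years are missing (None) for some countries.
data = {
    "A": {2019: 10, 2020: 10, 2021: 10},
    "B": {2019: 10, 2020: None, 2021: 10},
    "C": {2019: 10, 2020: 10, 2021: None},
    "D": {2019: 10, 2020: None, 2021: 10},
}
years = [2019, 2020, 2021]


def naive_total(year):
    """Sum whatever data happens to be available for the year."""
    return sum(c[year] for c in data.values() if c[year] is not None)


def n_reporting(year):
    """Number of countries with a value for the year."""
    return sum(c[year] is not None for c in data.values())


for year in years:
    print(year, naive_total(year), n_reporting(year))
# 2019 40 4
# 2020 20 2
# 2021 30 3
```

The 'total' halves in 2020 and partly recovers in 2021 even though no country changed its spending; the movement comes purely from which countries reported.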

There are a few strategies to get around that, but all of them have tradeoffs. The two main ones to consider:

interpolating/imputing missing data: For countries which have gaps in the data, we could impute or interpolate the values. This means that the group totals would be far less affected by gaps in the original data. But that also means that the totals become estimates that very much depend on how much data we're imputing and the methodology we're following.
using a measure of central tendency for the group: Instead of calculating 'total' values for the group, we could look at ratios only (so share of GDP or per capita figures). By looking at something like the median, we're essentially identifying a country which is representative of the 'typical' level of spending for that group. The median has the advantage of being quite resilient to outliers and (if the groups are big enough) missing data. However, the drawback is that the resulting numbers describe spending for the 'typical country' in the group, and that you cannot produce USD totals in that way.
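As a rough sketch of the second strategy, using Python's standard `statistics.median` on made-up share-of-GDP figures for five hypothetical countries (one of which is an outlier, and one of which is missing in 2020):

```python
from statistics import median

# Hypothetical health spending as % of GDP; None = no data for that year
share_of_gdp = {
    "A": {2019: 4.0, 2020: 4.0},
    "B": {2019: 4.5, 2020: None},
    "C": {2019: 5.0, 2020: 5.0},
    "D": {2019: 5.5, 2020: 5.5},
    "E": {2019: 9.0, 2020: 9.0},  # outlier
}


def group_median(year):
    """Median of the ratios available for the group in a given year."""
    available = [c[year] for c in share_of_gdp.values() if c[year] is not None]
    return median(available)


print(group_median(2019))  # 5.0
print(group_median(2020))  # 5.25
```

With only five countries the median still shifts a little when one drops out (5.0 to 5.25), which is why group size matters; the outlier at 9.0 has no special pull on either figure. Note that the result is a ratio for a 'typical' country, so it cannot be turned back into a USD total.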

Ultimately, part of the decision is down to how much data is missing. If the time series data for each country is mostly complete, and there is data for almost all members of each group, then producing totals is definitely viable (and 'total' share of GDP or 'total' per capita figures are also possible), even if we have to do a little bit of interpolation. But if there are a lot of gaps in the time series data for each country, or if we only have data for some countries in a group, then it isn't methodologically sound to frame such numbers as actual totals (since so much of the data would be either imputed or simply missing).

nupur-parikh commented 1 year ago

Okay I think I understand. I'm familiar with interpolating/imputing missing data and your explanation makes sense with how we would use both interpolation and central tendencies.

A few follow-up questions so I make sure I fully understand:

If we want to look at how much countries spend in total, we would use the interpolating method for anything that is missing?
If we want to look at how much countries spend as % of GDP or per capita, we would use the median as the measure of central tendency?
If we want to look at (for example) Country X's spending on different diseases or services over time, as a % of health expenditure, we would use the median? And if we wanted to look at this in total USD, we would have to interpolate any missing data instead?

I think this page would mostly rely on health spending as a % of GDP or per capita, and maybe one or two data points in the key numbers section on total spending on health at a global level, if it's possible. In this case, I think going with the median values would make the most sense, as we would be focusing mostly on % of GDP or per capita spending.

jm-rivera commented 1 year ago

For your questions:

If looking at individual countries, I suggest no interpolation. Show only the data that is there. If looking at groups of countries, interpolate missing data (within some reasonable levels) to avoid introducing noise/fluctuations based on data availability.
If looking at individual countries, use the actual value. If looking at groups of countries, use the median (unless we discover there is really good data coverage for all the groups we want to show, which I very much doubt).
If an individual country, the actual data. If a group, then the median for that group - interpolating as needed/possible.

nupur-parikh commented 1 year ago

Got it, thanks Jorge, this was extremely helpful for me to understand the different options. Based on what you've explained, I'm fine to use the median, unless by some surprise we discover there is really good data coverage!

ONEcampaign / topic_health_financing

Discussion: How should we present the data #8