Generate provincial forecasts for Pakistan

lagerros commented 4 years ago

We have forecasts for ~30 different regions, but we want to turn a weighted linear combination of the traces into forecasts for each of the 7 distinct administrative units.

Connor has provided the relevant weights here, in the "Region to Province map": https://docs.google.com/spreadsheets/d/1IxPMadPxjnphWSKG_6PxmsrCLoXe3cHGp1Ok9kcddPk/edit#gid=1378327731 (you will need to be granted access)

Now we just need a script for doing the aggregation. If it could be added to epimodel and reused in future when modelling novel provinces in countries, that would be great.

Basically, the output of the script should be an additional set of JSON-files in the web-export, labelled "extdata-[add PROVINCE_NAME].json" but where the infected/recovered/active numbers are just a weighted sum of the numbers for the relevant regions. (And the "Statistics" should be recomputed for the new traces.)

mathijshenquet commented 4 years ago

I would prefer if this logic did not end up in the frontend. Can epimodel accommodate this @gavento?

gavento commented 4 years ago

This seems quite expensive and complex to implement and I would advise against it. Also I doubt the benefits.

- Most importantly: Gleam, and our simple SEIR model, may not be useful enough for small-scale quantitative modeling to warrant this accuracy. Both are very rough tools intended for qualitative forecasts with high uncertainty; redistributing 5% only adds the impression of precision.
- I am somewhat skeptical that we will have this redistribution data for many other cases, or about its accuracy (unless it comes from e.g. the Gleam developers).
- This seems specific to Pakistan and, if implemented in a generic way, would bloat the codebase significantly.
- My impression is that this does not add much real value to the reports; we should aim more for 80/20.
- We do not have those regions; adding them would require a new region level, would break the assumption that regions are trees, etc.

@lagerros

wolverdude commented 4 years ago

New custom regions that consist of arbitrary other regions could be added fairly easily through a join table (or join CSV... we really should think about getting an actual DB at some point). This would not break the existing region model since it would just be tacked onto it. I would feel dirty about doing this sort of thing except that it really is the way I would choose to design the whole region model if I were to refactor it.
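
As a rough illustration of the join-table idea, the sketch below reads a join CSV that maps each custom region to its child regions. The column names (custom_region, child_region, weight), the region codes, and the weights are all hypothetical and are not taken from epimodel.

```python
import csv
import io
from collections import defaultdict

# Hypothetical join CSV: one row per (custom region, child region) pair.
# The codes and weights are placeholders, not real epimodel data.
JOIN_CSV = """custom_region,child_region,weight
PROVINCE_A,REGION_1,0.25
PROVINCE_A,REGION_2,1.0
PROVINCE_B,REGION_2,0.5
"""


def load_custom_regions(text):
    """Return {custom_region: {child_region: weight}} from join-CSV text."""
    members = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        members[row["custom_region"]][row["child_region"]] = float(row["weight"])
    return dict(members)


custom_regions = load_custom_regions(JOIN_CSV)
print(custom_regions)
```

Note that REGION_2 appears under two custom regions here. A join table allows that kind of overlap without touching the existing tree of regions.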

However, I'm somewhat hesitant to do this since @gavento thinks it will be misleading. I don't really understand why though. Would someone have a second to explain this to me?

lagerros commented 4 years ago

I think @gavento is wrong about that.

1) This would still be showing multiple trajectories, including uncertainty. The use case is ~the same: we're modelling rough trajectories in regions with tens of millions of inhabitants over many months, not using it for short-term quantitative modelling.
2) We have good input data for this country.
3) We are in close contact with the decision-makers, so we can make sure things are interpreted appropriately.
4) Final outputs would be sanity-checked before being used.

By default, this solution would only be used in our custom modelling; the plan is not to use it in production on the website.

I think @gavento attaches too much significance to the numbers used in Connor's weighted combination, where I expect a rough estimate to still be fine.

wolverdude commented 4 years ago

We've run into a bit of an issue tabulating population for weighting the linear averages. It appears that the population data in regions-gleam.csv does not remotely line up with the population numbers in the spreadsheet. The Gleam population is on average only 17% of the spreadsheet population, but the ratio varies wildly; otherwise I would chalk it up to some scale factor.

The spreadsheet numbers are clearly the more accurate ones (and more complete - 8/29 Gleam regions do not have population data at all), but I'm not sure what the Gleam numbers represent, so I'm hesitant to just ignore them when calculating the weights.
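
For anyone wanting to reproduce this kind of check, here is a minimal sketch of comparing two population sources region by region. The region codes and numbers are placeholders for illustration only, not the real Gleam or spreadsheet values, and the function name is made up.

```python
from statistics import mean


def population_ratios(gleam_pop, sheet_pop):
    """Return (ratios, missing): the Gleam/spreadsheet ratio per region,
    plus the regions that have no Gleam population at all."""
    ratios, missing = {}, []
    for region, sheet_value in sheet_pop.items():
        gleam_value = gleam_pop.get(region)
        if gleam_value is None:
            missing.append(region)
        else:
            ratios[region] = gleam_value / sheet_value
    return ratios, missing


# Placeholder numbers for illustration only, not the real data.
gleam = {"REGION_1": 100_000, "REGION_2": 900_000, "REGION_3": None}
sheet = {"REGION_1": 1_000_000, "REGION_2": 3_000_000, "REGION_3": 2_000_000}

ratios, missing = population_ratios(gleam, sheet)
print(f"mean ratio: {mean(ratios.values()):.2f}")
print(f"spread: {min(ratios.values()):.2f} to {max(ratios.values()):.2f}")
print(f"regions without Gleam population: {missing}")
```

A wide spread between the smallest and largest ratio is what rules out a single scale factor.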

@lagerros, @gavento do you have any insight into this?

lagerros commented 4 years ago

Sorry, we should have pushed our updated regions-gleam.csv which is consistent with the spreadsheet.

The spreadsheet numbers are the authoritative ones; the gleam ones are a bit outdated.

mathijshenquet commented 4 years ago

> Would someone have a second to explain this to me?

My understanding: The gleam traces do not quantitatively represent the number of covid cases. The reason is that 'active' really means 'infectious', and this is modelled as lasting just a few days. After these days, even though people are still sick, they don't really infect many others anymore, so they have already moved to recovered (which should really be called removed for this reason).

It would be very helpful to look up an onset-to-outcome distribution, convolve it with the daily new cases, and add this to epimodel and the frontend. If I understand it correctly, this is also Jan's main complaint about using the "healthcare capacity" line (apart from proper labeling/communication about what the line means).
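
A minimal sketch of that convolution, in plain Python. The delay distribution below is made up for illustration; a real onset-to-outcome distribution would have to come from the literature. The function names are hypothetical, and reading "currently sick" as cumulative onsets minus cumulative outcomes is my assumption about what we would want to plot.

```python
from itertools import accumulate


def convolve_delay(daily_new_cases, delay_pmf):
    """Expected outcomes per day: new cases convolved with a delay distribution.

    delay_pmf[k] is the probability that the outcome happens k days after onset.
    The result is truncated to the length of the input series.
    """
    n = len(daily_new_cases)
    outcomes = [0.0] * n
    for day, cases in enumerate(daily_new_cases):
        for delay, prob in enumerate(delay_pmf):
            if day + delay < n:
                outcomes[day + delay] += cases * prob
    return outcomes


def currently_sick(daily_new_cases, delay_pmf):
    """People past onset who have not yet reached an outcome, per day."""
    onsets = accumulate(daily_new_cases)
    resolved = accumulate(convolve_delay(daily_new_cases, delay_pmf))
    return [o - r for o, r in zip(onsets, resolved)]


# Illustrative inputs only: 10 cases on day 0, outcome after 1 or 2 days.
new_cases = [10, 0, 0, 0]
toy_pmf = [0.0, 0.5, 0.5]

outcomes = convolve_delay(new_cases, toy_pmf)
sick = currently_sick(new_cases, toy_pmf)
print(outcomes)
print(sick)
```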

wolverdude commented 4 years ago

If what @mathijshenquet says is true, doesn't that make the entire curve misleading? People are going to intuitively look at those lines and think, "This is the percentage of the population who are actively infected, regardless of whether they are infecting others".

I thought the hospital line was ditched due to a combination of it having to be scaled to match what the line means (and thus looking incorrect to people who are familiar with hospital availability), and the difficulty of assessing the actual capacity. (This is a fairly uninformed opinion, as I haven't read up on that discussion in a while.)

wolverdude commented 4 years ago

@lagerros, could you push the updated regions.csv and regions-gleam.csv to master so that everyone can use the most up-to-date data?

wolverdude commented 4 years ago

@lagerros, we need the .hdf file for Pakistan.

lagerros commented 4 years ago

hdf file sent on slack

To @mathijshenquet, I think the y-axis should say "Active spreaders" rather than "Active infection".

It shows people in the time window when they are most infectious: at the end of the pre-symptomatic period and the beginning of the symptomatic period, before isolation.

I pushed a commit a while back that changed the wording, but for some mysterious reason it later went back to "Active infections" (I think in a new commit by Mathijs).

@wolverdude if you could change that wording, that would be awesome.

wolverdude commented 4 years ago

@lagerros That's a frontend change and not really part of this ticket. I probably won't have any more time to work on stuff today, so just create a ticket for it.

wolverdude commented 4 years ago

The work of creating custom regions and averaging their traces on web export is done. I'm not merging the code yet because it has a different config.yaml checked in, but you can find my branch here.

I only averaged the model traces though, not any of the Foretold or Johns Hopkins data. I don't think those are relevant in this case, though, since I've only aggregated Gleam regions.

wolverdude commented 4 years ago

I can rebase just the non-config changes onto master if that's helpful, but it'll have to wait until tomorrow.

hnykda commented 4 years ago

Let me know if I can be of any help, @wolverdude; what you've got in there seems pretty solid :+1:.

wolverdude commented 4 years ago

It was unclear in the spec what would be used for the weights when averaging the model traces. I just assumed population instead of clarifying. Turns out it's the Factor: columns in the spreadsheet. I'll need to do a minor refactor in order to make this work, but I think I can manage it.

wolverdude commented 4 years ago

I added model_weights into the custom region config, and these are now being used to average the model traces. https://github.com/epidemics/epimodel/tree/wolverdude/pakistan-provinces

wolverdude commented 4 years ago

I've confirmed that it works on the frontend.

[Image: Screen Shot 2020-04-30 at 6 14 32 PM]

lagerros commented 4 years ago

Nice!

Questions:

1) Why is only one child listed here, compared to three model_weights?

2) Same comment as in Slack: the average should use both population and model weight. (Each province is composed of a few regions, but only part of each of those regions. So we want to weigh them both by their size relative to each other and by how much of each forms part of the province.)

wolverdude commented 4 years ago

I merely copied the model_weights directly from the Factor: AJK column in the spreadsheet. They didn't match up exactly with the child regions, which is why I created a separate key.
I am now very confused about how you want these provinces to be weighted. Please specify precisely what formula should be used to compute the weights, and how to know which gleam basins should be used in the weighting.

lagerros commented 4 years ago

Let's use AJK as an example.

Weights are: [0.02, 0.08, 1] and populations are [8500000, 15300000, 4100000].

Transform the population to relative populations by dividing all of them by 8500000+15300000+4100000.

This gives population weights: [0.305, 0.548, 0.147].

Assume for each of those regions there are traces [X, Y, Z] of a time-series of "Infected" in the extdata files.

The new provincial AJK traces file should compute:

[0.02*0.305, 0.08*0.548, 1*0.147]

The resulting array should then be renormalised so that it sums to 1.

Finally, take the dot product of the renormalised array with [X, Y, Z].
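
The steps above can be sketched in plain Python as follows. The function names are hypothetical, and the traces X, Y, Z are placeholder series rather than real extdata values; only the factors and populations come from the AJK example.

```python
def combined_weights(factors, populations):
    """Multiply each Factor by its relative population, then renormalise to sum to 1."""
    total_pop = sum(populations)
    pop_weights = [p / total_pop for p in populations]
    raw = [f * w for f, w in zip(factors, pop_weights)]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate_trace(weights, traces):
    """Dot product of the weights with the regional traces at each time step."""
    return [sum(w * value for w, value in zip(weights, step)) for step in zip(*traces)]


# AJK example from this thread.
factors = [0.02, 0.08, 1]
populations = [8_500_000, 15_300_000, 4_100_000]
weights = combined_weights(factors, populations)
print([round(w, 3) for w in weights])  # [0.031, 0.223, 0.746]

# Placeholder "Infected" traces X, Y, Z (not real data).
X, Y, Z = [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]
ajk_trace = aggregate_trace(weights, [X, Y, Z])
print(ajk_trace)
```

Because of the renormalisation, the total population cancels out, so this should be equivalent to normalising Factor times raw population directly.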

wolverdude commented 4 years ago

@lagerros There's still the issue that Province of city and Factor: AJK don't match up. Should I just be ignoring the Province of city column completely and using only Factor: X instead?

lagerros commented 4 years ago

Yes! Just use the Factor column.

This could have been a lot clearer; my bad.

hnykda commented 4 years ago

Moving to ToDo - it's not actively developed.

epidemics / covid

Generate provincial forecasts for Pakistan #426