lagerros closed this issue 4 years ago
I would prefer if this logic did not end up in the frontend. Can epimodel accommodate this @gavento?
This seems quite expensive and complex to implement and I would advise against it. Also I doubt the benefits.
@lagerros
New custom regions that consist of arbitrary other regions could be added fairly easily through a join table (or join CSV... we really should think about getting an actual DB at some point). This would not break the existing region model since it would just be tacked onto it. I would feel dirty about doing this sort of thing except that it really is the way I would choose to design the whole region model if I were to refactor it.
However, I'm somewhat hesitant to do this since @gavento thinks it will be misleading. I don't really understand why though. Would someone have a second to explain this to me?
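A minimal sketch of the join-CSV idea, assuming a hypothetical file layout with the columns custom_region, child_region and model_weight (none of these names come from epimodel, and the rows are illustrative placeholders):

```python
import csv
import io
from collections import defaultdict

# Hypothetical join CSV: one row per (custom region, child region) pair.
# Column names and values are illustrative assumptions, not epimodel's schema.
JOIN_CSV = """custom_region,child_region,model_weight
province_x,region_a,0.5
province_x,region_b,0.25
province_y,region_b,0.75
"""


def load_custom_regions(csv_file):
    """Return {custom_region: {child_region: weight}} from a join CSV."""
    regions = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        regions[row["custom_region"]][row["child_region"]] = float(row["model_weight"])
    return dict(regions)


custom_regions = load_custom_regions(io.StringIO(JOIN_CSV))
```

Because the mapping lives in its own table, the existing region records stay untouched, and one child region can belong to several custom regions.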
I think @gavento is wrong about that.
1) This would still be showing multiple trajectories, including uncertainty. Use case is ~the same. We're modelling rough trajectories in regions with tens of millions of inhabitants over many months; and not using it for short-term quantitative modelling. 2) We have good input data for this country. 3) We are in close contact with the decision-makers so can make sure things are interpreted appropriately. 4) Final outputs would be sanity checked before being used.
By default, this solution would only be used in our custom modelling; the plan is not to use it in production on the website.
I think @gavento attached too much significance to the numbers used in Connor's weighted combination, where I expect a rough estimate to still be fine.
We've run into a bit of an issue tabulating population for weighting the linear averages. It appears that the population data in regions-gleam.csv does not remotely line up with the population numbers in the spreadsheet. The Gleam population is on average only 17% of the spreadsheet population, but the ratio varies wildly; otherwise I would chalk it up to some scale factor.
The spreadsheet numbers are clearly the more accurate ones (and more complete - 8/29 Gleam regions do not have population data at all), but I'm not sure what the Gleam numbers represent, so I'm hesitant to just ignore them when calculating the weights.
@lagerros, @gavento do you have any insight into this?
Sorry, we should have pushed our updated regions-gleam.csv, which is consistent with the spreadsheet.
The spreadsheet numbers are the authoritative ones; the gleam ones are a bit outdated.
Would someone have a second to explain this to me?
My understanding: The gleam traces do not quantitatively represent the number of covid cases. The reason for this is that 'active' really means 'infectious', and this is modelled as lasting just a few days. After these days, even though people are still sick, they don't really infect a lot of others anymore, so they have already been moved to 'recovered' (which should really be called 'removed' for this reason).
It would be very helpful to look up an onset-to-outcome distribution, convolve it with the daily new cases, and add this to epimodel and the frontend. If I understand it correctly, this is also Jan's main complaint about using the "healthcare capacity" line (apart from proper labeling/communication about what the line means).
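A rough sketch of that convolution, assuming the onset-to-outcome distribution is given as a discrete probability mass function over whole days. The function name and the example numbers are made up for illustration; nothing here is existing epimodel code.

```python
import numpy as np


def still_sick(daily_new_cases, onset_to_outcome_pmf):
    """Estimate how many people are still sick at the end of each day.

    onset_to_outcome_pmf[k] is the assumed probability that the outcome
    (recovery or death) happens k days after onset.
    """
    new_cases = np.asarray(daily_new_cases, dtype=float)
    pmf = np.asarray(onset_to_outcome_pmf, dtype=float)
    pmf = pmf / pmf.sum()  # make sure the distribution sums to 1

    # Convolving new cases with the delay distribution gives daily outcomes.
    daily_outcomes = np.convolve(new_cases, pmf)[: len(new_cases)]

    # Everyone who has had onset but no outcome yet is still sick.
    return np.cumsum(new_cases) - np.cumsum(daily_outcomes)


# Illustrative numbers only: 10 cases on day 0, outcomes after 1 or 2 days.
example = still_sick([10, 0, 0, 0, 0], [0.0, 0.5, 0.5])
```

With a realistic distribution the resulting curve stays elevated for much longer than the short "infectious" window, which is the gap described above.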
If what @mathijshenquet says is true, doesn't that make the entire curve misleading? People are going to intuitively look at those lines and think, "This is the percentage of the population who are actively infected, regardless of whether they are infecting others".
I thought the hospital line was ditched due to a combination of it having to be scaled to match what the line means (and thus looking incorrect to people who are familiar with hospital availability), and the difficulty of assessing the actual capacity. (This is a fairly uninformed opinion, as I haven't read up on that discussion in a while.)
@lagerros, could you push the updated regions.csv and regions-gleam.csv to master so that everyone can use the most up-to-date data?
@lagerros, we need the .hdf file for Pakistan.
HDF file sent on Slack.
To @mathijshenquet, I think the y-axis should say "Active spreaders" rather than "Active infection".
It shows the people in the time window when they are most infectious -- at the end of the pre-symptomatic period and the beginning of the symptomatic period, before isolation.
I pushed a commit a while back that changed the word, but for some mysterious reason it later went back to "Active infections" (I think in a new commit by Mathijs).
@wolverdude if you could change that wording, that would be awesome.
@lagerros That's a frontend change and not really part of this ticket. I probably won't have any more time to work on stuff today, so just create a ticket for it.
The work of creating custom regions and averaging their traces on web export is done. I'm not merging the code yet because it has a different config.yaml checked in, but you can find my branch here.
I only averaged the model traces though, not any of the Foretold or Johns Hopkins data. Those probably aren't relevant in this case, since I've only aggregated Gleam regions.
I can rebase just the non-config changes onto master if that's helpful, but it'll have to wait until tomorrow.
Let me know if I can be of any help, @wolverdude; what you've got in there seems pretty solid :+1:.
It was unclear in the spec what would be used for the weights when averaging the model traces. I just assumed population instead of clarifying. Turns out it's the "Factor:" columns in the spreadsheet. I'll need to do a minor refactor in order to make this work, but I think I can manage it.
I added model_weights into the custom region config, and these are now being used to average the model traces. https://github.com/epidemics/epimodel/tree/wolverdude/pakistan-provinces
I've confirmed that it works on the frontend.
Nice!
Questions:
1) Why is only one child listed here, compared to three model_weights?
2) Same comment as in Slack: the average should use both population and model weight. (Each province is composed of a few regions, but only part of those regions. So we want to weigh them both by their size in comparison to each other and by how much of each of them forms part of the province.)
I merely copied the model_weights directly from the "Factor: AJK" column in the spreadsheet. They didn't match up exactly with the child regions, which is why I created a separate key.
I am very confused now how you want these provinces to be weighted. Please specify precisely what formula should be used to compute the weights and how to know which gleam basins should be used in the weighting.
Let's use AJK as an example.
Weights are [0.02, 0.08, 1] and populations are [8500000, 15300000, 4100000].
Transform the populations to relative populations by dividing each of them by 8500000+15300000+4100000.
This gives population weights [0.305, 0.548, 0.147].
Assume that for each of those regions there are traces [X, Y, Z] of a time series of "Infected" in the extdata files.
The new provincial AJK trace file should compute [0.02*0.305, 0.08*0.548, 1*0.147].
The resulting array should then be renormalised.
Finally, take the dot product of that array with [X, Y, Z].
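The AJK recipe above can be sketched in a few lines of NumPy. The function name is hypothetical, and the traces X, Y, Z are stood in for by made-up two-day series, since the real ones live in the extdata files.

```python
import numpy as np


def province_trace(model_weights, populations, traces):
    """Combine regional traces into one provincial trace.

    traces has shape (n_regions, n_days). Returns the renormalised
    weights and the weighted trace.
    """
    weights = np.asarray(model_weights, dtype=float)
    pops = np.asarray(populations, dtype=float)

    pop_weights = pops / pops.sum()        # relative populations
    combined = weights * pop_weights       # model weight times population weight
    combined = combined / combined.sum()   # renormalise so the weights sum to 1

    return combined, combined @ np.asarray(traces, dtype=float)


# Weights and populations from the AJK example; the traces are placeholders.
ajk_weights, ajk_trace = province_trace(
    [0.02, 0.08, 1],
    [8_500_000, 15_300_000, 4_100_000],
    [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]],
)
```

For AJK the renormalised weights come out at roughly [0.031, 0.223, 0.746], so the third region dominates the provincial trace.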
@lagerros There's still the issue that the "Province of city" and "Factor: AJK" columns don't match up. Should I just be ignoring the "Province of city" column completely and using only "Factor: X" instead?
Yes! Just use the "Factor" column.
This could have been a lot clearer; my bad.
Moving to ToDo - it's not actively developed.
We have forecasts for ~30 different regions, but we want to turn a weighted linear combination of the traces into forecasts for each of the 7 distinct administrative units.
Connor has provided the relevant weights here, in the "Region to Province map": https://docs.google.com/spreadsheets/d/1IxPMadPxjnphWSKG_6PxmsrCLoXe3cHGp1Ok9kcddPk/edit#gid=1378327731 (you will need to be granted access)
Now we just need a script for doing the aggregation. If it could be added to epimodel and reused in future when modelling new provinces in other countries, that would be great.
Basically, the output of the script should be an additional set of JSON-files in the web-export, labelled "extdata-[add PROVINCE_NAME].json" but where the infected/recovered/active numbers are just a weighted sum of the numbers for the relevant regions. (And the "Statistics" should be recomputed for the new traces.)
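A hedged sketch of what such a script could do. The JSON layout used here (one list per "infected", "recovered" and "active" key) is an assumption made for illustration and is not the actual web-export schema; the function names are hypothetical, and recomputing the "Statistics" is left out.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Assumed key names; the real extdata files may be structured differently.
TRACE_KEYS = ("infected", "recovered", "active")


def aggregate_regions(region_data, weights):
    """Weighted sum of each trace over the regions listed in weights."""
    return {
        key: sum(
            weights[region] * np.asarray(region_data[region][key], dtype=float)
            for region in weights
        ).tolist()
        for key in TRACE_KEYS
    }


def write_province_file(out_dir, province_name, aggregated):
    """Write the aggregated traces to extdata-PROVINCE_NAME.json."""
    path = Path(out_dir) / f"extdata-{province_name}.json"
    path.write_text(json.dumps(aggregated))
    return path


# Made-up two-day traces for two placeholder regions.
region_data = {
    "region_a": {"infected": [1, 2], "recovered": [0, 1], "active": [1, 1]},
    "region_b": {"infected": [3, 4], "recovered": [1, 2], "active": [2, 2]},
}
aggregated = aggregate_regions(region_data, {"region_a": 0.25, "region_b": 0.75})

with tempfile.TemporaryDirectory() as tmp:
    written = write_province_file(tmp, "AJK", aggregated)
    written_name = written.name
    reloaded = json.loads(written.read_text())
```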