Investigate large number of year one non-disease deaths in NNMM model

krosenfeld-IDM commented 3 months ago

On commit 9a1ba81 we're seeing non-disease mortality spike at the beginning of the simulation:

This may be due to `dod` not being offset by the current tick as agents are activated:

model.population.dod[istart:iend] = pdsod(model.population.dob[istart:iend], max_year=100)   # make use of the fact that dob[istart:iend] is currently 0

see https://github.com/InstituteforDiseaseModeling/laser/blob/clorton/cleanup-for-merge/nnmm/measles.ipynb
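
A minimal sketch of the suspected problem, not the LASER code itself. `draw_age_at_death` is a hypothetical stand-in for `pdsod` (its real signature and semantics are not shown in this thread), and the uniform lifetimes are purely illustrative. The assumption is that the sampler returns an age at death in days for a newborn, so the stored `dod` has to be shifted by the activation tick; otherwise every death date is measured from tick 0.

```python
import numpy as np

rng = np.random.default_rng(0)


def draw_age_at_death(n, rng):
    """Hypothetical stand-in for pdsod: age at death, in days, for n newborns."""
    # Uniform lifetimes up to 100 years are illustrative only, not a real survival curve.
    return rng.integers(1, 100 * 365, size=n, dtype=np.int32)


def activate_newborns(dob, dod, istart, iend, tick, rng):
    """Activate agents istart..iend as newborns at the current tick."""
    n = iend - istart
    dob[istart:iend] = tick
    # Offset by the current tick so dod is an absolute simulation day,
    # not an age measured from day 0 of the simulation.
    dod[istart:iend] = tick + draw_age_at_death(n, rng)


capacity = 1000
dob = np.zeros(capacity, dtype=np.int32)
dod = np.zeros(capacity, dtype=np.int32)

tick = 400
activate_newborns(dob, dod, 0, 10, tick, rng)
```

Without the `tick +` term, an agent born on day 400 with a drawn lifetime of 200 days would already be "overdue" at birth, which is one way deaths could pile up around activation.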

clorton commented 3 months ago

I am still seeing this behavior even with the fix in do_births().

KevinMcCarthyAtIDM commented 3 months ago

My suggestion: can you write out the agent ages and dates of death on Day 1? I think finding this bug would be helped by knowing who is drawing those deaths.

One potential clue - First year-of-life mortality is pretty much the highest single year. I think in the Nigeria survival curve, some 12% of kids die before the age of 1.
So the most obvious potential culprit that stems from this: for most ages, $S_{a-1}^{a} \approx S_{a}^{a+1}$. So, e.g., to figure out whether an agent dies in the next year, if an agent is 30 years and 183 days old, it doesn't really matter if you use $S_{30}^{31}$ or $S_{31}^{32}$, or half a year of exposure to the first mortality rate and half a year to the second; you'll get it approximately right either way. But! $S_{0}^{1} \not\approx S_{1}^{2}$. So depending on how the draw is being handled, if we're not offsetting the "mortality exposure" for sub-annual age, a kid who's 11 months old could be exposed to a full year of the 0->1 mortality rate, which will end up with way too many deaths.
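
To make the 11-month-old example concrete, here is a rough sketch under stated assumptions. It treats the hazard as constant within each year of age (itself a simplification, since infant mortality is front-loaded), takes `q[0] = 0.12` from the ~12% figure above, and uses a made-up `q[1] = 0.02`. The function name is hypothetical and is not part of the LASER code.

```python
def death_prob_next_year(age_years, frac, q):
    """Probability of dying within the next year for an agent aged age_years + frac.

    q[a] is the annual probability of dying between exact ages a and a + 1.
    Assumes a constant hazard within each single year of age.
    """
    # Survive the remaining (1 - frac) of the current year of age...
    surv_rest_of_this_age = (1.0 - q[age_years]) ** (1.0 - frac)
    # ...then the first `frac` of the next year of age.
    surv_start_of_next_age = (1.0 - q[age_years + 1]) ** frac
    return 1.0 - surv_rest_of_this_age * surv_start_of_next_age


q = [0.12, 0.02]  # q[0] from the discussion; q[1] is a hypothetical value

naive = q[0]                                   # full year of 0->1 mortality
offset = death_prob_next_year(0, 11 / 12, q)   # 1 month at q[0], 11 months at q[1]
print(f"naive={naive:.3f} offset={offset:.3f}")
```

With these numbers the offset probability comes out at roughly 0.03 versus 0.12 for the naive draw, so skipping the sub-annual offset would overstate deaths among older infants several-fold.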

If that isn't the problem, there's a potentially more subtle one. Most of that 0->1 mortality actually occurs in the first day, first week, and first month of life. Depending on how we initialize, this could end up with "too many" kids under 12 months at initialization, also being exposed to that high mortality rate.
What I mean is: if 100k kids are born, and 88k survive to year 1, then using the annual averages to initialize age would tell you there should be about 94k 0->1 year olds at any time. But reality is probably that there are more like 89k or 90k, because that mortality is so stacked into the first month. This one's a bit more subtle; I don't know if it should cause that spike vs. just being "wrong" if someone cared about sub-annual ages. But if the first idea above doesn't fix the issue, this is where I would look next. We can figure out how to get the appropriate numbers for day 1, days 2-7, days 8-30, and days 31-364 survival, and I think the date of death code would have to be updated to not assume annual binning but take a vector of ages that it's interpolating between. That's something that should eventually happen anyway, but may not need to happen now unless it specifically fixes this bug.
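
A sketch of both halves of this idea, under loud assumptions: only the 100k and 88k endpoints come from the comment above, the intermediate survivor counts are invented for illustration, and `mean_alive` and `draw_age_at_death_days` are hypothetical helpers rather than functions from `kmcurve.py`. The first helper compares the average number of infants alive under annual-only versus sub-annual breakpoints; the second draws an age at death by interpolating between an arbitrary vector of ages instead of assuming annual bins.

```python
import numpy as np

# Survivors per 100k births at each age breakpoint (days).
# Only the first and last values come from the discussion; the rest are made up.
age_days = np.array([0.0, 1.0, 7.0, 30.0, 365.0])
survivors = np.array([100_000.0, 97_000.0, 95_000.0, 93_000.0, 88_000.0])


def mean_alive(age_days, survivors):
    """Time-averaged number alive over the age range (trapezoid rule)."""
    widths = np.diff(age_days)
    mids = (survivors[:-1] + survivors[1:]) / 2.0
    return float(np.sum(widths * mids) / (age_days[-1] - age_days[0]))


def draw_age_at_death_days(n, age_days, survivors, rng):
    """Inverse-CDF draw of age at death, conditional on dying before the last breakpoint."""
    cdf = (survivors[0] - survivors) / (survivors[0] - survivors[-1])
    return np.interp(rng.random(n), cdf, age_days)


annual_only = mean_alive(age_days[[0, -1]], survivors[[0, -1]])  # linear between 0 and 365 days
sub_annual = mean_alive(age_days, survivors)

rng = np.random.default_rng(0)
ages = draw_age_at_death_days(10_000, age_days, survivors, rng)
share_first_month = float(np.mean(ages <= 30.0))
```

With these invented breakpoints the annual-only estimate is 94k infants alive on average, while the sub-annual version gives about 91k, and more than half of the first-year deaths land in the first 30 days. Real day 1 / days 2-7 / days 8-30 figures would be needed to say how close this gets to the 89k-90k guess.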

krosenfeld-IDM commented 3 months ago

That's a good point @KevinMcCarthyAtIDM. I think we're using US statistics for this notebook https://github.com/InstituteforDiseaseModeling/laser/blob/e6b00057ead082c8f38e3daf2fa0c6f97d4bb58a/src/idmlaser/kmcurve.py#L6

so the effect should be even larger if we switch to NGA statistics. I have what should be the equivalent table in the GEOMED repository if you want to try that @clorton

krosenfeld-IDM commented 3 months ago

I'm seeing the same effect in the current GEOMED version (https://github.com/gatesfoundation/GEOMED24/commit/0dea1e11d26ded21eda466da37e82b17cc67e72f) although the yearly deaths seem to rise linearly:


I think it's likely that the first year will be part of a burn-in anyway and the effect goes away after that. But good to know about and glad it is documented here! Nice catch @KevinMcCarthyAtIDM.

clorton commented 3 months ago

I would expect yearly deaths to increase as the population increases and is not offset by any increasing average lifespan.

KevinMcCarthyAtIDM commented 3 months ago

> I would expect yearly deaths to increase as the population increases and is not offset by any increasing average lifespan.

Agreed, that's normal. But the spike in year 1 isn't. This is probably not a breaking bug for Katherine's work, since it disappears after burn-in. But also of course I'd feel better making sure we understand it. I think the quickest path is looking at the age distributions of all agents who die in Year 1 vs. all agents who die in some later year. The explanations I gave above would imply that the histogram in Year 1 should have more kids in the 1-2 year age bin than later years. If we see that, great, we understand it enough to know it's not the most critical fix for now; but if we don't see that, then I'm confused and want to make sure it's not something weirder...
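
One way the comparison could be scripted, as a sketch only. It assumes per-agent `dob` and `dod` arrays in days, with negative `dob` for agents born before the simulation starts; that convention and the synthetic population below are assumptions for illustration, not the notebook's actual data, and `age_at_death_density` is a hypothetical helper.

```python
import numpy as np


def age_at_death_density(dob, dod, first_tick, last_tick, bin_edges_years):
    """Normalized histogram of age at death for agents dying in [first_tick, last_tick)."""
    died = (dod >= first_tick) & (dod < last_tick)
    ages_years = (dod[died] - dob[died]) / 365.0
    density, _ = np.histogram(ages_years, bins=bin_edges_years, density=True)
    return density


# Synthetic population, for demonstration only.
rng = np.random.default_rng(1)
n = 50_000
dob = rng.integers(-80 * 365, 0, size=n)        # born before the simulation starts
dod = dob + rng.integers(1, 100 * 365, size=n)  # made-up lifetimes

edges = np.arange(0, 101, 1)  # one-year age bins, 0 to 100
year1 = age_at_death_density(dob, dod, 0, 365, edges)
later = age_at_death_density(dob, dod, 365, 3 * 365, edges)

# A large positive entry in the youngest bins would point at the infant-mortality explanations.
excess = year1 - later
```

Plotting densities rather than counts keeps the two windows comparable even though they cover different numbers of years.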

krosenfeld-IDM commented 3 months ago

Here is what I'm seeing for the age distributions, @KevinMcCarthyAtIDM, comparing the first year of the sim and the rest (3-year sim):

https://github.com/gatesfoundation/GEOMED24/blob/features/notebooks/measles.py Note that I switched to using the same code as @clorton to reduce confusion.

Sorry - these are infections! Updating now for deaths...

krosenfeld-IDM commented 3 months ago

ok - here are the deaths as measured by:

https://github.com/gatesfoundation/GEOMED24/blob/2cadae686ee80f1967eb5e7a620dd1b00a939540/notebooks/measles.py#L633

Distributions look fairly similar.

krosenfeld-IDM commented 3 months ago

Updating with NGA deaths and age pyramid (and plotting density for readability)

https://github.com/gatesfoundation/GEOMED24/blob/922c64731e6bc3de4cfb1b6767479da9a4ca949a/notebooks/measles.py

InstituteforDiseaseModeling / laser

Investigate large number of year one non-disease deaths in NNMM model #28