Closed matt-graham closed 2 years ago
Yes this is a bug.
By design we want only one infection to occur at a time, so we say that the risk of infection with each type of pathogen are mutually exclusive. But we have no guarantee that the sum of the probabilities of being infected within a time-step will be 1.0 or less.
I propose basically scaling the probabilities to within 1.0 with the addition of
probs = probs.apply(lambda row: (row / row.sum() if row.sum() >= 1.0 else row), axis=1)
Whole function:
def sample_outcome(probs: pd.DataFrame, rng: np.random.RandomState):
""" Helper function to randomly sample an outcome for each individual in a population from a set of probabilities
that are specific to each individual.
:param probs: Each row of this dataframe represents the person and the columns are the possible outcomes. The
values are the probability of that outcome for that individual. For each individual, the probabilities of each
outcome are assumed to be independent and mutually exclusive (but not necessarily exhaustive). If they sum to more
than 1.0, then they are (silently) scaled so that they do sum to 1.0.
:param rng: Random Number state to use for the generation of random numbers.
:return: A dict of the form {<index>:<outcome>} where an outcome is selected.
"""
# Scaling to ensure that the sum in each row not exceed 1.0
probs = probs.apply(lambda row: (row / row.sum() if row.sum() >= 1.0 else row), axis=1)
assert (probs.sum(axis=1) <= 1.0).all(), "Probabilities across columns cannot sum to more than 1.0"
# Compare uniform deviate to cumulative sum across columns, after including a "null" column (for no event).
_probs = probs.copy()
_probs['_'] = 1.0 - _probs.sum(axis=1) # add implied "none of these events" category
cumsum = _probs.cumsum(axis=1)
draws = pd.Series(rng.rand(len(cumsum)), index=cumsum.index)
y = cumsum.gt(draws, axis=0)
outcome = y.idxmax(axis=1)
# return as a dict of form {person_id: outcome} only in those cases where the outcome is one of the events.
return outcome.loc[outcome != '_'].to_dict()
Hi @tbhallett I think this is a great fix! But if more accurate, I can look into it and have the incidence converted into odds so we can add OR instead of probabilities with RR. The incidence applied here is essentially a risk of occuring in 3 months timeframe (so a probability) not a rate - then it is easier to convert into odds. (But I might wrong) Please send me the formulas for the breakdown of the incidence rate (per child-year) you applied in the code, thanks!
Closing as this was fixed by #390
While working on a fix for #381 I encountered an
AssertionError
with messageProbabilities across columns cannot sum to more than 1.0
arising from the call totlo.util.sample_outcome
in the following lines insrc/tlo/methods/diarrhoea.py
https://github.com/UCL/TLOmodel/blob/aa0578ca3ace49c1a13731f8e600a940fa116aaf/src/tlo/methods/diarrhoea.py#L1281-L1287
Specifically the per-pathogen probabilities which lead to this error are
As far as I can tell, while the probabilities in
probs_of_aquiring_pathogen
will be guaranteed to be individually in[0, 1]
(asinc_of_acquiring_pathogen
andself.fraction_of_a_year_until_next_polling_event
are both positive sonp.exp(-inc_of_acquiring_pathogen * self.fraction_of_a_year_until_next_polling_event)
is in[0, 1]
), there is nothing restricting their sum to be less than or equal to one.It looks like the probability expression corresponds to something like modelling the event of acquiring a pathogen as a time-homogeneous Poisson process with the per-pathogen rates determined by
inc_of_acquiring_pathogen
.np.exp(-inc_of_acquiring_pathogen * self.fraction_of_a_year_until_next_polling_event)
is then the probability of there being zero infection events in the interval defined byself.fraction_of_a_year_until_next_polling_event
and so probability of at least one infection event occurring is one minus that probability?If those are the modelling assumptions (quite possibly I'm wrong here however!) I think the current method of simulating events by selecting infection by a particular pathogen as an outcome with probabilities determined by the probability of an infection event for each pathogen in the time interval may be invalid. This seems to be assuming that the infection events are mutually exclusive (which would require the probabilities to sum to less than one) while it seems from the probability calculation the infection events are being modelled as following independent processes and so multiple infection events could occur in a time window.