`AssertionError` in `DiarrhoeaPollingEvent` due to probabilities summing to more than 1

matt-graham commented 3 years ago

While working on a fix for #381 I encountered an AssertionError with message Probabilities across columns cannot sum to more than 1.0 arising from the call to tlo.util.sample_outcome in the following lines in src/tlo/methods/diarrhoea.py

https://github.com/UCL/TLOmodel/blob/aa0578ca3ace49c1a13731f8e600a940fa116aaf/src/tlo/methods/diarrhoea.py#L1281-L1287

Specifically the per-pathogen probabilities which lead to this error are

rotavirus          0.320003
shigella           0.228597
adenovirus         0.121112
cryptosporidium    0.065714
campylobacter      0.195175
ETEC               0.459118
sapovirus          0.199254
norovirus          0.362912
astrovirus         0.112456
tEPEC              0.125281

As far as I can tell, while the probabilities in probs_of_aquiring_pathogen will be guaranteed to be individually in [0, 1] (as inc_of_acquiring_pathogen and self.fraction_of_a_year_until_next_polling_event are both positive so np.exp(-inc_of_acquiring_pathogen * self.fraction_of_a_year_until_next_polling_event) is in [0, 1]), there is nothing restricting their sum to be less than or equal to one.

It looks like the probability expression corresponds to something like modelling the event of acquiring a pathogen as a time-homogeneous Poisson process with the per-pathogen rates determined by inc_of_acquiring_pathogen. np.exp(-inc_of_acquiring_pathogen * self.fraction_of_a_year_until_next_polling_event) is then the probability of there being zero infection events in the interval defined by self.fraction_of_a_year_until_next_polling_event and so probability of at least one infection event occurring is one minus that probability?

If those are the modelling assumptions (quite possibly I'm wrong here however!) I think the current method of simulating events by selecting infection by a particular pathogen as an outcome with probabilities determined by the probability of an infection event for each pathogen in the time interval may be invalid. This seems to be assuming that the infection events are mutually exclusive (which would require the probabilities to sum to less than one) while it seems from the probability calculation the infection events are being modelled as following independent processes and so multiple infection events could occur in a time window.

tbhallett commented 3 years ago

Yes this is a bug.

By design we want only one infection to occur at a time, so we say that the risk of infection with each type of pathogen are mutually exclusive. But we have no guarantee that the sum of the probabilities of being infected within a time-step will be 1.0 or less.

I propose basically scaling the probabilities to within 1.0 with the addition of

probs = probs.apply(lambda row: (row / row.sum() if row.sum() >= 1.0 else row), axis=1)

Whole function:

def sample_outcome(probs: pd.DataFrame, rng: np.random.RandomState):
    """ Helper function to randomly sample an outcome for each individual in a population from a set of probabilities
    that are specific to each individual.
    :param probs: Each row of this dataframe represents the person and the columns are the possible outcomes. The
    values are the probability of that outcome for that individual. For each individual, the probabilities of each
    outcome are assumed to be independent and mutually exclusive (but not necessarily exhaustive). If they sum to more
    than 1.0, then they are (silently) scaled so that they do sum to 1.0.
    :param rng: Random Number state to use for the generation of random numbers.
    :return: A dict of the form {<index>:<outcome>} where an outcome is selected.
    """

    # Scaling to ensure that the sum in each row not exceed 1.0
    probs = probs.apply(lambda row: (row / row.sum() if row.sum() >= 1.0 else row), axis=1)

    assert (probs.sum(axis=1) <= 1.0).all(), "Probabilities across columns cannot sum to more than 1.0"

    # Compare uniform deviate to cumulative sum across columns, after including a "null" column (for no event).
    _probs = probs.copy()
    _probs['_'] = 1.0 - _probs.sum(axis=1)  # add implied "none of these events" category
    cumsum = _probs.cumsum(axis=1)
    draws = pd.Series(rng.rand(len(cumsum)), index=cumsum.index)
    y = cumsum.gt(draws, axis=0)
    outcome = y.idxmax(axis=1)

    # return as a dict of form {person_id: outcome} only in those cases where the outcome is one of the events.
    return outcome.loc[outcome != '_'].to_dict()

inesll commented 3 years ago

Hi @tbhallett I think this is a great fix! But if more accurate, I can look into it and have the incidence converted into odds so we can add OR instead of probabilities with RR. The incidence applied here is essentially a risk of occuring in 3 months timeframe (so a probability) not a rate - then it is easier to convert into odds. (But I might wrong) Please send me the formulas for the breakdown of the incidence rate (per child-year) you applied in the code, thanks!

matt-graham commented 2 years ago

Closing as this was fixed by #390

UCL / TLOmodel

`AssertionError` in `DiarrhoeaPollingEvent` due to probabilities summing to more than 1 #385