Consumables class not using simulation's seed

marghe-molaro commented 1 year ago

The consumables class appears not to be using the simulation's seed, leading to differences in outcomes when running an identical simulation twice with non-zero consumables. This became apparent in test test_healthsystem.py/test_manipulation_of_service_availability after introducing a random hsi queue (PR #881), where two runs of what should have been identical simulations led to different results for certain population sizes if consumables were not set to zero.

tbhallett commented 1 year ago

[Updated]

I created these test to explore this.

This suggests to me:

There isn't a problem with a lack of determinism in the Consumables module per se.
The use of randomise_queue doesn't work precisely in the way I expected (when consumables are 'default')

The routines that look for consumables availability seem to be scheduling new events for the same day. This causes them to be added to the queue, entailing another call on the module's RNG when we randomise the queue, but not when we don't randomise the queue.

So, the RNG is coming out of sync for the consumables calls, depending on whether we do / don't use randomise_queue.

I think the solution would have to a RNG that is dedicated to the queue randomisation stuff, so that use of it (or not) doesn't interfere with other uses of the RNG???

We had to do something similar in Contraception and did it like this at initialise_simulation:

        # Create second random number generator
        self.rng2 = np.random.RandomState(self.rng.randint(2 ** 31 - 1))

What do you think @marghe-molaro?

# This works
def test_determinism_of_hsi_that_run_and_consumables_availabilities(seed, tmpdir):
    """Check that two runs of model with the same seed gives the same sequence of HSI that run and the same state of
    the Consumables class at initiation."""

    def get_hsi_log_and_consumables_state() -> pd.DataFrame:
        """Return state of Consumables at the start of a simulation and the HSI_Event log that occur when running the
         simulation (when all services available)."""
        sim = Simulation(start_date=start_date, seed=seed, log_config={
            'filename': 'tmpfile',
            'directory': tmpdir,
            'custom_levels': {
                "tlo.methods.healthsystem": logging.DEBUG,
            }
        })
        sim.register(*fullmodel(resourcefilepath=resourcefilepath))
        sim.modules['HealthSystem'].parameters['Service_Availability'] = ["*"]
        sim.modules['HealthSystem'].parameters['cons_availability'] = 'default'
        sim.make_initial_population(n=1_000)

        # Initialise consumables and capture its state
        sim.modules['HealthSystem'].consumables.on_start_of_day(sim.date)

        consumables_state_at_init = dict(
            unknown_items=sim.modules['HealthSystem'].consumables._is_unknown_item_available,
            known_items=sim.modules['HealthSystem'].consumables._is_available,
            random_samples=list(sim.modules['HealthSystem'].consumables._rng.random_sample(100))
        )

        sim.simulate(end_date=start_date + pd.DateOffset(days=7))

        return {
            'consumables_state_at_init': consumables_state_at_init,
            'hsi_event': parse_log_file(sim.log_filepath, level=logging.DEBUG)['tlo.methods.healthsystem']['HSI_Event'],
        }

    first_run = get_hsi_log_and_consumables_state()

    # Check that all runs (with the same seed to simulation) are identical
    for _ in range(2):
        next_run = get_hsi_log_and_consumables_state()

        # - Consumables State at Initialisation
        assert next_run['consumables_state_at_init'] == first_run['consumables_state_at_init']

        # - HSI Events
        pd.testing.assert_frame_equal(next_run['hsi_event'], first_run['hsi_event'])

# This works
def test_service_availability_can_be_set_using_list_of_treatment_ids_expected_to_run(seed, tmpdir):
    """Check the two identical runs of model can be produced when the service_availability is set using ['*'] and when
     using the list of TREATMENT_IDs that occur during a run when the service_availability is set using ['*'].
     Here, there is no randomisation of the HSI Event queue."""

    def get_hsi_log(service_availability) -> pd.DataFrame:
        """Return the log of HSI_Events that occur when running the simulation with the `service_availability` set as
        indicated."""
        sim = Simulation(start_date=start_date, seed=seed, log_config={
            'filename': 'tmpfile',
            'directory': tmpdir,
            'custom_levels': {
                "tlo.methods.healthsystem": logging.DEBUG,
            }
        })
        sim.register(*fullmodel(resourcefilepath=resourcefilepath,
                                module_kwargs={'HealthSystem': {'randomise_queue': False}}))
        sim.modules['HealthSystem'].parameters['Service_Availability'] = service_availability
        sim.modules['HealthSystem'].parameters['cons_availability'] = 'default'
        sim.make_initial_population(n=500)

        sim.simulate(end_date=start_date + pd.DateOffset(days=7))

        return parse_log_file(sim.log_filepath, level=logging.DEBUG)['tlo.methods.healthsystem']['HSI_Event']

    # - when specifying service-availability as "*"
    run_with_asterisk = get_hsi_log(service_availability=["*"])

    # - when specifying service-availability as a list of TREATMENT_IDs
    run_with_list = get_hsi_log(service_availability=list(set(run_with_asterisk['TREATMENT_ID'])))

    # Check that HSI event logs are identical
    pd.testing.assert_frame_equal(run_with_asterisk, run_with_list)

# This does not work
def test_hsi_events_that_run_with_and_without_randomisation_are_as_expected(seed, tmpdir):
    """Check the two runs of model that are identical except for the option to randomise/not the HSI_Event queue
     generate a log of HSI Events that are different in the manner expected (i.e. the same events, but shuffled to
     occur in a different order in the day.)"""

    def get_hsi_log(randomise_queue) -> pd.DataFrame:
        """Return the log of HSI_Events that occur when running the simulation with the `service_availability` set as
        indicated."""
        sim = Simulation(start_date=start_date, seed=seed, log_config={
            'filename': 'tmpfile',
            'directory': tmpdir,
            'custom_levels': {
                "tlo.methods.healthsystem": logging.DEBUG,
            }
        })
        sim.register(*fullmodel(resourcefilepath=resourcefilepath,
                                module_kwargs={
                                    'HealthSystem': {
                                        'randomise_queue': randomise_queue,
                                        'mode_appt_constraints': 0,
                                        'use_funded_or_actual_staffing': 'funded_plus',
                                    }
                                },
                                )
                     )
        sim.modules['HealthSystem'].parameters['Service_Availability'] = ['*']
        sim.modules['HealthSystem'].parameters['cons_availability'] = 'default'
        sim.make_initial_population(n=500)

        sim.simulate(end_date=start_date + pd.DateOffset(days=7))

        return parse_log_file(sim.log_filepath, level=logging.DEBUG)['tlo.methods.healthsystem']['HSI_Event']

    # Make two runs that differ only in whether the HSI event queue is randomised.
    run1 = get_hsi_log(randomise_queue=False)
    run2 = get_hsi_log(randomise_queue=True)

    # We expect that the same HSI event should on each day, but in a different order. Check this, by imposing a sort
    # on events and checking that this makes the logs identical.

tbhallett commented 1 year ago

That does indeed fix my tests. The commits for last text and fix is here.

Or, the branch with all the tests I added here, with builds on top of #881.

@marghe-molaro -- please could you take a look and, if happy, merge it into your PR #881?

marghe-molaro commented 1 year ago

@tbhallett I'm not sure I completely understand this interference.

The routines that look for consumables availability seem to be scheduling new events for the same day. This causes them to be added to the queue, entailing another call on the module's RNG when we randomise the queue, but not when we don't randomise the queue.

So, the RNG is coming out of sync for the consumables calls, depending on whether we do / don't use randomise_queue.

Yes, if we randomise the queue this leads to additional calls on the RNG; however the issue is not that outcomes with/without the queue randomisation are different, as that should be expected (and desired!), but rather than when comparing two simulations with rand on and the same seed, these are also different. In other words, yes there are additional calls on the RNG, but why aren't the additional calls deterministic? Why does this "interference" change when repeating identical runs, when it should be the same?

I see why giving the consumables a different seed would avoid the issue, but I am not sure I understand what the root problem is, and whether this fix is addressing it. But I might be missing something! I will have a closer look at your tests.

tbhallett commented 1 year ago

Good point @marghe-molaro

The "root cause" -- I think -- is that the first run (when service_availability = ['']) causes some HSI to be scheduled for after the end of the simulation-- so they are scheduled (incurring a hit on the RNG for the his-queue) but not run. In the second run, we set service_availability = a list of the TREATMENT_IDs that actually did run. This means that some HSI are rejected and not scheduled (thereby not incurring a hit on the RNG for the hsi-queue). In this way, the two runs get to different points in the RNG for the hsi-queue. (This issue was always there but undected before we introduced randomisation to the hsi queue (because, without that, not being scheduled and not running during the simulation were identical). From your initial investigations, it looked* like it was to do with the consumables, but this was only because the consumables class was using the the same RNG as the his-queue. When we separated them it become clearer.

I added some more commits on to the branch I noted above that augment some of the comments and also separate out the RNG that is used for determining the outcomes of diagnostics as well (to help with any future issues.)

Does that sound OK to you?

marghe-molaro commented 1 year ago

Thanks for explaining @tbhallett, the additional scheduled-but-not-run hits on the RNG in the first simulation make perfect sense and the root cause of this!! I think all your proposed changes make sense and indeed should avoid any more headaches like this in the future.

UCL / TLOmodel

Consumables class not using simulation's seed #962