A lot of time spent in `do_at_generic_first_appt_non_emergency` due to repeated dataframe access

matt-graham commented 10 months ago

In the latest scheduled profiling run, 21.4% of the overall run time was spent in calls to do_at_generic_first_appt_non_emergency. Looking at the breakdown of time spent inside the function, most of the time is spent in evauating _LocIndexer.__getitem__ (that is performing loc indexing operations on the population dataframe), both directly and as part of SymptomManager.has_what calls.

My understanding of this function is that it acts as an initial trigger for the scheduling of specific HSI events connected with different disease modules, but beyond scheduling further HSI events, should not immediately make any changes to the properties of the individual with the associated person_id ? @tbhallett can I check this is correct.

If this is the case, then we can avoid the repeated dataframe access by doing a single initial read of the individuals properties at the beginning of the function and then passing a datastructure containing these values to all called functions. Ideally we would make this datastructure immutable - for example created a namedtuple to be explicit that the properties are read only and cannot be updated. Given a lot of the dataframe accesses are part of SymptomManager.has_what calls, we would need to refactor this / provide an alternative method to allowing passing the properties directly (as for example a named tuple), and potentially also evaluate the symptoms the individual has once and pass in explicitly to called functions to avoid overhead in reconstructing the symptom list multiple times.

If we do require allowing direct updates to individual properties as part of these calls, and alternative to using an immutable data structure, would be to create a class which will allows changing property values, but records which property are changed and writes these back to the population dataframe at the end of the call.

A possibly nice way to do all of this and also simplify the structure of do_at_generic_first_appt_non_emergency (and analagously do_at_generic_first_appt_emergency) would be to add methods (defaulting to a no-op) to the base Module class which deal with scheduling the necessary follow-on HSI events from a generic first [non-emergency | emergency] appointment for each disease module, something like


class Module:

    ...

    def do_at_generic_first_appt_non_emergency(self, person_id: int, person_properties: namedtuple, symptoms: list[str]):
        """Schedule any HSI events following on from a non-emergency generic first appointment."""
        ...

    def do_at_generic_first_appt_emergency(self, person_id: int, person_properties: namedtuple, symptoms: list[str]):
        """Schedule any HSI events following on from an emergency generic first appointment."""
        ...

with corresponding functions in hsi_generic_first_appts module

def do_at_generic_first_appt_non_emergency(hsi_event, squeeze_factor):
    person_id = hsi_event.target
    person_properties = hsi_event.sim.population.get_individual_properties(person_id)
    symptoms = hsi_event.sim.modules['SymptomManager'].has_what_from_properties(person_properties )
    for module in hsi_event.sim.modules.values():
        module.do_at_generic_first_appt_non_emergency(person_id, person_properties, symptoms)

def do_at_generic_first_appt_emergency(hsi_event, squeeze_factor):
    person_id = hsi_event.target
    person_properties = hsi_event.sim.population.get_individual_properties(person_id)
    symptoms = hsi_event.sim.modules['SymptomManager'].has_what_from_properties(person_properties )
    for module in hsi_event.sim.modules.values():
        module.do_at_generic_first_appt_emergency(person_id, person_properties, symptoms)

which avoids the large number of if statements and module specific logic in the current implementation. Given it's passed to these function we could also pass through squeeze_factor to the modules methods if its thought this will be used at some point in future.

My main concern with this approach would be the difficulty in enforcing the condition that modules do_at_generic_first_appt_non_emergency and do_at_generic_first_appt_emergency only schedule HSI events and do not edit properties of the individual directly (which would invalidate the approach of reading the properties once at beginning of method). We could just rely on documenting this expectation and adding tests to check it is adhered to.

Alternatively at the cost of more complex function signatures, we could make do_at_generic_first_appt_non_emergency and do_at_generic_first_appt_emergency static methods of the modules, without direct access to the module attributes (including the population data frame). In this case we would need to pass through all objects that would be needed for dealing with scheduling following on HSI events as arguments.

tbhallett commented 10 months ago

My understanding of this function is that it acts as an initial trigger for the scheduling of specific HSI events connected with different disease modules, but beyond scheduling further HSI events, should not immediately make any changes to the properties of the individual with the associated person_id ? @tbhallett can I check this is correct.

100% correct

tbhallett commented 10 months ago

This sounds excellent! :-)

tamuri commented 10 months ago

~SymptomManager does have a have_what(person_ids) method which could be used to efficiently get all the symptoms for persons. It's using apply right now but can probably be made more efficient using native operations on the dataframe (maybe, flatten the symptom columns for persons of interest, change to True/False indicating >0, then collect into list...?)~ Hmm, misunderstood your comment. It's for a single individual.

willGraham01 commented 7 months ago

Hi all - in light of Matt's suggestions for reorganising and speeding up how the do_at_generic_first_appt_non_emergency function (and classes that are involved), I've opened #1289 which provides some typehints and a slight refactoring to hopefully make development on this issue slightly easier.

willGraham01 commented 7 months ago

Having taken a dive into the code, I think Matt's idea with some slight modifications is the fastest way to improving performance here.

The Module base class gets two new methods; do_at_generic_appt_{non_}emergency that have the same function signature, and should be overwritten by the disease modules that inherit from Module.


class Module:

  def do_at_generic_appt_non_emergency(
    target,
    read_only_details,
    symptoms,
    diagnosis_runner,
  ) -> hsi_events_to_schedule, hsi_event_scheduling_options, dict_of_df_updates

then the existing functions in the hsi_generic_first_appts module can be reworked as

def do_at_generic_first_appt_non_emergency(hsi_event):
  # Make all necessary dataframe reads into immutable types.
  # Do this once, at the start of the function so we don't repeatedly make DF accesses.

  # Track all changes to the DF that disease modules want to make
  all_wanted_df_updates = {}

  # Loop over all modules in the simulation to see if they want to schedule an event
  for disease_module in modules:
    events_to_schedule, event_scheduling_options, df_updates = module.do_at_generic_appt_non_emergency(...)

    # Schedule all events that were returned
    for event, options in zip(events_to_schedule, event_scheduling_options):
       schedule_hsi(event, **options)

    # Record, but do not yet implement, any DF changes
    all_wanted_df_updates |= df_updates

  # After the loop, make DF updates, then return.

Some comments on this:

Of the disease modules that do make changes to the DF, none of them change column values that the other modules use. So it should be safe to update everything at the end and avoid needing to re-read from the dataframe each iteration through the loop. However, the updates that are being made should probably only be going through when the HSI_Event.apply() method is called, rather than when they are scheduled? Would appreciate some insight into whether this is a grounded belief!
This is also a small step towards a more module-oriented codebase design. The somehwat-complex return signature of the Module-level function is necessary due to how the rest of the HealthSystem, HSI_Event, and supporting classes interact with each other. There's some scope to reduce this complexity with a refactoring if we wanted to - but if we just want to see a performance improvement, this is what we'll need to go with for the time being.
The functions in the hsi_generic_first_appts can actually be turned into class methods, since they are only called within the HSI_GenericNonEmergencyFirstAppt class and their single argument is passed as hsi_event=self.

Thoughts and questions are welcome - I will be experimenting with these changes (and requesting some profiling runs) on https://github.com/UCL/TLOmodel/pull/1292 to reflect what's here.

willGraham01 commented 7 months ago

Raising https://github.com/UCL/TLOmodel/issues/407 here which we're also addressing in this issue

UCL / TLOmodel

A lot of time spent in `do_at_generic_first_appt_non_emergency` due to repeated dataframe access #1237