[BUG] `update_call` assigns work to officers with no time available

willGraham01 commented 1 year ago

The current version of the model throws a KeyError when attempting to run the scale_run profiling script.

Discovered after returning to issue #686's profiling PR (#1012) and the error being raised after a rebase onto aace4d1.

To Reproduce:

In a terminal at the root of the TLOmodel repository:

$ mamba create -y -n tlo-test python=3.8
$ mamba activate tlo-test
$ (tlo-test) git status
On branch 'master'
Your branch is up-to-date with 'origin/master'.
$ (tlo-test) pip install -r requirements/dev.txt
...
$ (tlo-test) pip install -e .
$ (tlo-test) python src/scripts/profiling/scale_run.py <PARAMETERS HERE>
  File "src/scripts/profiling/scale_run.py", line 201, in <module>
    sim.simulate(end_date=end_date)
  File "/home/ccaegra/Documents/TLO/TLOmodel/src/tlo/simulation.py", line 234, in simulate
    self.fire_single_event(event, date)
  File "/home/ccaegra/Documents/TLO/TLOmodel/src/tlo/simulation.py", line 279, in fire_single_event
    event.run()
  File "/home/ccaegra/Documents/TLO/TLOmodel/src/tlo/events.py", line 66, in run
    self.apply(self.target)
  File "/home/ccaegra/Documents/TLO/TLOmodel/src/tlo/methods/healthsystem.py", line 2607, in apply
    self.process_events_mode_2(hold_over)
  File "/home/ccaegra/Documents/TLO/TLOmodel/src/tlo/methods/healthsystem.py", line 2459, in process_events_mode_2
    set_capabilities_still_available.remove(officer)
KeyError: 'FacilityID_91_Officer_Pharmacy'

The parameters passed to the scale_run.py script to reproduce this bug are:

python src/scripts/profiling/scale_run.py \
    --years 0 \
    --months 1 \
    --initial-population 50000 \
    --tlo-dir . \
    --output-dir ./outputs \
    --log-filename scale_run_profiling \
    --log-level DEBUG \
    --show-progress-bar \
    --seed 0 \
    --mode-appt-constraints 2

and the KeyError is thrown approximately 35% of the way through the simulation.

Other Information

This bug is not present on commit 267c1e85f, so was introduced at some point between then and aace4d1.

git bisect implies that 73429a738a86d618dea67cbe48155a7959d72744 is the first bad commit (Tuesday 15th Aug 2023).

willGraham01 commented 1 year ago

Issue seems to be that within process_events_mode_2; the capabilities_monitor variable is updated using the subtract method. Jump to the file with changes here.

This method does not remove keys with negative values, which is the purpose of the for-loop that comes next. However capabilities_monitor persists between loop iterations; meaning that next iteration the same key may still have a negative value, and we will attempt to remove it again from set_capabilities_still_available where it doesn't exist, hence a KeyError.

This looks to be the cause; output of my debug log:

FacilityID_99_Officer_Pharmacy is in set_capabilities
  updated_call is dict_items([('FacilityID_99_Officer_Clinical', 33.0), ('FacilityID_99_Officer_Nursing_and_Midwifery', 10.0), ('FacilityID_99_Officer_Pharmacy', 11.900000000000002)])
    capabilities_monitor[FacilityID_99_Officer_Pharmacy] = 3.204738184694463 (was 15.104738184694465)
  updated_call is dict_items([('FacilityID_99_Officer_Clinical', 33.0), ('FacilityID_99_Officer_Nursing_and_Midwifery', 10.0), ('FacilityID_99_Officer_Pharmacy', 11.900000000000002)])
    capabilities_monitor[FacilityID_99_Officer_Pharmacy] = -8.695261815305539 (was 3.204738184694463)
    Removing FacilityID_99_Officer_Pharmacy == FacilityID_99_Officer_Pharmacy from set_capabilities
  updated_call is dict_items([('FacilityID_99_Officer_Clinical', 28.0), ('FacilityID_99_Officer_Nursing_and_Midwifery', 16.0), ('FacilityID_99_Officer_Pharmacy', 6.0), ('FacilityID_99_Officer_Radiography', 18.0)])
    capabilities_monitor[FacilityID_99_Officer_Pharmacy] = -14.695261815305539 (was -8.695261815305539)
    Removing FacilityID_99_Officer_Pharmacy == FacilityID_99_Officer_Pharmacy from set_capabilities
<ERROR>

And in particular on the loop iteration that throws the error:

== New while loop iteration ==
 FacilityID_99_Officer_Pharmacy has capacity -8.695261815305539 at start of iteration
 updated_call wants to assign 6.0 time to FacilityID_99_Officer_Pharmacy
 FacilityID_99_Officer_Pharmacy now has capacity -14.695261815305539
 Removing FacilityID_99_Officer_Pharmacy from set_capabilities
<ERROR>

update_call is still assigning work to this particular officer, despite the fact that they have no more time left.

Suggestions for the cause:

The actual_appt_footprint which then produces update_call is not not being told about the officers that are no longer available after resolving the latest event at the front of the queue. Thus, it thinks it can assign extra time to an officer it thinks still has capacity, but no longer does.
The model might be allocating resources incorrectly. Note that ID_99 has several of their functions/roles updated at once in the final step. It might be possible that the model is only checking for whether 99_Officer_Radiography has time left to apply a treatment, but is then assuming that ID_99 has time across all of their functions to apply said treatment. This is supported by the fact that at the point of erroring, the ID_99_Officer_Radiography has around 8 "time" left, and is then assigned a further 18 "time" (which is allowed due to overtime). However, the update_call then also wants to assigned additional time to ID_99 for Clinical, Nursing_and_Midwifery, and Pharmacy (which is the value already below 0).

tamuri commented 1 year ago

Pinging @marghe-molaro !

marghe-molaro commented 1 year ago

Hi @willGraham01, indeed, the issue appears to be that the "actual" appointment footprint - which can only be calculated when actually running the event - at times relies on a different set of medical officers than those assumed by the "expected" footprint; in the case where one of these officers already had exhausted capabilities before running the event (in which case the event shouldn't have been run at all!) this throws an error when trying to remove that officer (again) from the set_capabilities_still_available.

This constitutes a fundamental problem (rather than a bug) for mode 2, because by definition the actual footprint can only be obtained by running the event, and therefore we cannot determine prior running it whether enough medical capabilities were available to run it in the first place. I would need to discuss with @tbhallett what the strategy should be moving forward (there might be very few instances where the actual footprint requires an extra officer, in which case we could maybe ignore the issue of capabilities for those HSIs, but if a significant number of them do this we might need to re-evaluate how mode 2 is computed significantly).

Long story short, @willGraham01 and @tamuri - for a quick fix to continue with the scale run analysis you could add a safety if statement on line 2458 in healthsystem.py to check the officer is in the set_capabilities_still_available before trying to remove them. I don't think this should significantly affect the scaling of the runs (except maybe if a huge fraction of HSIs do this, which I doubt). Otherwise, could you please pause on this issue for a bit while I consider how we can review mode 2 to account for this please.

If you are happy with the "quick fix" I can submit it as a PR.

willGraham01 commented 1 year ago

Hi @marghe-molaro. My preference would be to switch (for the time being) the profiling runs to avoid mode 2 now that we know that mode is problematic, instead of adding the "quick fix". I can always switch them back when mode 2 is fixed (and it saves us an extra PR on top of #1012 just to fix one particular problem).

Adding the "quick fix" might make us forget about this being a problem further down the line, which could be problematic for trust/reliability reasons.

Also if it's of any insight, this problem seems to crop up 3 times in a month-long simulation (IE running the scale_run used in profiling and adding a counter for the number of times the "quick-fix" would be needed). Not sure how that's expected to upscale to a longer and larger simulation, but thought it might be useful info to have.

marghe-molaro commented 1 year ago

Hi @willGraham01, yes very happy with your proposed solution (although I definitely wouldn't have forgotten about it, it's pretty major in what I'm doing atm!! :) ).

Thanks for the insight, can you remind me what the assumed population size would have been for the month-long simulation you're quoting? It definitely sounds rare enough, but would be good to check which HSIs are affected. I will ask on the programming channel too if anyone can think of HSIs which may use different sets of medical officers in the expected vs actual footprint.

willGraham01 commented 1 year ago

Initial population for the month-long simulation is 50,000. Not sure what it rises to by the end of the month/simulation, but the population dataframe has only been expanded to 51,000 rows so somewhere between 50-51 thousand.

If you would like the rest of the parameters I'm using, they're saved here (that link should hopefully be permanent).

And running the scale_run.py program like above will produce the error around day 11-17 (takes about 5 mins to get to the error state when running on my laptop), if you want to check for yourself what's going on in the simulation when this happens - afraid I don't know much about the inner workings (sorry)!

marghe-molaro commented 1 year ago

Ok 50,000 is pretty large, hopefully that means it is truly a rare occurrence! Will check locally as you suggested.

tbhallett commented 1 year ago

Ok 50,000 is pretty large, hopefully that means it is truly a rare occurrence! Will check locally as you suggested.

Agree it seems rare. Also, I don't see a better solution than ignoring it!

UCL / TLOmodel

[BUG] `update_call` assigns work to officers with no time available #1089

To Reproduce:

Other Information