Closed willGraham01 closed 1 year ago
Issue seems to be that within process_events_mode_2
; the capabilities_monitor
variable is updated using the subtract
method. Jump to the file with changes here.
This method does not remove keys with negative values, which is the purpose of the for
-loop that comes next. However capabilities_monitor
persists between loop iterations; meaning that next iteration the same key may still have a negative value, and we will attempt to remove it again from set_capabilities_still_available
where it doesn't exist, hence a KeyError
.
This looks to be the cause; output of my debug log:
FacilityID_99_Officer_Pharmacy is in set_capabilities
updated_call is dict_items([('FacilityID_99_Officer_Clinical', 33.0), ('FacilityID_99_Officer_Nursing_and_Midwifery', 10.0), ('FacilityID_99_Officer_Pharmacy', 11.900000000000002)])
capabilities_monitor[FacilityID_99_Officer_Pharmacy] = 3.204738184694463 (was 15.104738184694465)
updated_call is dict_items([('FacilityID_99_Officer_Clinical', 33.0), ('FacilityID_99_Officer_Nursing_and_Midwifery', 10.0), ('FacilityID_99_Officer_Pharmacy', 11.900000000000002)])
capabilities_monitor[FacilityID_99_Officer_Pharmacy] = -8.695261815305539 (was 3.204738184694463)
Removing FacilityID_99_Officer_Pharmacy == FacilityID_99_Officer_Pharmacy from set_capabilities
updated_call is dict_items([('FacilityID_99_Officer_Clinical', 28.0), ('FacilityID_99_Officer_Nursing_and_Midwifery', 16.0), ('FacilityID_99_Officer_Pharmacy', 6.0), ('FacilityID_99_Officer_Radiography', 18.0)])
capabilities_monitor[FacilityID_99_Officer_Pharmacy] = -14.695261815305539 (was -8.695261815305539)
Removing FacilityID_99_Officer_Pharmacy == FacilityID_99_Officer_Pharmacy from set_capabilities
<ERROR>
And in particular on the loop iteration that throws the error:
== New while loop iteration ==
FacilityID_99_Officer_Pharmacy has capacity -8.695261815305539 at start of iteration
updated_call wants to assign 6.0 time to FacilityID_99_Officer_Pharmacy
FacilityID_99_Officer_Pharmacy now has capacity -14.695261815305539
Removing FacilityID_99_Officer_Pharmacy from set_capabilities
<ERROR>
update_call
is still assigning work to this particular officer, despite the fact that they have no more time left.
Suggestions for the cause:
actual_appt_footprint
which then produces update_call
is not not being told about the officers that are no longer available after resolving the latest event at the front of the queue. Thus, it thinks it can assign extra time to an officer it thinks still has capacity, but no longer does.ID_99
has several of their functions/roles updated at once in the final step. It might be possible that the model is only checking for whether 99_Officer_Radiography
has time left to apply a treatment, but is then assuming that ID_99
has time across all of their functions to apply said treatment. This is supported by the fact that at the point of erroring, the ID_99_Officer_Radiography
has around 8 "time" left, and is then assigned a further 18 "time" (which is allowed due to overtime). However, the update_call
then also wants to assigned additional time to ID_99
for Clinical
, Nursing_and_Midwifery
, and Pharmacy
(which is the value already below 0).Pinging @marghe-molaro !
Hi @willGraham01, indeed, the issue appears to be that the "actual" appointment footprint - which can only be calculated when actually running the event - at times relies on a different set of medical officers than those assumed by the "expected" footprint; in the case where one of these officers already had exhausted capabilities before running the event (in which case the event shouldn't have been run at all!) this throws an error when trying to remove that officer (again) from the set_capabilities_still_available.
This constitutes a fundamental problem (rather than a bug) for mode 2, because by definition the actual footprint can only be obtained by running the event, and therefore we cannot determine prior running it whether enough medical capabilities were available to run it in the first place. I would need to discuss with @tbhallett what the strategy should be moving forward (there might be very few instances where the actual footprint requires an extra officer, in which case we could maybe ignore the issue of capabilities for those HSIs, but if a significant number of them do this we might need to re-evaluate how mode 2 is computed significantly).
Long story short, @willGraham01 and @tamuri - for a quick fix to continue with the scale run analysis you could add a safety if statement on line 2458 in healthsystem.py to check the officer is in the set_capabilities_still_available before trying to remove them. I don't think this should significantly affect the scaling of the runs (except maybe if a huge fraction of HSIs do this, which I doubt). Otherwise, could you please pause on this issue for a bit while I consider how we can review mode 2 to account for this please.
If you are happy with the "quick fix" I can submit it as a PR.
Hi @marghe-molaro. My preference would be to switch (for the time being) the profiling runs to avoid mode 2 now that we know that mode is problematic, instead of adding the "quick fix". I can always switch them back when mode 2 is fixed (and it saves us an extra PR on top of #1012 just to fix one particular problem).
Adding the "quick fix" might make us forget about this being a problem further down the line, which could be problematic for trust/reliability reasons.
Also if it's of any insight, this problem seems to crop up 3 times in a month-long simulation (IE running the scale_run
used in profiling and adding a counter for the number of times the "quick-fix" would be needed). Not sure how that's expected to upscale to a longer and larger simulation, but thought it might be useful info to have.
Hi @willGraham01, yes very happy with your proposed solution (although I definitely wouldn't have forgotten about it, it's pretty major in what I'm doing atm!! :) ).
Thanks for the insight, can you remind me what the assumed population size would have been for the month-long simulation you're quoting? It definitely sounds rare enough, but would be good to check which HSIs are affected. I will ask on the programming channel too if anyone can think of HSIs which may use different sets of medical officers in the expected vs actual footprint.
Initial population for the month-long simulation is 50,000. Not sure what it rises to by the end of the month/simulation, but the population dataframe has only been expanded to 51,000 rows so somewhere between 50-51 thousand.
If you would like the rest of the parameters I'm using, they're saved here (that link should hopefully be permanent).
And running the scale_run.py
program like above will produce the error around day 11-17 (takes about 5 mins to get to the error state when running on my laptop), if you want to check for yourself what's going on in the simulation when this happens - afraid I don't know much about the inner workings (sorry)!
Ok 50,000 is pretty large, hopefully that means it is truly a rare occurrence! Will check locally as you suggested.
Ok 50,000 is pretty large, hopefully that means it is truly a rare occurrence! Will check locally as you suggested.
Agree it seems rare. Also, I don't see a better solution than ignoring it!
The current version of the model throws a
KeyError
when attempting to run thescale_run
profiling script.Discovered after returning to issue #686's profiling PR (#1012) and the error being raised after a rebase onto
aace4d1
.To Reproduce:
In a terminal at the root of the TLOmodel repository:
The parameters passed to the
scale_run.py
script to reproduce this bug are:and the KeyError is thrown approximately 35% of the way through the simulation.
Other Information
This bug is not present on commit
267c1e85f
, so was introduced at some point between then andaace4d1
.git bisect
implies that73429a738a86d618dea67cbe48155a7959d72744
is the first bad commit (Tuesday 15th Aug 2023).