matt-graham opened this issue 3 years ago
After the changes made in #287 a corresponding profiling run as described above gives the following SnakeViz output
which is significantly improved, but there are still some bottlenecks, so I will keep this issue open for now to keep track of progress.
After the changes made in #303 a profiling run as described above gives the following SnakeViz output
Next :dart:: `get_squeeze_factors`
After the changes made in #308, a profiling run as described above gives the following SnakeViz output
As running for longer simulation times is now tractable without being overly onerous, going forwards I'll use a 6 month simulation run of `scale_run.py` as the baseline, to give a better reflection of the end goal of doing a full 20 year run. I'll also switch from cProfile / SnakeViz to pyinstrument, as it seems to give less biased figures due to its lower overhead, is more directly interpretable since it directly estimates the time spent at each level of the call tree, and its text output is easy to include in PRs rather than screenshotting SnakeViz!
After the changes made in #308, the pyinstrument profiling output for a 6 month run of `scale_run.py` is as below.
pyinstrument profiling output for a 6 month run using 5fcd391

Summary of recent progress in reducing `scale_run.py` run times:

- pyinstrument profiling output for a 6 month run using 5fcd391 (pre-PR): 4060s total
- pyinstrument profiling output for a 6 month run using b290f04 (post-PR): 3120s total
- pyinstrument profiling output for a 6 month run using ad0e751 (pre-PR): 4990s total
- pyinstrument profiling output for a 6 month run using 30d6c1e (post-PR): 3910s total
- `Counter` update method to sum footprints
- `capabilities_coefficient` once only

Summary of recent PRs reducing `scale_run.py` run times:
A full 20 year / 20k population run of `scale_run.py` with profiling enabled now takes a relatively manageable five and a half hours (and probably a bit less without the profiling overhead). We can probably therefore consider increasing the computational demands of the profiling runs. @tbhallett are there any changes you think it would be worth making to the profiling run configurations, e.g. increasing the simulation population or adding in further disease modules?
The SnakeViz breakdown for a 20 year / 20k population run on current master (a691b72) is as follows
The ten event methods with the highest proportions of the overall run time are as follows:
- `healthsystem.HealthSystemScheduler.apply` — 27.7%
- `healthseekingbehaviour.HealthSeekingBehaviourPoll.apply` — 14.7%
- `pregnancy_supervisor.PregnancySupervisorEvent.apply` — 8.56%
- `cardio_metabolic_disorders.CardioMetabolicDisorders_MainPollingEvent.apply` — 7.16%
- `symptommanager.SymptomManager_SpuriousSymptomResolve.apply` — 5.20%
- `labour.BirthEvent.apply` — 4.92%
- `symptommanager.SymptomManager_SpuriousSymptomOnset.apply` — 4.74%
- `postnatal_supervisor.PostnatalSupervisorEvent.apply` — 4.46%
- `postnatal_supervisor.PostnatalWeekOneEvent.apply` — 2.82%
- `malaria.MalariaCureEvent.apply` — 2.55%

`HealthSystemScheduler.apply` and `HealthSeekingBehaviourPoll.apply` are still the two largest time consumers, together taking around 42.4% of the overall run time, though this is significantly improved compared to before some of the recent optimisation work, when these two methods together constituted the vast majority of the run time.
Within `HealthSystemScheduler.apply`, `HealthSystem.impose_beddays_footprint` is quite costly, constituting 6.36% of the overall run time and 23.0% of the time in `HealthSystemScheduler.apply`; hopefully the changes in #258 will help reduce this and/or enable further optimizations. `HealthSystem.get_squeeze_factors` is also relatively costly (3.84% of overall run time, 13.8% of time in `HealthSystemScheduler.apply`) and there may be scope for some further optimizations there, though given there have already been several rounds of optimization of `get_squeeze_factors` I suspect there is not too much room for further improvement.
Within `HealthSeekingBehaviourPoll.apply`, `HealthSystem.schedule_hsi_event` is the largest contributor to run time, with 9.19% of the overall run time and around 62.4% of the time in `HealthSeekingBehaviourPoll.apply`. As commented in #346, as many similar events are scheduled simultaneously in `HealthSeekingBehaviourPoll.apply`, there is potentially room for improvement here by adding a 'batched' `schedule_hsi_event` equivalent which avoids redundantly repeating checks etc. when many of the attributes of the HSI events being scheduled are shared.
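As a rough sketch of what such a batched equivalent might look like (all names and checks here are invented for illustration, not the actual TLOmodel API), the idea is to hoist the shared validation out of the per-event loop:

```python
import heapq

# Hypothetical sketch (invented names, not the TLOmodel API): schedule many
# HSI events that share the same opening date and priority, performing the
# shared validity checks once for the whole batch rather than once per event.
def schedule_hsi_events_batch(queue, events, topen, priority):
    # Shared checks done a single time for the batch.
    if priority not in (0, 1, 2):
        raise ValueError("invalid priority")
    if topen < 0:
        raise ValueError("topen must not be in the past")
    for counter, event in enumerate(events):
        # Per-event work reduces to a cheap heap push; the counter
        # breaks ties so events keep their submission order.
        heapq.heappush(queue, (priority, topen, counter, event))

queue = []
schedule_hsi_events_batch(queue, ["hsi_a", "hsi_b", "hsi_c"], topen=5, priority=1)
```

The per-event cost then drops to a single heap push, which is the pattern a batched scheduler could exploit when most event attributes are shared.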
Within `pregnancy_supervisor.PregnancySupervisorEvent.apply` there is no obvious bottleneck, with the time split relatively evenly over many different methods, so there do not seem to be obvious avenues for significant improvement.
Within `cardio_metabolic_disorders.CardioMetabolicDisorders_MainPollingEvent.apply` nearly all the time is being spent evaluating the `predict` method of linear models. With the total time spent in `LinearModel.predict` at 13.8% of the total run time, this suggests there may be some value in trying to further optimize the linear model predict function in general (possibly exploring some of the ideas mentioned in #267). Alternatively, specifically in the context of `cardio_metabolic_disorders.CardioMetabolicDisorders_MainPollingEvent.apply`, it may be worth exploring whether optimizing the linear model implementations there, e.g. by creating custom predict functions to use in `LinearModel.custom`, would be worthwhile.
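To illustrate the idea (this is a generic sketch, not the actual TLOmodel `LinearModel` interface; the column names and coefficients are made up): a custom predict function computes the prediction for the whole population DataFrame in a few vectorised operations, instead of evaluating many declarative predictor objects separately.

```python
import numpy as np
import pandas as pd

# Hypothetical custom predict function over made-up columns 'age_years' and
# 'li_urban', computing a logistic-style probability for all rows at once.
def custom_predict(df: pd.DataFrame) -> pd.Series:
    log_odds = -2.0 + 0.03 * df["age_years"] + 0.5 * df["li_urban"].astype(float)
    return 1.0 / (1.0 + np.exp(-log_odds))

df = pd.DataFrame({"age_years": [20, 60], "li_urban": [True, False]})
probs = custom_predict(df)
```

The gain comes from replacing per-predictor evaluation overhead with one pass of NumPy arithmetic over the columns.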
Together, the `apply` methods of the two spurious-symptom events `SymptomManager_SpuriousSymptomResolve` and `SymptomManager_SpuriousSymptomOnset` constitute around 10% of the overall run time, so there is potentially scope for looking at possible improvements in the handling of spurious symptoms.
To provide one additional datapoint, a 5 year run of `scale_run.py` with profiling enabled and an initial population of 50k, using the code on 4c51802 (current tip of the branch in #354), took 3 hours to complete; extrapolating, a full 20 year run would be roughly 12 hours, which seems reasonable. This was with the settings in `scale_run.py` otherwise unchanged (in particular, no changes were made to the `capabilities_coefficient` value, which I think may need to be adjusted to reflect the change in scaling?).
Thanks for all this @matt-graham; c. 5h for 20k/20y simulation with all the bells and whistles feels really manageable to me!!
I think going up to 50k would make sense, as it would provide a respectable c. 1800 persons per district.
Yes, there are now new completed disease modules that we should fold into the profiling scripts (notably `Alri`, `BladderCancer` and `Measles`). Should we add these in now? I can do a quick PR. (Shortly three more will be added, which will be the complete set: `Rti`, `Tb` and `Malnutrition`.)
I'll also do a PR that updates the `capabilities_coefficient`.
Looking through the various things, I see that we rely on `simplified_births` in many cases. I wonder if we should stop doing that, and let all the profiling rely on the full set of modules for contraception/births/labour etc.
But, apart from these thoughts, I don't think we need to change the spec of these runs. Let me know what you and @tamuri think and if you'd need me to do a PR suggesting these changes.
Summary of recent PRs reducing `scale_run.py` run times:

- `LinearModel` predictor conditions
- `HealthSystemScheduler`

After the updates to `scale_run.py` in #363, for a 5 year profiled run with the configuration otherwise left as the new defaults (i.e. 50k initial population and `capabilities_coefficient` set based on the ratio of the initial population to the actual estimated 2010 population), the SnakeViz output for the overall breakdown of the total run time (9980s) is
The ten event methods with the highest proportions of the overall run time are as follows:

- `healthsystem.HealthSystemScheduler.apply` — 22.5% (2250s)
- `healthseekingbehaviour.HealthSeekingBehaviourPoll.apply` — 18.3% (1820s)
- `labour.BirthEvent.apply` — 7.04% (703s)
- `symptommanager.SymptomManager_SpuriousSymptomOnset.apply` — 5.41% (540s)
- `malaria.MalariaCureEvent.apply` — 5.01% (500s)
- `pregnancy_supervisor.PregnancySupervisorEvent.apply` — 5.00% (499s)
- `symptommanager.SymptomManager_SpuriousSymptomResolve.apply` — 4.07% (407s)
- `postnatal_supervisor.PostnatalWeekOneEvent.apply` — 3.93% (392s)
- `postnatal_supervisor.PostnatalSupervisorEvent.apply` — 2.87% (287s)
- `labour.LabourAtHomeEvent.apply` — 2.84% (284s)

Some initial thoughts about possible performance improvements:
The total time spent in functions / methods in the `symptommanager` module is now quite high, both in the `symptommanager.SymptomManager_SpuriousSymptomOnset` and `symptommanager.SymptomManager_SpuriousSymptomResolve` events themselves, plus virtually all the time spent in `malaria.MalariaCureEvent.apply` is in running `clear_symptoms`.

The latter arises in the lines

which clear all the symptoms for a set of person IDs in a loop; there might therefore be some performance gain from extending `clear_symptoms` to allow passing a set / list of integers for the `person_id` argument.
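As a generic illustration of why batching over person IDs helps (the column name here is made up, not the actual `SymptomManager` internals): a single vectorised assignment replaces one indexed assignment per person.

```python
import pandas as pd

# Made-up population frame with one boolean symptom column.
df = pd.DataFrame({"sy_fever": [True, True, False, True]})
person_ids = [0, 1, 3]

# Loop version: one .loc assignment per person, each paying the full
# pandas indexing overhead.
for pid in person_ids:
    df.loc[pid, "sy_fever"] = False

# Batched version: a single assignment covering all person IDs, which is
# what a clear_symptoms accepting a list could do internally.
df.loc[person_ids, "sy_fever"] = False
```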
In `clear_symptoms` itself, the main work is done in a loop over a set of symptoms, with `change_symptom` called for each symptom.

Running `change_symptom` for each of a set of symptoms seems to be quite a common pattern:

https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/alri.py#L1013-L1019
https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/alri.py#L1531-L1537
https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/diarrhoea.py#L1042-L1053
https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/malaria.py#L482-L490
https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/malaria.py#L524-L531
https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/measles.py#L280-L294
https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/symptommanager.py#L605-L632
https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/symptommanager.py#L658-L667

which suggests there might also be some gain from generalising `change_symptom` to allow passing a list / set of symptoms, to avoid the overhead of repeated calls.
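Similarly, a `change_symptom` variant accepting a collection of symptoms could update all the corresponding columns in one operation (a sketch with invented column names, not the real symptom-column layout):

```python
import pandas as pd

# Made-up frame with one boolean column per symptom.
df = pd.DataFrame({
    "sy_fever": [True, False, True],
    "sy_cough": [True, True, False],
})
person_ids = [0, 2]
symptoms = ["fever", "cough"]

# Current pattern: one call / assignment per symptom, with the indexing
# overhead repeated each time.
for symptom in symptoms:
    df.loc[person_ids, f"sy_{symptom}"] = False

# Batched alternative: all symptom columns for all persons in one assignment.
df.loc[person_ids, [f"sy_{s}" for s in symptoms]] = False
```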
A considerable amount of time is being spent in `schedule_hsi_event`: 1090s, which is 11.0% of the overall run time. This is partly due to the high number of individual calls (10342802) to this method, corresponding to an average rate of about 5700 HSI events being scheduled per simulated day; for a population of 50k this seems reasonable (?). Although we have already done several rounds of optimization of `schedule_hsi_event`, given it is getting called so often it is probably worth doing a further pass to see if there are any even minor optimizations that can be made. From the breakdown of the time spent, a considerable amount of time is going to operations in the function itself as well as to calling `get_appt_footprint_as_time_request`.
Another possible target for optimization is the `__init__` method of `HSI_GenericFirstApptAtFacilityLevel1`, which is called 9262660 times (suggesting `HSI_GenericFirstApptAtFacilityLevel1` constitutes the vast majority of the HSI events being scheduled), with the total time spent in this method 482s (4.83% of overall time).
Thanks for this Matt, my first quick reactions:

I totally agree on making the symptom manager accept a set of symptoms to onset, for one person or more than one person, and otherwise making that work in batches. I have often thought about doing this but hadn't clocked how big of a deal it may be for performance.

I am a little worried that some of the profiling results are due to the rate of spurious symptoms being very high. We don't know the real rate, but the values in there are probably too high when all the disease modules are in. I haven't got a good answer for that, but just be aware that the number of hits of spurious symptoms is probably too high. In the diarrhoea PR I turn it down a bit (done today!). Perhaps I should pull that into its own PR so we can see the effect.

Re the init of GenericHSI: I had wondered if we could make this into a batch thing, like we did with SpuriousSymptomBatchOnset, i.e. one event per day which does the "work" for each person that is going to have a GenericHSI that day. Every GenericHSI is scheduled by HealthSeekingBehaviour and for "tomorrow", so that module could curate a record of the people instead (???).
Thanks @tbhallett
- I totally agree on making the symptom manager accept a set of symptoms to onset, for one person or more than one person, and otherwise making that work in batches. I have often thought about doing this but hadn't clocked how big of a deal it may be for performance.
Okay it seems like this is probably a good first target for me to work on then!
- I am a little worried that some of the profiling results are due to the rate of spurious symptoms being very high. We don't know the real rate, but the values in there are probably too high when all the disease modules are in. I haven't got a good answer for that, but just be aware that the number of hits of spurious symptoms is probably too high. In the diarrhoea PR I turn it down a bit (done today!). Perhaps I should pull that into its own PR so we can see the effect.

Thanks, that's useful to know. I'll hold off on looking at specific optimisations to the spurious-symptom events for now, until we've checked whether reducing the rate is sufficient to make these events non-performance-critical.
- Re the init of GenericHSI: I had wondered if we could make this into a batch thing, like we did with SpuriousSymptomBatchOnset, i.e. one event per day which does the "work" for each person that is going to have a GenericHSI that day. Every GenericHSI is scheduled by HealthSeekingBehaviour and for "tomorrow", so that module could curate a record of the people instead (???).

Ah, that's an interesting idea. Given the number of generic HSI events and also the large time spent scheduling them in `HealthSeekingBehaviour`, this sort of approach could be quite a big performance gain; I'll have a look at how this could be done!
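A minimal sketch of the daily-batch pattern being discussed (all names here are invented, not the TLOmodel API): instead of constructing one generic-first-appointment event object per person, the health-seeking module records person IDs per day, and a single daily event processes the whole batch.

```python
from collections import defaultdict

# Hypothetical sketch: one daily event that does the "work" for every person
# queued for a generic first appointment that day, replacing millions of
# per-person __init__ calls with list appends.
class DailyGenericFirstApptBatch:
    def __init__(self):
        self.pending = defaultdict(list)  # day -> list of person IDs

    def enqueue(self, day, person_id):
        # Called by health-seeking behaviour when scheduling for "tomorrow".
        self.pending[day].append(person_id)

    def apply(self, day):
        # Run once per day: process everyone queued, then clear the record.
        persons = self.pending.pop(day, [])
        return [f"first_appt for person {p}" for p in persons]

batch = DailyGenericFirstApptBatch()
batch.enqueue(5, 101)
batch.enqueue(5, 102)
results = batch.apply(5)
```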
A 5 year / 50k initial population run of `scale_run.py` on bfb03a7 gives the following SnakeViz plot of the profiling results, with the overall simulation time 25320s.
There seem to be a few parts of some of the newer / rewritten modules that would benefit from some refactoring to improve performance:
In the `rti` module, 4540s (18% of overall run time) is being spent evaluating the line

https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/rti.py#L3043

I think the poor performance here is due to the `rt_date_to_remove_daly` column being of object dtype, as it contains a list. This means that operations on the column will fall back to Python loops rather than using faster NumPy array operations. As the lists are all of the same length (8), I think it would make more sense here to have 8 distinct columns of date type (matching the 8 numbered `rt_injury_*` columns). This would then allow the `apply` operation here to be replaced with something like

```python
any_not_null = df.loc[df.is_alive, [f'rt_date_to_remove_daly_{i}' for i in range(8)]].notnull().any(axis=1)
```
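A small self-contained illustration of this column layout (the column names follow the suggestion above but the data is made up): with eight separate datetime64 columns, the null check runs as vectorised NumPy operations rather than a Python-level loop over lists.

```python
import pandas as pd

# Toy population frame: eight datetime columns in place of one
# object-dtype column holding lists of 8 dates.
df = pd.DataFrame({"is_alive": [True, True, False]})
for i in range(8):
    df[f"rt_date_to_remove_daly_{i}"] = pd.NaT  # datetime64[ns] column of NaT

# Give one living person a scheduled date in one slot.
df.loc[0, "rt_date_to_remove_daly_3"] = pd.Timestamp("2012-01-01")

cols = [f"rt_date_to_remove_daly_{i}" for i in range(8)]
any_not_null = df.loc[df.is_alive, cols].notnull().any(axis=1)
```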
In the `schisto` module, 2550s (10% of overall run time) is being spent evaluating the method `SchistoSpecies.update_infectious_status_and_symptoms`. The main bottlenecks seem to be in evaluating the line

https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/schisto.py#L578

and

https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/schisto.py#L550-L553

The former might warrant looking at whether we can do any further optimizations to `SymptomManager`, though from a brief look I can't see anything obvious that can be improved about the `clear_symptoms` implementation. The `_inf_status` function used in the `apply` in the latter consists of a series of boolean logic that could potentially be translated to boolean operations directly on the column / `Series` objects, which might speed things up, as I think the `apply` will be applied to each row individually.

In the `labour` module, 1600s (6% of overall run time) is being spent evaluating the `Labour.predict` method. About half of this seems to be in indexing (`__getitem__`) operations. The repeated indexing of the `mni` object by `person_id` in

https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/labour.py#L1081-L1095

could be avoided by setting `mni = self.sim.modules['PregnancySupervisor'].mother_and_newborn_info[person_id]` initially, though it's not clear if this is the bottleneck here: looking at the referenced pandas `__getitem__` operation, it seems to be for a `_LocationIndexer` object, suggesting it might be the `person = df.loc[[person_id]]` which is the culprit. If so, it is not immediately clear how to reduce the cost.

Can we get a recent profile viz? I know at least the RTI fix was merged (#682). Thanks.
A 5 year / 40k initial population run of scale_run.py on https://github.com/UCL/TLOmodel/commit/6b9a1d5e8bd7e6817cf479b3fec7391a969a68ae gives the following SnakeViz plot of the profiling results:
with the overall simulation time at ~11500s on an Apple M1 chip.
The ten event methods with the highest proportions of the overall run time are the following:

- `labour.BirthandPostnatalOutcomesEvent.apply` — 15.66% (1800s)
- `schisto.SchistoMatureWorms.apply` — 10.10% (1160s)
- `healthseekingbehaviour.HealthSeekingBehaviourPoll.apply` — 8.95% (1030s)
- `healthsystem.HealthSystemScheduler.apply` — 7.18% (824s)
- `postnatal_supervisor.PostnatalWeekOneNeonatalEvent.apply` — 5.43% (623s)
- `labour.LabourOnsetEvent.apply` — 5.20% (597s)
- `labour.LabourAtHomeEvent.apply` — 4.58% (526s)
- `postnatal_supervisor.PostnatalSupervisorEvent.apply` — 4.37% (501s)
- `postnatal_supervisor.PostnatalWeekOneMaternalEvent.apply` — 3.89% (447s)
- `diarrhoea.DiarrhoeaNaturalRecoveryEvent.apply` — 3.6% (413s)

SnakeViz shows larger overheads than pyinstrument, with the overall time of a similar run reduced to 9092s.
The results with pyinstrument for a 5 year / 40k population run can be accessed here.
The results with pyinstrument for a 5 year / 20k population run can be accessed here.
Please note that to make the html files accessible in their current format I had to make them public, let me know if this is an issue.
Thanks, Dimitra. My vote for the two worth exploring are the HealthSeekingBehaviourPoll (4) and then the SchistoMatureWorms event (2). The first seems to be a fairly lengthy LinearModel which is actually not called very often; it might improve if the LinearModel is set up using a custom function. The second looks like an individual event that is called many times. Need to discuss with @tbhallett (perhaps @tdm32?) about whether that can be replaced by a population-level event without losing important behaviour of the model.
On the second point about `SchistoMatureWorms`, I am sure we could do something using a 'Poll' and do the progression for the whole population at once.
Making a note that scale runs, which so far have focused on `mode_appt_constraints=1`, should in the future also consider `mode_appt_constraints=2` (see PR #986), as this may become more widely used by analysts.
A performance issue already identified relates to the length of the HSI queue, which depends on assumptions around `tclose` (see issue #999; however, note that since that issue was first written up some changes were made in the way the HSI queue is queried under `mode_appt_constraints=2`, which may have improved matters).
/run profiling
The current version of `src/scripts/profiling/scale_run.py` is running very slowly. After reducing the simulation time to 1 month, it still took around 2.5 hours to complete a run on my laptop, so if this simulation rate remained the same, running for the full 20 years here would take around 25 days!

From some trial-and-error, the main causes of the slowdown compared to the previous script (which for comparison took around 30 minutes to do a 2 year simulation with the same population size, but a different set of modules / configuration) seem to be both setting `spurious_symptoms=True` in the initializer for the `SymptomManager` module

https://github.com/UCL/TLOmodel/blob/63f58f06ec4dd706ab3f0d39415dc2512399d729/src/scripts/profiling/scale_run.py#L66

and setting `mode_appt_constraints=2` and `capabilities_coefficient=0.01` in the initializer for the `HealthSystem` module

https://github.com/UCL/TLOmodel/blob/63f58f06ec4dd706ab3f0d39415dc2512399d729/src/scripts/profiling/scale_run.py#L71-L74

with each individually causing a large slowdown. The additional disease modules seem to be only adding a small overhead in comparison.
A SnakeViz plot of running for 1 month simulation time with the original configuration is below.

Just under 80% of the run time is being spent in `get_appt_footprint_as_time_request`, which is called 1351075 times.

In comparison, running for 1 month simulation time with `spurious_symptoms=False` in `SymptomManager` and `mode_appt_constraints=0, capabilities_coefficient=1` in `HealthSystem` gives the following SnakeViz plot.

In this case the percentage of time spent in `get_appt_footprint_as_time_request` is much reduced though still significant (27%), and the number of calls is much lower (24344, ~2% of 1351075). The overall runtime is also around 6% of the previous one.