UCL / TLOmodel

Epidemiology modelling framework for the Thanzi la Onse project
https://www.tlomodel.org/
MIT License
10 stars 4 forks source link

Improving performance: profiling `scale_run` simulations #286

Open matt-graham opened 3 years ago

matt-graham commented 3 years ago

The current version of src/scripts/profiling/scale_run.py is running very slowly. After reducing the simulation time to 1 month it still took around 2.5 hours to complete a run on my laptop, so if this simulation rate remained the same running for the full 20 years here would take around ~25 days!

From some trial-and-error the main causes of the slowdown compared to the previous script (which for comparison took around 30 minutes to do a 2 year simulation with the same population size, but different set of modules / configuration) seems to be both of setting spurious_symptoms=True in the initializer for the SymptomManager module

https://github.com/UCL/TLOmodel/blob/63f58f06ec4dd706ab3f0d39415dc2512399d729/src/scripts/profiling/scale_run.py#L66

and setting mode_appt_constraints=2 and capabilities_coefficient=0.01 in the initializer for the HealthSystem module,

https://github.com/UCL/TLOmodel/blob/63f58f06ec4dd706ab3f0d39415dc2512399d729/src/scripts/profiling/scale_run.py#L71-L74

with each individually causing a large slow down. The additional disease modules seem to be only adding a small overhead in comparison.

A SnakeViz plot of running for 1 month simulation time with the original configuration is below

image

Just under 80% of the run time is being spent in get_appt_footprint_as_time_request which is called 1351075 times.

In comparison running for 1 month simulation time with spurious_symptoms=False in SymptomManager and mode_appt_constraints=0, capabilities_coefficient=1 in HealthSystem gives the following SnakeViz plot

image

In this case the percentage of time spent in get_appt_footprint_as_time_request is much reduced though still significant (27%) and the number of calls much lower (24344, ~2% of 1351075). The overall runtime is also around 6% of previously.

matt-graham commented 3 years ago

After the changes made in #287 a corresponding profiling run as described above gives the following SnakeViz output

SnakeViz output for profiling run post #287

which is significantly improved but there are still some bottlenecks so I will keep this issue open for now to keep track of progress.

matt-graham commented 3 years ago

After the changes made in #303 a profiling run as described above gives the following SnakeViz output Snake viz output Next :dart:: get_squeeze_factors

matt-graham commented 3 years ago

After the changes made in #308, a profiling run as described above gives the following SnakeViz output

image

As running for longer simulation times is now tractable without being overly onerous, going forwards I'll use a 6 month simulation time run of scale_run.py as the baseline to give a better reflection of the end goal of doing a full 20 year run. I'll also switch to using pyinstrument rather than cProfile / SnakeViz as it seems to give less biased figures due to lower overhead, is more directly interpretable as it directly estimates the time spent at each level of the call tree and also is easy to include the text output from in PRs vs screenshotting SnakeViz!

After the changes made in #308, a pyinstrument profiling output for a 6 month run of scale_run.py is as below

pyinstrument profiling output for 6 month run using 5fcd391 ``` 4061.675 ../:1 [4 frames hidden] .., runpy 4061.675 _run_code runpy.py:64 └─ 4061.675 scale_run_updated.py:1 └─ 4047.957 simulate tlo/simulation.py:194 ├─ 4000.051 fire_single_event tlo/simulation.py:262 │ └─ 4000.046 run tlo/events.py:32 │ ├─ 2501.234 apply tlo/methods/healthsystem.py:1329 │ │ ├─ 1200.651 tlo/methods/healthsystem.py:1407 │ │ │ └─ 1173.925 get_appt_footprint_as_time_request tlo/methods/healthsystem.py:738 │ │ │ ├─ 414.627 tlo/methods/healthsystem.py:771 │ │ │ ├─ 264.343 get_facility_info tlo/methods/healthsystem.py:727 │ │ │ │ ├─ 208.279 __getitem__ pandas/core/indexing.py:2148 │ │ │ │ │ [34 frames hidden] pandas, .. │ │ │ │ └─ 43.083 [self] │ │ │ ├─ 217.513 [self] │ │ │ ├─ 124.905 __init__ collections/__init__.py:540 │ │ │ │ [4 frames hidden] collections │ │ │ └─ 111.355 __getitem__ pandas/core/frame.py:2987 │ │ │ [22 frames hidden] pandas, .. │ │ ├─ 392.668 __iadd__ collections/__init__.py:807 │ │ │ [12 frames hidden] collections, .. │ │ ├─ 227.622 heappop ../:0 │ │ │ [2 frames hidden] .. │ │ ├─ 187.204 __getitem__ pandas/core/indexing.py:2148 │ │ │ [34 frames hidden] pandas, .. │ │ ├─ 178.306 log_hsi_event tlo/methods/healthsystem.py:1013 │ │ │ ├─ 78.200 tlo/methods/healthsystem.py:1039 │ │ │ ├─ 51.226 [self] │ │ │ └─ 41.073 info tlo/logging/core.py:208 │ │ ├─ 170.586 [self] │ │ ├─ 51.558 get_squeeze_factors tlo/methods/healthsystem.py:791 │ │ │ └─ 49.889 tlo/methods/healthsystem.py:830 │ │ └─ 41.507 did_not_run tlo/methods/hsi_generic_first_appts.py:335 │ ├─ 1166.280 apply tlo/methods/healthseekingbehaviour.py:193 │ │ ├─ 836.514 schedule_hsi_event tlo/methods/healthsystem.py:385 │ │ │ ├─ 494.516 tlo/methods/healthsystem.py:500 │ │ │ │ ├─ 200.845 all pandas/core/generic.py:10870 │ │ │ │ │ [1796 frames hidden] pandas, .., typing, numpy, abc, inspect │ │ │ │ ├─ 198.571 __getitem__ pandas/core/indexing.py:882 │ │ │ │ │ [1030 frames hidden] pandas, numpy, .., abc, typing, conte... │ │ │ │ └─ 49.394 new_method pandas/core/ops/common.py:50 │ │ │ │ [458 frames hidden] pandas, .., numpy, abc │ │ │ ├─ 154.988 __getitem__ pandas/core/indexing.py:882 │ │ │ │ [1606 frames hidden] pandas, .., numpy, typing, abc, conte... │ │ │ └─ 56.658 new_method pandas/core/ops/common.py:50 │ │ │ [546 frames hidden] pandas, .., numpy, abc, warnings │ │ ├─ 144.549 have_what tlo/methods/symptommanager.py:435 │ │ │ └─ 142.867 apply pandas/core/frame.py:7622 │ │ │ [956 frames hidden] pandas, .., numpy, abc │ │ │ 142.453 apply_series_generator pandas/core/apply.py:281 │ │ │ ├─ 82.435 tlo/methods/symptommanager.py:440 │ │ │ │ └─ 81.897 tlo/methods/symptommanager.py:440 │ │ │ │ └─ 71.790 __getitem__ pandas/core/series.py:837 │ │ │ │ [38 frames hidden] pandas, .. │ │ └─ 76.369 iterrows pandas/core/frame.py:1026 │ │ [375 frames hidden] pandas, .., abc, numpy │ └─ 97.779 apply tlo/methods/malaria.py:557 │ └─ 94.663 schedule_hsi_event tlo/methods/healthsystem.py:385 │ └─ 56.425 tlo/methods/healthsystem.py:500 └─ 46.343 initialise_simulation tlo/methods/hiv.py:557 └─ 46.136 get_time_from_infection_to_aids tlo/methods/hiv.py:862 ```
matt-graham commented 2 years ago

Summary of recent progress in reducing scale_run.py run times:

313 (changing appointment footprint to sparse structure)

pyinstrument profiling output for 6 month run using 5fcd391 (pre-PR), 4060s total ``` 4061.675 ../:1 [4 frames hidden] .., runpy 4061.675 _run_code runpy.py:64 └─ 4061.675 scale_run_updated.py:1 └─ 4047.957 simulate tlo/simulation.py:194 ├─ 4000.051 fire_single_event tlo/simulation.py:262 │ └─ 4000.046 run tlo/events.py:32 │ ├─ 2501.234 apply tlo/methods/healthsystem.py:1329 │ │ ├─ 1200.651 tlo/methods/healthsystem.py:1407 │ │ │ └─ 1173.925 get_appt_footprint_as_time_request tlo/methods/healthsystem.py:738 │ │ │ ├─ 414.627 tlo/methods/healthsystem.py:771 │ │ │ ├─ 264.343 get_facility_info tlo/methods/healthsystem.py:727 │ │ │ │ ├─ 208.279 __getitem__ pandas/core/indexing.py:2148 │ │ │ │ │ [34 frames hidden] pandas, .. │ │ │ │ └─ 43.083 [self] │ │ │ ├─ 217.513 [self] │ │ │ ├─ 124.905 __init__ collections/__init__.py:540 │ │ │ │ [4 frames hidden] collections │ │ │ └─ 111.355 __getitem__ pandas/core/frame.py:2987 │ │ │ [22 frames hidden] pandas, .. │ │ ├─ 392.668 __iadd__ collections/__init__.py:807 │ │ │ [12 frames hidden] collections, .. │ │ ├─ 227.622 heappop ../:0 │ │ │ [2 frames hidden] .. │ │ ├─ 187.204 __getitem__ pandas/core/indexing.py:2148 │ │ │ [34 frames hidden] pandas, .. │ │ ├─ 178.306 log_hsi_event tlo/methods/healthsystem.py:1013 │ │ │ ├─ 78.200 tlo/methods/healthsystem.py:1039 │ │ │ ├─ 51.226 [self] │ │ │ └─ 41.073 info tlo/logging/core.py:208 │ │ ├─ 170.586 [self] │ │ ├─ 51.558 get_squeeze_factors tlo/methods/healthsystem.py:791 │ │ │ └─ 49.889 tlo/methods/healthsystem.py:830 │ │ └─ 41.507 did_not_run tlo/methods/hsi_generic_first_appts.py:335 │ ├─ 1166.280 apply tlo/methods/healthseekingbehaviour.py:193 │ │ ├─ 836.514 schedule_hsi_event tlo/methods/healthsystem.py:385 │ │ │ ├─ 494.516 tlo/methods/healthsystem.py:500 │ │ │ │ ├─ 200.845 all pandas/core/generic.py:10870 │ │ │ │ │ [1796 frames hidden] pandas, .., typing, numpy, abc, inspect │ │ │ │ ├─ 198.571 __getitem__ pandas/core/indexing.py:882 │ │ │ │ │ [1030 frames hidden] pandas, numpy, .., abc, typing, conte... │ │ │ │ └─ 49.394 new_method pandas/core/ops/common.py:50 │ │ │ │ [458 frames hidden] pandas, .., numpy, abc │ │ │ ├─ 154.988 __getitem__ pandas/core/indexing.py:882 │ │ │ │ [1606 frames hidden] pandas, .., numpy, typing, abc, conte... │ │ │ └─ 56.658 new_method pandas/core/ops/common.py:50 │ │ │ [546 frames hidden] pandas, .., numpy, abc, warnings │ │ ├─ 144.549 have_what tlo/methods/symptommanager.py:435 │ │ │ └─ 142.867 apply pandas/core/frame.py:7622 │ │ │ [956 frames hidden] pandas, .., numpy, abc │ │ │ 142.453 apply_series_generator pandas/core/apply.py:281 │ │ │ ├─ 82.435 tlo/methods/symptommanager.py:440 │ │ │ │ └─ 81.897 tlo/methods/symptommanager.py:440 │ │ │ │ └─ 71.790 __getitem__ pandas/core/series.py:837 │ │ │ │ [38 frames hidden] pandas, .. │ │ └─ 76.369 iterrows pandas/core/frame.py:1026 │ │ [375 frames hidden] pandas, .., abc, numpy │ └─ 97.779 apply tlo/methods/malaria.py:557 │ └─ 94.663 schedule_hsi_event tlo/methods/healthsystem.py:385 │ └─ 56.425 tlo/methods/healthsystem.py:500 └─ 46.343 initialise_simulation tlo/methods/hiv.py:557 └─ 46.136 get_time_from_infection_to_aids tlo/methods/hiv.py:862 ```
pyinstrument profiling output for 6 month run using b290f04 (post-PR), 3120s total ``` 3124.388 ../:1 [4 frames hidden] .., runpy 3124.388 _run_code runpy.py:64 └─ 3124.388 scale_run_updated.py:1 └─ 3111.459 simulate tlo/simulation.py:194 ├─ 3066.945 fire_single_event tlo/simulation.py:262 │ └─ 3066.938 run tlo/events.py:32 │ ├─ 1694.334 apply tlo/methods/healthsystem.py:1319 │ │ ├─ 555.128 tlo/methods/healthsystem.py:1397 │ │ │ └─ 535.696 get_appt_footprint_as_time_request tlo/methods/healthsystem.py:734 │ │ │ ├─ 232.383 get_facility_info tlo/methods/healthsystem.py:723 │ │ │ │ ├─ 186.981 __getitem__ pandas/core/indexing.py:2148 │ │ │ │ │ [34 frames hidden] pandas, .. │ │ │ │ └─ 37.195 [self] │ │ │ ├─ 161.598 [self] │ │ │ └─ 132.843 __init__ collections/__init__.py:540 │ │ │ [4 frames hidden] collections │ │ ├─ 348.971 __iadd__ collections/__init__.py:807 │ │ │ [12 frames hidden] collections, .. │ │ ├─ 216.923 heappop ../:0 │ │ │ [2 frames hidden] .. │ │ ├─ 182.151 __getitem__ pandas/core/indexing.py:2148 │ │ │ [34 frames hidden] pandas, .. │ │ ├─ 155.993 [self] │ │ ├─ 100.272 log_hsi_event tlo/methods/healthsystem.py:1003 │ │ │ ├─ 52.150 [self] │ │ │ └─ 35.463 info tlo/logging/core.py:208 │ │ ├─ 47.124 get_squeeze_factors tlo/methods/healthsystem.py:781 │ │ │ └─ 45.622 tlo/methods/healthsystem.py:820 │ │ └─ 38.118 did_not_run tlo/methods/hsi_generic_first_appts.py:335 │ ├─ 1069.676 apply tlo/methods/healthseekingbehaviour.py:193 │ │ ├─ 760.387 schedule_hsi_event tlo/methods/healthsystem.py:385 │ │ │ ├─ 463.801 tlo/methods/healthsystem.py:500 │ │ │ │ ├─ 193.890 all pandas/core/generic.py:10870 │ │ │ │ │ [1796 frames hidden] pandas, .., numpy, typing, abc, inspect │ │ │ │ ├─ 180.673 __getitem__ pandas/core/indexing.py:882 │ │ │ │ │ [1030 frames hidden] pandas, .., numpy, abc, typing, conte... │ │ │ │ ├─ 46.204 new_method pandas/core/ops/common.py:50 │ │ │ │ │ [458 frames hidden] pandas, .., numpy, abc │ │ │ │ └─ 36.275 wrapper pandas/core/strings/accessor.py:93 │ │ │ │ [236 frames hidden] pandas, .., numpy, re │ │ │ ├─ 144.639 __getitem__ pandas/core/indexing.py:882 │ │ │ │ [1606 frames hidden] pandas, .., numpy, typing, abc, conte... │ │ │ └─ 52.143 new_method pandas/core/ops/common.py:50 │ │ │ [546 frames hidden] pandas, .., numpy, abc, warnings │ │ ├─ 139.978 have_what tlo/methods/symptommanager.py:435 │ │ │ └─ 138.370 apply pandas/core/frame.py:7622 │ │ │ [966 frames hidden] pandas, .., numpy, abc │ │ │ 137.956 apply_series_generator pandas/core/apply.py:281 │ │ │ ├─ 78.682 tlo/methods/symptommanager.py:440 │ │ │ │ └─ 78.295 tlo/methods/symptommanager.py:440 │ │ │ │ └─ 69.741 __getitem__ pandas/core/series.py:837 │ │ │ │ [38 frames hidden] pandas, .. │ │ └─ 70.356 iterrows pandas/core/frame.py:1026 │ │ [376 frames hidden] pandas, .., abc, numpy │ └─ 85.117 apply tlo/methods/malaria.py:557 │ └─ 83.166 schedule_hsi_event tlo/methods/healthsystem.py:385 │ └─ 49.937 tlo/methods/healthsystem.py:500 └─ 43.024 initialise_simulation tlo/methods/hiv.py:557 └─ 42.813 get_time_from_infection_to_aids tlo/methods/hiv.py:862 └─ 37.114 predict tlo/lm.py:339 └─ 36.700 eval pandas/core/computation/eval.py:160 [23771 frames hidden] pandas, .., collections, ast, abc, nu... ```

319 (appointment types per facility level as list of sets)

pyinstrument profiling output for 6 month run using ad0e751 (pre-PR), 4990s total ``` 4993.373 ../:1 [4 frames hidden] .., runpy 4993.373 _run_code runpy.py:64 └─ 4993.373 scale_run_arg.py:1 └─ 4979.298 simulate tlo/simulation.py:194 └─ 4932.258 fire_single_event tlo/simulation.py:260 └─ 4932.247 run tlo/events.py:32 ├─ 3127.327 apply tlo/methods/healthsystem.py:1326 │ ├─ 1726.654 tlo/methods/healthsystem.py:1404 │ │ └─ 1691.103 get_appt_footprint_as_time_request tlo/methods/healthsystem.py:735 │ │ ├─ 657.910 get_facility_info tlo/methods/healthsystem.py:724 │ │ │ ├─ 583.035 __getitem__ pandas/core/indexing.py:2148 │ │ │ │ [60 frames hidden] pandas, .. │ │ │ └─ 62.161 [self] │ │ ├─ 446.810 tlo/methods/healthsystem.py:768 │ │ ├─ 262.114 [self] │ │ ├─ 139.141 __getitem__ pandas/core/frame.py:2987 │ │ │ [22 frames hidden] pandas, .. │ │ └─ 138.252 __init__ collections/__init__.py:540 │ │ [4 frames hidden] collections │ ├─ 400.981 __iadd__ collections/__init__.py:807 │ │ [12 frames hidden] collections, .. │ ├─ 259.284 heappop ../:0 │ │ [2 frames hidden] .. │ ├─ 211.322 __getitem__ pandas/core/indexing.py:2148 │ │ [34 frames hidden] pandas, .. │ ├─ 191.852 log_hsi_event tlo/methods/healthsystem.py:1010 │ │ ├─ 83.061 tlo/methods/healthsystem.py:1036 │ │ └─ 59.753 [self] │ ├─ 181.283 [self] │ └─ 52.810 get_squeeze_factors tlo/methods/healthsystem.py:788 │ └─ 51.022 tlo/methods/healthsystem.py:827 ├─ 1431.438 apply tlo/methods/healthseekingbehaviour.py:200 │ ├─ 1042.926 schedule_hsi_event tlo/methods/healthsystem.py:385 │ │ ├─ 604.817 tlo/methods/healthsystem.py:500 │ │ │ ├─ 244.164 all pandas/core/generic.py:10870 │ │ │ │ [1796 frames hidden] pandas, .., typing, numpy, abc, inspect │ │ │ ├─ 239.054 __getitem__ pandas/core/indexing.py:882 │ │ │ │ [1030 frames hidden] pandas, .., numpy, abc, typing, conte... │ │ │ ├─ 61.244 new_method pandas/core/ops/common.py:50 │ │ │ │ [458 frames hidden] pandas, .., numpy, abc │ │ │ └─ 50.281 wrapper pandas/core/strings/accessor.py:93 │ │ │ [236 frames hidden] pandas, .., numpy, re │ │ ├─ 187.281 __getitem__ pandas/core/indexing.py:882 │ │ │ [1606 frames hidden] pandas, .., typing, numpy, abc, conte... │ │ └─ 72.168 new_method pandas/core/ops/common.py:50 │ │ [546 frames hidden] pandas, .., numpy, abc, warnings │ ├─ 154.549 have_what tlo/methods/symptommanager.py:455 │ │ └─ 152.679 apply pandas/core/frame.py:7622 │ │ [1020 frames hidden] pandas, .., numpy, abc │ │ 152.218 apply_series_generator pandas/core/apply.py:281 │ │ ├─ 84.863 tlo/methods/symptommanager.py:460 │ │ │ └─ 84.412 tlo/methods/symptommanager.py:460 │ │ │ └─ 75.163 __getitem__ pandas/core/series.py:837 │ │ │ [38 frames hidden] pandas, .. │ └─ 89.763 iterrows pandas/core/frame.py:1026 │ [373 frames hidden] pandas, .., abc, numpy └─ 118.387 apply tlo/methods/malaria.py:580 └─ 114.423 schedule_hsi_event tlo/methods/healthsystem.py:385 └─ 65.844 tlo/methods/healthsystem.py:500 ```
pyinstrument profiling output for 6 month run using 30d6c1e (post-PR), 3910s total ``` 3908.824 ../:1 [4 frames hidden] .., runpy 3908.824 _run_code runpy.py:64 └─ 3908.824 scale_run_arg.py:1 └─ 3895.191 simulate tlo/simulation.py:194 ├─ 3849.575 fire_single_event tlo/simulation.py:260 │ └─ 3849.570 run tlo/events.py:32 │ ├─ 2858.453 apply tlo/methods/healthsystem.py:1330 │ │ ├─ 1582.785 tlo/methods/healthsystem.py:1408 │ │ │ └─ 1551.660 get_appt_footprint_as_time_request tlo/methods/healthsystem.py:739 │ │ │ ├─ 588.230 get_facility_info tlo/methods/healthsystem.py:728 │ │ │ │ ├─ 524.436 __getitem__ pandas/core/indexing.py:2148 │ │ │ │ │ [60 frames hidden] pandas, .. │ │ │ │ └─ 51.976 [self] │ │ │ ├─ 417.660 tlo/methods/healthsystem.py:772 │ │ │ ├─ 241.726 [self] │ │ │ ├─ 129.434 __getitem__ pandas/core/frame.py:2987 │ │ │ │ [22 frames hidden] pandas, .. │ │ │ └─ 128.753 __init__ collections/__init__.py:540 │ │ │ [4 frames hidden] collections │ │ ├─ 373.016 __iadd__ collections/__init__.py:807 │ │ │ [12 frames hidden] collections, .. │ │ ├─ 233.442 heappop ../:0 │ │ │ [2 frames hidden] .. │ │ ├─ 185.336 __getitem__ pandas/core/indexing.py:2148 │ │ │ [34 frames hidden] pandas, .. │ │ ├─ 174.578 log_hsi_event tlo/methods/healthsystem.py:1014 │ │ │ ├─ 79.488 tlo/methods/healthsystem.py:1040 │ │ │ ├─ 50.608 [self] │ │ │ └─ 39.991 info tlo/logging/core.py:208 │ │ ├─ 165.047 [self] │ │ ├─ 50.289 get_squeeze_factors tlo/methods/healthsystem.py:792 │ │ │ └─ 48.589 tlo/methods/healthsystem.py:831 │ │ └─ 41.363 did_not_run tlo/methods/hsi_generic_first_appts.py:317 │ ├─ 726.546 apply tlo/methods/healthseekingbehaviour.py:200 │ │ ├─ 391.373 schedule_hsi_event tlo/methods/healthsystem.py:393 │ │ │ ├─ 183.215 __getitem__ pandas/core/indexing.py:882 │ │ │ │ [1606 frames hidden] pandas, .., typing, numpy, abc, conte... │ │ │ └─ 61.678 new_method pandas/core/ops/common.py:50 │ │ │ [546 frames hidden] pandas, .., numpy, abc, warnings │ │ ├─ 140.315 have_what tlo/methods/symptommanager.py:455 │ │ │ └─ 138.590 apply pandas/core/frame.py:7622 │ │ │ [949 frames hidden] pandas, .., numpy, abc │ │ │ 138.171 apply_series_generator pandas/core/apply.py:281 │ │ │ ├─ 78.980 tlo/methods/symptommanager.py:460 │ │ │ │ └─ 75.817 tlo/methods/symptommanager.py:460 │ │ │ │ └─ 69.245 __getitem__ pandas/core/series.py:837 │ │ │ │ [38 frames hidden] pandas, .. │ │ └─ 80.101 iterrows pandas/core/frame.py:1026 │ │ [400 frames hidden] pandas, .., abc, numpy │ └─ 43.426 apply tlo/methods/malaria.py:580 │ └─ 40.493 schedule_hsi_event tlo/methods/healthsystem.py:393 └─ 44.027 initialise_simulation tlo/methods/hiv.py:570 └─ 43.783 get_time_from_infection_to_aids tlo/methods/hiv.py:892 ```

322 (caching appointment time requests)

SnakeViz profiling output for 12 month run (pre-PR), 12000s total ![image](https://user-images.githubusercontent.com/6746980/124131344-9cb4f380-da77-11eb-9927-41449337fb2b.png) ![image](https://user-images.githubusercontent.com/6746980/124131496-c5d58400-da77-11eb-86e9-cd20a4d51cf4.png)
SnakeViz profiling output for 12 month run (post-PR), 8170s total ![image](https://user-images.githubusercontent.com/6746980/124135076-624d5580-da7b-11eb-9c2f-58210312c892.png) ![image](https://user-images.githubusercontent.com/6746980/124135546-cf60eb00-da7b-11eb-8e8d-b5a86cd05361.png)

325 (using Counter update method to sum footprints)

SnakeViz profiling output for 12 month run (pre-PR), 12000s total ![image](https://user-images.githubusercontent.com/6746980/127326269-a55cee6d-6e9d-4455-9457-e7ef13ed0a90.png)
SnakeViz profiling output for 12 month run (post-PR), 11300s total ![image](https://user-images.githubusercontent.com/6746980/127326540-04e3b5e9-7710-41ac-9a33-d1fe1c2e9ae0.png)

337 (multiply daily capabilities by capabilities_coefficient once only)

SnakeViz profiling output for 5 year run (pre-PR), 15100s total ![image](https://user-images.githubusercontent.com/6746980/127326283-d4771749-1cc8-483c-8eba-3827cf12dd3b.png)
SnakeViz profiling output for 5 year run (post-PR), 12600s total ![image](https://user-images.githubusercontent.com/6746980/127326548-bd50d75d-930c-44fc-aecf-5b3eec0a9ca4.png)
matt-graham commented 2 years ago

Summary of recent PRs reducing scale_run.py run times:

343 (use set instead of dataframe in check that HSI event footprint is possible)

Snakeviz profiling output for 5 year run (pre-PR), 12600s total ![image](https://user-images.githubusercontent.com/6746980/127328825-b38bcba2-c391-4442-a4db-a1deef2d177e.png) ![image](https://user-images.githubusercontent.com/6746980/127329025-ffceb8ad-040b-4826-8f05-f67a68a6bf40.png)
Snakeviz profiling output for 5 year run (post-PR), 9370s total ![image](https://user-images.githubusercontent.com/6746980/127329096-7df9094f-61f2-4a2f-a506-d17e60cfa78a.png) ![image](https://user-images.githubusercontent.com/6746980/127329275-4a4d4dad-b251-4ab4-837e-94b30bde5a64.png)

346 (vectorising health seeking behaviour poll event)

Snakeviz profiling output for 5 year run (pre-PR), 9370s total ![image](https://user-images.githubusercontent.com/6746980/127346909-51d79d33-1cf1-46a4-9211-77cbc8b110ba.png) ![image](https://user-images.githubusercontent.com/6746980/127346650-af477d92-7d77-4925-b7fe-7d3c21aef6d2.png)
Snakeviz profiling output for 5 year run (post-PR), 6450s total ![image](https://user-images.githubusercontent.com/6746980/127660252-69d1192b-ec9b-42a7-adcb-f379c7112d5c.png) ![image](https://user-images.githubusercontent.com/6746980/127660370-121dfa74-f856-4ca0-9a75-3da3269363bb.png)
matt-graham commented 2 years ago

A full 20 year / 20k population run of scale_run.py with profiling enabled now takes a relatively manageable five and half hours (and probably a bit less without the profiling overhead). We can probably therefore consider increasing the computational demands of the profiling runs. @tbhallett are there any changes you think it would be worth making to the profiling run configurations, e.g. increasing the simulation population or adding in further disease modules?

The SnakeViz breakdown for a 20 year / 20k population run on current master (a691b72) is as follows

image

The ten event methods with the highest proportions of the overall run time are as follows:

  1. healthsystem.HealthSystemScheduler.apply 27.7%
  2. healthseekingbehaviour.HealthSeekingBehaviourPoll.apply 14.7%
  3. pregnancy_supervisor.PregnancySupervisorEvent.apply 8.56%
  4. cardio_metabolic_disorders.CardioMetabolicDisorders_MainPollingEvent.apply 7.16%
  5. symptommanager.SymptomManager_SpuriousSymptomResolve.apply 5.20%
  6. labour.BirthEvent.apply 4.92%
  7. symptommanager.SymptomManager_SpuriousSymptomOnset.apply 4.74%
  8. postnatal_supervisor.PostnatalSupervisorEvent.apply 4.46%
  9. postnatal_supervisor.PostnatalWeekOneEvent.apply 2.82%
  10. malaria.MalariaCureEvent.apply 2.55%

The HealthSystemScheduler.apply and HealthSeekingBehaviourPoll.apply are still the two largest time consumers, taking together around 42.4% of the overall run time, though this is significantly improved compared to prior to some of the recent optimisation when these two methods together constituted the vast majority of the run time.

image

Within HealthSystemScheduler.apply, HealthSystem.impose_beddays_footprint is quite costly constituting 6.36% of overall run time and 23.0% of the time in HealthSystemScheduler.apply, and hopefully the changes #258 will help reduce this and/or enable making further optimizations. HealthSystem.get_squeeze_factors is also relatively costly (3.84% of overall run time, 13.8% of time in HealthSystemScheduler.apply) and there may be scope for some further optimizations there though given there have already been several rounds of optimization of get_squeeze_factors I suspect there is not too much room for further improvement.

image

Within HealthSeekingBehaviourPoll.apply, HealthSystem.schedule_hsi_event is the largest contributor to run time, with 9.19% of the overall run time and around 62.4% of the time in HealthSeekingBehaviourPoll.apply. As commented in #346, as many similar events are scheduled simultaneously in HealthSeekingBehaviourPoll.apply there is potentially room for improvement here by adding a 'batched' schedule_hsi_event equivalent which avoids redundantly repeating checks etc. when many of the attributes of the HSI events being scheduled are shared.

image

Within pregnancy_supervisor.PregnancySupervisorEvent.apply there is no obvious bottleneck with the time split realtively evenly over many different methods so there does not seem to obvious avenues for significant improvement.

image

Within cardio_metabolic_disorders.CardioMetabolicDisorders_MainPollingEvent.apply nearly all the time is being spent in evaluating the predict method of linear models. This suggests there may be some value in either trying to further optimize the linear model predict function (possibly exploring some of the ideas mentioned in #267) in general, with the total time spent in LinearModel.predict 13.8% of the total run time, or specifically in the context of cardio_metabolic_disorders.CardioMetabolicDisorders_MainPollingEvent.apply it may be worth exploring whether optimizing the linear model implementations there e.g. by creating custom predict functions to use in LinearModel.custom, may be worthwhile.

Together the apply methods of the two spurious symptoms related events SymptomManager_SpuriousSymptomResolve and SymptomManager_SpuriousSymptomOnset constitute around 10% of the overall run time and so there is potentially scope for looking at possible improvements in the handling of spurious symptoms.

matt-graham commented 2 years ago

A full 20 year / 20k population run of scale_run.py with profiling enabled now takes a relatively manageable five and half hours (and probably a bit less without the profiling overhead). We can probably therefore consider increasing the computational demands of the profiling runs. @tbhallett are there any changes you think it would be worth making to the profiling run configurations, e.g. increasing the simulation population or adding in further disease modules?

To provide one additional datapoint, a 5 year run of scale_run.py with profiling enabled and an initial population of 50k using the code on 4c51802 (current tip of branch in #354) took 3 hours to complete, so extrapolating, a full 20 year run would be roughly 12 hours which seems reasonable. This was with the settings in scale_run.py otherwise unchanged (in particular no changes were made to the capabilities_coefficient value which I think may need to be adjusted to reflect the change in scaling?).

image

tbhallett commented 2 years ago

Thanks for all this @matt-graham; c. 5h for 20k/20y simulation with all the bells and whistles feels really manageable to me!!

I think going up to 50k would make sense as would be provide a respectable c1800 persons per district.

Yes, there are now new completed disease modules that we should fold into the profiling scrips (notably Alri, BladderCancer and Measles). Should we add these in now? I can do a quick PR. (Shortly three more will be added, which will be the complete set: "Rti", "Tb" and "Malnutrition").

I'll also do a PR that updates the capabilities_coefficient.

Looking through the various things, I see that we rely on simplified_births in many cases. I wonder if we should stop doing that, and let all the profiling rely on the full set of modules for contraceptions/births/labour etc.

But, apart from these thoughts, I don't think we need to change the spec of these runs. Let me know what you and @tamuri think and if you'd need me to do a PR suggesting these changes.

matt-graham commented 2 years ago

Summary of recent PRs reducing scale_run.py run times:

354 (add method to schedule batch of HSI events)

Snakeviz profiling output for 5 year 20k population run (pre-PR), 6260s total ![image](https://user-images.githubusercontent.com/6746980/130928586-2827b77d-cb56-4393-a921-66020f7b3052.png) ![image](https://user-images.githubusercontent.com/6746980/130928746-c91893df-0424-4d07-b77f-f954a0ee93c3.png)
Snakeviz profiling output for 5 year 20k population run (post-PR), 5580s total ![image](https://user-images.githubusercontent.com/6746980/130928448-0f1c850c-bb0b-4fb3-8c6d-71b210201671.png) ![image](https://user-images.githubusercontent.com/6746980/130928697-4511a89e-aaab-4c84-b4c3-9a2421853559.png)

357 (add flags to declare properties of LinearModel predictor conditions)

Snakeviz profiling output for 5 year 20k population run (pre-PR), 6260s total ![image](https://user-images.githubusercontent.com/6746980/130928586-2827b77d-cb56-4393-a921-66020f7b3052.png) ![image](https://user-images.githubusercontent.com/6746980/131145325-30b07c1c-79ff-4a01-b17d-da6edc0b8228.png)
Snakeviz profiling output for 5 year 20k population run (post-PR), 5510s total ![image](https://user-images.githubusercontent.com/6746980/131145200-59156e51-8549-496e-88de-764d95a564c5.png) ![image](https://user-images.githubusercontent.com/6746980/131145603-7e0dfe93-e6e2-4679-9892-29e069cd6246.png)

366 (avoid repeated dataframe accesses in HealthSystemScheduler)

Snakeviz profiling output for 5 year 50k population run (pre-PR), 10100s total ![image](https://user-images.githubusercontent.com/6746980/133443950-170fda71-62bd-495e-befb-78c4badd86de.png) ![image](https://user-images.githubusercontent.com/6746980/133444038-0f316e89-2033-436a-8288-5dea54e6c221.png)
Snakeviz profiling output for 5 year 50k population run (post-PR), 9320s total ![image](https://user-images.githubusercontent.com/6746980/133404873-5cdd1465-fa0e-4442-9d40-0106eb275070.png) ![image](https://user-images.githubusercontent.com/6746980/133404936-b788afc2-61b3-4024-b82c-54a171676af7.png)
matt-graham commented 2 years ago

After the updates to scale_run.py in #363, a 5 year profiled run with the configuration otherwise left as the new defaults (i.e. 50k initial population and capabilities_coefficient set based on ratio of initial population to actual estimated 2010 population), the SnakeViz output for the overall breakdown of the toal run time (9980s) is

image

The ten event methods with the highest proportions of the overall run time are as follows:

  1. healthsystem.HealthSystemScheduler.apply 22.5% (2250s)
  2. healthseekingbehaviour.HealthSeekingBehaviourPoll.apply 18.3% (1820s)
  3. labour.BirthEvent.apply 7.04% (703s)
  4. symptommanager.SymptomManager_SpuriousSymptomOnset.apply 5.41% (540s)
  5. malaria.MalariaCureEvent.apply 5.01% (500s)
  6. pregnancy_supervisor.PregnancySupervisorEvent.apply 5.00% (499s)
  7. symptommanager.SymptomManager_SpuriousSymptomResolve.apply 4.07% (407s)
  8. postnatal_supervisor.PostnatalWeekOneEvent.apply 3.93% (392s)
  9. postnatal_supervisor.PostnatalSupervisorEvent.apply 2.87% (287s)
  10. labour.LabourAtHomeEvent.apply 2.84% (284s)

Some initial thoughts about possible performance improvements

The total time spent in functions / methods in the symptommanager module is now quite high, both in the symptommanager.SymptomManager_SpuriousSymptomOnset and symptommanager.SymptomManager_SpuriousSymptomResolve events themselves, plus virtually all the time spent in malaria.MalariaCureEvent.apply is in running clear_symptoms.

The latter arises in the lines

https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/malaria.py#L1226-L1229

which clears all the symptoms for a set of person IDs in a loop - it might be there could some performance gain therefore from extending clear_symptoms to allow passing a set / list of integers for the person_id argument.

In clear_symptoms itself, the main work is done in a loop over a set of symptoms, with change_symptom called for each symptom

https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/symptommanager.py#L499-L505

Running change_symptom for each of a set of symptoms seems to be quite a common pattern

https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/alri.py#L1013-L1019 https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/alri.py#L1531-L1537 https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/diarrhoea.py#L1042-L1053 https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/malaria.py#L482-L490 https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/malaria.py#L524-L531 https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/measles.py#L280-L294 https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/symptommanager.py#L605-L632 https://github.com/UCL/TLOmodel/blob/7812c645fd6a177180482db460f040f00355eb4d/src/tlo/methods/symptommanager.py#L658-L667

which suggests there might also be some gain to either generalising change_symptom to allow passing a list / set of symptoms to try to avoid the overhead of repeated calls.

A considerable amount of time is being spent in schedule_hsi_event - 1090s which is 11.0% of the overall run time. This is partly due to the high number of individual calls (10342802) to this method, corresponding to an average rate of about 5700 HSI events being scheduled per simulated day; for a population of 50k this seems reasonable (?). Although we have already done several rounds of optimization to schedule_hsi_event, given it is getting called so often it is probably worth doing a further pass to see if there are any even minor optimizations that can be made. From the breakdown of the time spent

image

a considerable amount of time is being spent in operations in the function itself as well as in calling get_appt_footprint_as_time_request.

Another possible target for optimization is the __init__ method of HSI_GenericFirstApptAtFacilityLevel1 which is called 9262660 times (suggesting HSI_GenericFirstApptAtFacilityLevel1 constitute the vast majority of the HSI events being scheduled), with total time spent in this method 482s (4.83% of overall time).

tbhallett commented 2 years ago

Thanks for this Matt

my first quick reactions—-

matt-graham commented 2 years ago

Thanks @tbhallett

  • i totally agree on the making symptom manager accept a set of symptoms to onset for one person or more than one person, and otherwise making that work in batches. I have often thought about doing this but hadn’t clocked how big of a deal it may be for performance.

Okay it seems like this is probably a good first target for me to work on then!

  • I am a little worried that some of the profiling results is due to the rate of spurious symptoms between very high. We don’t know the real rate but the values in there are probably too high when all the disease modules are in. I haven’t got a good answer for that. But just to be aware that the numbers of hits of spurious symptoms is probably too high. In the diarrhoea PR I turn it down a bit (done today!). Perhaps I should pull that into it’s own PR so we can see the effect.

Thanks that's useful to know - I'll hold off on looking at specific optimisation the spurious symptoms related events for now then until we've checked if reducing the rate is sufficient to make these events non-performance critical.

  • re the init of GenericHSI: I had wondered if we could make this into a batch thing, like we did with SpuriousSymptomBatchOnset. ie, one event per day, which does the “work” for each person that is going to have a GenericHSI that day. Every GenericHSI is scheduled by HealthSeekingBehavioir and for the “tomorrow”.. so that module could curate a record of the people instead (???).

Ah that's an interesting idea - given the number of generic HSI events and also the large time spent scheduling them in HealthSeekingBehaviour this sort of approach could be quite a big performance gain, I'll have a look at how this could be done!

matt-graham commented 1 year ago

A 5 year / 50k initial population run of scale_run.py on bfb03a7 gives the following SnakeViz plot of the profiling results

image

with the overall simulation time 25320s.

There seem to be a few parts of some of the newer / rewritten modules that would benefit from some refactoring to improve performance:

  1. In the rti module 4540s (18% of overall run time) is being spent in evaluating the line https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/rti.py#L3043 I think the poor performance here is due to the rt_date_to_remove_daly column being of object dtype as it contains a list. This means that operations on the column will fallback to Python loops rather than using faster NumPy array operations. As the lists are all of the same length (8) I think it would make more sense here to have 8 distinct columns of data type (matching the 8 numbered rt_injury_* columns). This would then allow the apply operation here to be replaced with something like
      any_not_null = df.loc[df.is_alive, [f'rt_date_to_remove_daly_{i}' for i in range(8)].notnull().any(axis=1)
  2. In the schisto module 2550s (10% of overall run time) is being spent in evaluating the method SchistoSpecies.update_infectious_status_and_symptoms. The main bottlenecks seem to be in evaluating the line https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/schisto.py#L578 and https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/schisto.py#L550-L553 The former might warrant looking at if we can do any further optimizations to SymptomManager though from a brief look I can't see anything obvious that can be improved about the clear_symptoms implementation. The _inf_status function used in the apply in the latter consists of a series of boolean logic that could potentially be translated to boolean operations directly on the column / Series objects which might speed things up as I think the apply will be applied to each row individually.
  3. In the labour module 1600s (6% of overall run time) is being spent in evaluating the Labour.predict method. About half of this seems to be in indexing __getitem__ operations. The repeated indexing of the mni object iby person_id n https://github.com/UCL/TLOmodel/blob/7669163f245bd56616a2934fb900710675f1bdcf/src/tlo/methods/labour.py#L1081-L1095 could be avoided by setting mni = self.sim.modules['PregnancySupervisor'].mother_and_newborn_info[person_id] initially though its not clear if this the bottleneck here as looking at the referenced Pandas __getitem__ operation it seems to be for a _LocationIndexer object suggesting it might be the person = df.loc[[person_id]] which is the culprit. If so it is not immediately clear how to reduce the cost.
tamuri commented 1 year ago

Can we get a recent profile viz? I know at least the RTI fix was merged (#682). Thanks.

dimitrasal commented 1 year ago

A 5 year / 40k initial population run of scale_run.py on https://github.com/UCL/TLOmodel/commit/6b9a1d5e8bd7e6817cf479b3fec7391a969a68ae gives the following SnakeViz plot of the profiling results:

Screenshot 2023-03-13 at 10 58 57

with the overall simulation time at ~11500s on an Apple M1 chip.

The ten methods with the highest proportions of the overall time are the following:

  1. labour.BirthandPostnatalOutcomesEvent.apply at 15.66% (1800s )
  2. schisto.SchistoMatureWorms.apply at 10.10% (1160s)
  3. healthseekingbehaviour.HealthseekingBehaviourPoll.apply at 8.95% (1030s)
  4. healthsystem.HealthSystemScheduler.apply at 7.18% (824s)
  5. postnatal_supervisor.PostnatalWeekOneNeonatalEvent.apply at 5.43% (623s)
  6. labour.LabourOnsetEvent.apply at 5.20% (597s)
  7. labour.LabourAtHomeEvent.apply at 4.58% (526s)
  8. postnatal_supervisor.PostnatalSupervisorevent.apply at 4.37% (501s)
  9. postnatal_supervisor.PostnatalWeekOneMaternalEvent.apply at 3.89% (447s)
  10. diarrhoea.DiarrhoeaNaturalRecoveryEvent.apply at 3.6% (413s)

Snakeviz shows larger overheads than pyinstrument with the overall time of a similar run reduced to 9092s. The results with pyinstrument for 5 year/40 k population can be accessed here The results with pyinstrument for 5 year/20 k population can be accessed here

Please note that to make the html files accessible in their current format I had to make them public, let me know if this is an issue.

tamuri commented 1 year ago

Thanks, Dimitra. My vote for two worth exploring are the HealthSeekingBehaviourPoll (4) and then the SchistoMatureWorms event (2). The first seems to be a fairly lengthy LinearModel which is actually not called very often. It might improve if the LinearModel is setup using a custom function. The second looks like an individual event that is called many times. Need to discuss with @tbhallett (perhaps @tdm32 ?) about whether that can be replaced by population-level event without losing important behaviour of the model.

tbhallett commented 1 year ago

On the second point about SchistoMatureWorms, I am sure we could do something using a 'Poll' and do the progression for the whole population at once.

marghe-molaro commented 11 months ago

Making a note that scale runs, which so far have focused on mode_appt_constaints = 1, in the future should also consider mode_appt_constraints = 2 (see PR #986) as this may become more widely used by analysts.

Performance issue already identified is related to the length of the HSI queue, which is dependent on assumptions around tclose (see Issue #999. However note that since this issue was first written up some changes where made in the way the the hsi queue is queried under mode_appt_constraints = 2 which may have improved matters).

matt-graham commented 7 months ago

/run profiling