marghe-molaro closed this 3 months ago
Hi @marghe-molaro. Nothing immediately comes to mind that could have caused such a substantial change in model behaviour. As far as we're aware the changes to HSI events and generic first appointments in #1289, #1292 and #1315 shouldn't have changed the model output behaviour at all, and similarly switching to running on Python 3.11 on Azure in #1323 also should not have affected the model outputs. But obviously something has, so maybe I'm assuming wrongly 😬.
Do you have the specific Git commit SHAs for the two sets of runs (I think these should be recorded under the key `commit` in the JSON file in the top-level directory containing the batch job results)? With those we could try running (shorter) runs locally and see where they start diverging.
I totally agree @matt-graham that none of these changes should have made a difference, that's why the difference in outputs is so frustrating!
The two Git commit SHAs are as follows. For effect_of_policy-2024-04-09T132919Z (draw 0):
`"commit": "c61895880abf8e942635839d7d62ed38c83f6f89", "github": "https://github.com/UCL/TLOmodel/tree/c61895880abf8e942635839d7d62ed38c83f6f89"`
and for effect_of_capabilities_scaling-2024-04-29T074640Z (matched by draw 6):
`"commit": "a91d60c518069ed11477f27fb3c81f1ee92302ca", "github": "https://github.com/UCL/TLOmodel/tree/a91d60c518069ed11477f27fb3c81f1ee92302ca"`
If you compare the parameter choice in the JSON file for the draws 0 and 6 respectively you can see they are identical. These correspond to the "No growth" scenarios in the plots above, which clearly differ.
I did a quick run locally with a very small population (100 individuals), and the rescaling factors at the moment of switching modes (2019) were different (around doubled in the more recent version of the PR). This seems to suggest that the difference starts to emerge before the switch to mode 2/rescaling. However, would the PRs on the generic first appointments have affected the sequence of random numbers? If so it might be quite hard to pin down whether the divergence is just down to the small population.
Maybe it would be good to submit a run from the latest branch rolling back the "infinity check" commit, to double check that it is not that creating issues. It's the only PR-specific change that was brought in.
> Maybe it would be good to submit a run from the latest branch rolling back the "infinity check" commit, to double check that it is not that creating issues. It's the only PR-specific change that was brought in.
Agree this is probably worth doing, as even if this is unlikely to be the source of the difference it would be good to exclude it.
To check for consistency in simulations I'm running

```
python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout
```

with the two commits c61895880abf8e942635839d7d62ed38c83f6f89 and a91d60c518069ed11477f27fb3c81f1ee92302ca, with some small changes to `src/tlo/simulation.py` to print a hash of the population dataframe after each event has completed. As calculating the hash takes a little while this is unfortunately quite slow, but the outputs after the first 500 events suggest the population dataframes have stayed exactly in sync. If this continues to be the case, it suggests the problem might be something specific to the scenarios rather than the simulation, though it could well be that I just haven't reached the divergence point yet.
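A minimal sketch of the kind of hashing being described (assuming pandas; the actual changes made to `simulation.py` may differ), using `pandas.util.hash_pandas_object` to get a digest that is stable for identical values, dtypes and index:

```python
import hashlib

import pandas as pd


def population_hash(df: pd.DataFrame) -> str:
    """Return a deterministic hex digest of a dataframe's contents."""
    # hash_pandas_object gives a per-row uint64 hash that is stable across
    # runs for the same values, dtypes and index.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha1(row_hashes.tobytes()).hexdigest()


# Toy population dataframe (column names illustrative, not TLOmodel's schema).
df = pd.DataFrame({"is_alive": [True, False], "age_years": [30, 62]})
assert population_hash(df) == population_hash(df.copy())  # same data, same hash
```

Printing such a digest after each event lets two runs be compared line-by-line without storing full population snapshots.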
Thank you so much Matt. Do you have any intuition as to how much simulation time 500 events corresponds to? When running locally I was checking the divergence after 8 years. I was doing the local runs using the same scenario files from which I submitted the batches on Azure (with reduced population).
For completeness let me just note that the "infinity check" would only affect things after a mode switch, so the fact that things were starting to diverge earlier would suggest that it's unlikely to be the problem. But I will check directly.
Many thanks again!
The 500 events was less than a month of simulation time; this was very slow to run partly because event rate scales with population (and I was using a relatively large 50 000 initial population), and also because outputting the dataframe hash on every event introduces a lot of overhead, to the point that computing hashes ends up being the main computational cost. I've re-run with a more sensible configuration of an initial population of 5000, outputting hashes only every 100 events. So far, after 60 000 events / just over seven years simulation time, the population dataframes still appear to be in sync from the reported hashes. I will keep this running just to check divergences don't start appearing around the 8 year mark.
Assuming we don't get any later divergences, this suggests the broader changes to the framework in master that were merged in (particularly the refactoring of HSI events / generic first appointments) have not changed the model behaviour, which is what we expected. This is not immediately helpful in identifying what is causing the divergence here, but it at least means we can probably ignore these changes when looking for the problem (unless there is some weird interaction with running simulations as part of a scenario that has meant these changes cause differences in model behaviour).
Thank you Matt!
On the "infinity check" side, I can confirm this couldn't have resulted in changes to model behaviour because infinities are already prevented from entering the 'Fraction_Time_Used' column at an earlier stage, namely: summary_by_officer['Fraction_Time_Used'] = ( summary_by_officer['Minutes_Used'] / summary_by_officer['Total_Minutes_Per_Day'] ).replace([np.inf, -np.inf, np.nan], 0.0) which means the added infinity check at the moment of rescaling is redundant.
Just as a quick follow up, the simulations continued to remain in sync up to 10 years simulation time, so I think we can relatively confidently exclude that the differences are arising from the changes to HSI events / generic first appointments.
Ok, what's even stranger is that when I run locally from the two branches using the scenario files, enforcing a mode switch after only one year (2011) and assuming 1000 individuals, I already see a difference in the rescaling factors:
When I run locally with `python src/scripts/profiling/scale_run.py` and the scenario files, this is what I see on the first day in terms of minutes used:
The commit leading to changes in the "Minutes Used" in this branch is commit d56042fb65a30e258e16d2bcf342bf933963e5bf, which merged master into the branch.
Thanks for the further 🕵️ work @marghe-molaro!
If you are seeing the event chain as identical, I wonder if the issue could be down to the assumed footprint? I.e. the sequence of events is identical, but the registered request in terms of minutes has changed.
Yeah, I have only been checking for consistency in terms of the values in the population dataframe. Any logged outputs which don't 'feed back' into the model in terms of affecting the triggering or effects of events wouldn't be captured by what I was checking. I'll try to set something up to see if/when the logged capabilities start diverging.
> The commit leading to changes in the "Minutes Used" in this branch is commit https://github.com/UCL/TLOmodel/commit/d56042fb65a30e258e16d2bcf342bf933963e5bf, which merged master into branch.
Thanks for isolating this. Given this doesn't touch `healthsystem.py`, it's not obvious to me what would be causing these differences, but I'll see if stepping through simulations reveals anything!
Hi @marghe-molaro.
Running

```
python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000
```

on a91d60c51, with a line `print(summary_by_officer.loc["Clinical"])` added to `HealthSystem.log_current_capabilities_and_usage` (choosing "Clinical" here just to keep the output manageably sized, and because there were differences evident for this officer type in the output you recorded), I get the following output after the 10000th event:
```
10000 2011-04-17 00:00:00 13314295d95889a198c06d870d40a548244f2124
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 380.24 2.962632
2 270.021738 132.24 0.489738
3 113.389117 40.00 0.352768
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 357.06 2.782025
2 270.021738 259.12 0.959626
3 113.389117 20.00 0.176384
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 435.56 3.393656
2 270.021738 352.12 1.304043
3 113.389117 60.00 0.529151
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 379.12 2.953905
2 270.021738 341.00 1.262861
3 113.389117 40.00 0.352768
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
```
and running the equivalent on c61895880 we get
```
10000 2011-04-17 00:00:00 13314295d95889a198c06d870d40a548244f2124
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 380.24 2.962632
2 270.021738 132.24 0.489738
3 113.389117 40.00 0.352768
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 357.06 2.782025
2 270.021738 259.12 0.959626
3 113.389117 20.00 0.176384
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 435.56 3.393656
2 270.021738 352.12 1.304043
3 113.389117 60.00 0.529151
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 128.345357 379.12 2.953905
2 270.021738 341.00 1.262861
3 113.389117 40.00 0.352768
4 1.158937 0.00 0.000000
5 2.679878 0.00 0.000000
```
which seems to be identical (and from some quick eyeballing I can't spot any divergences anywhere else in the output).
Can I check what commands you were running @marghe-molaro where you got the differences you posted above in the logged capabilities?
I get the difference when running `python src/scripts/profiling/scale_run.py`.
Admittedly, when running `python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000` I get the same result (on the first day of the sim at least).
Just copying it here to double check I'm not going out of my mind. "first" and "second" component-comparison are just the names of the copies of the two commits I made.
You can see that when using `python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout` I get different results, whereas when using `python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000` the output is the same.
Thanks @marghe-molaro, that's helpful. The default initial population is much larger than 5000, so it's possible it's something to do with this. Currently away from keyboard, but when back I'll check whether I see differences running with the default arguments. If it's some rare event causing the discrepancy, that could explain why it's only evident at larger populations.
Thank you so much @matt-graham! Just to add to the general confusion, when I was running locally with a different scenario file and forcing the population to be quite small (<=1000) I was still seeing a discrepancy on day one... so I'm really confused as to why it's still not showing up for you with a population of 5000.
PS: before you do anything else, please check whether you can recover the discrepancy between the commits on the first day of the sim when running with `python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout`; so far I'm the only one seeing these discrepancies, so I'd like to make sure you see them too!
Hmm, odd, I don't seem to get any discrepancies running

```
python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout
```

On a91d60c51:
```
0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 654.75 0.510147
2 2700.217379 1184.00 0.438483
3 1133.891165 0.00 0.000000
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 802.75 0.625461
2 2700.217379 1023.00 0.378858
3 1133.891165 1195.50 1.054334
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
```
and on c61895880
```
0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 654.75 0.510147
2 2700.217379 1184.00 0.438483
3 1133.891165 0.00 0.000000
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 802.75 0.625461
2 2700.217379 1023.00 0.378858
3 1133.891165 1195.50 1.054334
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
```
Do you have a clean working tree where you're running these?
Interestingly the figures I get seem to be the same as for your `molaro/second-component-comparison` branch.
This is my commit log for `molaro/first-component-comparison`. Commit a6cfbd4c91209ace210c3a282d43837358b99cfd is on this branch before the master merge; the two "Test" commits just modify the healthsystem to print the summary log.
Just realised most of the checks for consistency I performed previously were bogus, as I was running one set of results from an additional working tree; but because I was using the same conda environment for both, in which the `tlo` package was installed in editable mode pointing to the main working tree, running a script from the additional working tree still used the `tlo` source code from the main working tree :facepalm:.
After now trying to run

```
python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout
```

from the two commits, with a line `print(summary_by_officer.loc["Clinical"])` added to `HealthSystem.log_current_capabilities_and_usage` in both cases, I get the same differences in output as @marghe-molaro has been reporting.
On a91d60c51
```
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 654.75 0.510147
2 2700.217379 1184.00 0.438483
3 1133.891165 0.00 0.000000
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 802.75 0.625461
2 2700.217379 1023.00 0.378858
3 1133.891165 1195.50 1.054334
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
```
and on c61895880
```
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.0 0.000000
1a 1283.453567 618.0 0.481513
2 2700.217379 1184.0 0.438483
3 1133.891165 0.0 0.000000
4 11.589373 0.0 0.000000
5 26.798780 0.0 0.000000
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 693.81 0.540581
2 2700.217379 1023.00 0.378858
3 1133.891165 797.00 0.702889
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
```
Repeating the above with the population dataframe hash also printed out every 100 events, we get this output on c61895880
```
0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.0 0.000000
1a 1283.453567 618.0 0.481513
2 2700.217379 1184.0 0.438483
3 1133.891165 0.0 0.000000
4 11.589373 0.0 0.000000
5 26.798780 0.0 0.000000
100 2010-01-02 00:00:00 3ff9fc28a196f298b69314ac3c31ae1d95db5623
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 693.81 0.540581
2 2700.217379 1023.00 0.378858
3 1133.891165 797.00 0.702889
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
```
and on a91d60c51
```
0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 654.75 0.510147
2 2700.217379 1184.00 0.438483
3 1133.891165 0.00 0.000000
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
Total_Minutes_Per_Day Minutes_Used Fraction_Time_Used
Facility_Level
0 0.000000 0.00 0.000000
1a 1283.453567 802.75 0.625461
2 2700.217379 1023.00 0.378858
3 1133.891165 1195.50 1.054334
4 11.589373 0.00 0.000000
5 26.798780 0.00 0.000000
```
So the population dataframes are diverging. This does suggest that something in the merge commit d56042f, which @marghe-molaro previously identified as being where the differences appear, has caused a change in model behaviour.
To try to isolate which commit to master that was merged in caused the differences, I created a script `population_checksum_check_for_bisect.py` which checks the population dataframe checksum against a reference value for a short simulation of 1 month with 1000 initial population (after finding this does give different checksums on the relevant commits):

```shell
REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7
YEARS=0
MONTHS=1
INITIAL_POPULATION=1000
python src/scripts/profiling/scale_run.py --years $YEARS --months $MONTHS --log-final-population-checksum --initial-population $INITIAL_POPULATION | tail -n 1 | grep -q "Population checksum: ${REFERENCE_CHECKSUM}"
```
Then running `git bisect` with

```shell
git checkout 82a02e9740f4129cd5301b2be717bab4eab7a717
git bisect start
git bisect bad
git checkout 012ef6d7fd1e5635ce1987abd70cda0f8d5e4ae9
git bisect good
git bisect run ./population_checksum_check_for_bisect.py
```
which uses 012ef6d7fd1e5635ce1987abd70cda0f8d5e4ae9 as the 'good' commit (the commit on master that #1308 was initially branched off from) and 82a02e9740f4129cd5301b2be717bab4eab7a717 as the 'bad' commit (the commit on master at the point it was merged into #1308 in d56042fb65a30e258e16d2bcf342bf933963e5bf), this identifies bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d (merge of #1292) as the first 'bad' commit.
So it looks like #1292 inadvertently introduced some changes to the model behaviour. @willGraham01, do you think you might be able to help with tracking down what in #1292 might have led to the differences? We could probably do a `git bisect` similar to the above on the branch #1292 was based on, to isolate the commit in that branch at which the differences first arose.
Looks like some of the commits in `wgraham/1237-speedup-concept` don't allow running the check script, as there are errors in the code which mean the `scale_run.py` script fails to run. We might need to adjust the check script to return a special exit code 125 if `scale_run.py` fails to complete without error, exit code 0 if it completes and gives the correct checksum, and exit code 1 if it completes but gives the wrong checksum (`git bisect run` treats exit code 125 as "this commit cannot be tested, skip it").
After some slight edits to @matt-graham's script above to account for the 125 exit codes,

```shell
#!/bin/bash
REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7
YEARS=0
MONTHS=1
INITIAL_POPULATION=1000
RUN_CHECKSUM=$( \
    python src/scripts/profiling/scale_run.py \
        --years $YEARS --months $MONTHS --log-final-population-checksum \
        --initial-population $INITIAL_POPULATION | tail -n 1 \
    | grep -oP "(?<=checksum: ).*(?=\"])" \
)
if [ $? -ne 0 ]
then
    echo "Python script failed, commit cannot be tested"
    exit 125
fi
echo "Run checksum was: $RUN_CHECKSUM"
echo "Checksum to match: $REFERENCE_CHECKSUM"
if [[ "$RUN_CHECKSUM" == "$REFERENCE_CHECKSUM" ]]
then
    echo "That's a match"
else
    echo "No match, fail"
    exit 1
fi
```
and then bisecting along the `wgraham/1237-speedup-concept` branch:

```shell
git checkout wgraham/1237-speedup-concept
git bisect start
git bisect bad
git checkout 012ef6d
git bisect good
git bisect run <the-script-above>
```
we get that the first "bad" commit is 3863c0b2c9120c274eea6033b255a694e0f0df9a. I'll take a look at this commit in more detail and report back!
- 3863c0b2c9120c274eea6033b255a694e0f0df9a: some logic translation in `check_if_fever_is_caused_by_maleria`. Discrepancy fixed, but tests need updating.
- 0fd30ba22: introduces a change in the number of random samples generated by the stunting module, which is also causing a divergence. Fixed this.
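The stunting fix illustrates a general property of seeded RNG streams: drawing a different *number* of samples at one point shifts every later draw from the same stream, even when everything else is identical. A minimal sketch (names hypothetical, not TLOmodel code):

```python
import numpy as np


def later_draw(n_samples_in_earlier_step: int, seed: int = 42) -> float:
    """Consume some draws in an 'earlier step', then return the next draw."""
    rng = np.random.default_rng(seed)
    rng.random(n_samples_in_earlier_step)  # e.g. the stunting module's samples
    return rng.random()                    # the next piece of logic's draw


# One extra sample in the earlier step changes every subsequent draw,
# so the two simulations diverge from that point onwards.
assert later_draw(3) != later_draw(4)
```

This is why a refactor that merely changes *how many* random numbers are generated, without changing any probabilities, can still change model output.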
Thank you @willGraham01! Which PR are these fixes coming in from, #1292? It would be good to know when they have been merged into master!
Yep, these are fixes coming in from #1292. Not sure why they weren't picked up in the tests, but it mostly seems like I inadvertently changed some of the logic surrounding when to generate random numbers by combining `if` statements.
I'm working through the whole #1292 branch now, cherry-picking each commit in sequence and checking against the test we have above. The fix will be in https://github.com/UCL/TLOmodel/pull/1366; I'll tag you when it's ready for review.
Thank you @willGraham01 - so just to clarify, was the issue just one of random number generation? I'm really struggling to understand why that would have led to the significant changes described at the top of this PR.
> not sure why they weren't picked up in the tests
This is a good point and suggests we might want to look at adding tests that would have flagged this.
One generic option would be to add a test, similar to the script we've been using with `git bisect`, which checks the population dataframe checksum at the end of a short simulation against a reference value. While in some cases we will expect changes to the model to change the population dataframe, this would just require PRs making such changes to explicitly update the reference checksum; by default we would then get a failure flagged if a PR unexpectedly changes the population dataframe. This might get difficult to manage though if we have PRs which change the reference checksum and are merged into master after other PRs which also change the reference checksum (and were merged in after the first PR was branched off). An alternative would be to just list the final population dataframe checksum as an additional output on the profiling runs summary table, as this would at least give us some indication of when model behaviour changes.
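A hypothetical sketch of such a regression test (the short-simulation helper is an illustrative stand-in, not TLOmodel API; in practice `REFERENCE_CHECKSUM` would be a hard-coded string that PRs update deliberately):

```python
import hashlib

import pandas as pd


def population_checksum(df: pd.DataFrame) -> str:
    """Stable digest of a dataframe's values, dtypes and index."""
    return hashlib.sha1(
        pd.util.hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()


def run_short_simulation() -> pd.DataFrame:
    # Stand-in for running a short scale_run-style simulation and
    # returning the final population dataframe.
    return pd.DataFrame({"is_alive": [True, True, False], "age_years": [1, 34, 70]})


# In a real test this would be hard-coded, and updated explicitly by any PR
# that intentionally changes model output.
REFERENCE_CHECKSUM = population_checksum(run_short_simulation())


def test_population_checksum_unchanged():
    assert population_checksum(run_short_simulation()) == REFERENCE_CHECKSUM, (
        "Population dataframe changed; if intentional, update REFERENCE_CHECKSUM"
    )
```

An unexpected change in model behaviour then shows up as a single failing test, at the cost of every intentional behaviour change having to touch the reference value.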
> Thank you @willGraham01 - so just to clarify, was the issue just one of random number generation? I'm really struggling to understand why that would have led to the significant changes described at the top of this PR.
I'm still unsure, but working through it commit by commit I'm forming some ideas as to why.
At present, the head of https://github.com/UCL/TLOmodel/pull/1366 (3946eaa) passes the script above that we're using to test. I've just started trying to apply commit 712f978, which refactors `CardioMetabolicDisorders` into the `generic_first_appt` framework. Applying this commit results in another divergence from the "checksum check" we are doing.
However the changes introduced in this commit are essentially just renaming methods:

- `CardioMetabolicDisorders.determine_if_will_be_investigated` to `CardioMetabolicDisorders.do_at_generic_first_appt` (plus adding a `person_age > 5` check at the start).
- `CardioMetabolicDisorders.determine_if_will_be_investigated_events` to `CardioMetabolicDisorders.do_at_generic_first_appt_emergency`.

As far as I can tell, this refactoring is correct (it just renames the two functions we're already using and changes the call signatures).
More subtly though, it means the order in which the modules run their generic appointment function changes. The modules run their `.do_at_generic_first_appt` method in whatever order the simulation object stores the `Module`s in its dictionary, so before this commit we had a workflow like this:

```python
for module in sim.modules:
    if <the module has been refactored to have a do_at_generic_first_appt method>:
        module.do_at_generic_first_appt()
# Handle all other modules in no particular order, one of which is
CardioMetabolicDisorders.determine_if_will_be_investigated()
```

After this commit, we now have a workflow like this:

```python
for module in sim.modules:
    if <the module has been refactored to have a do_at_generic_first_appt method>:
        module.do_at_generic_first_appt()
    # This step now includes CardioMetabolicDisorders, so in particular it runs
    # its method before some of the other modules that also use RNG.
# Handle all other modules in no particular order,
# which no longer includes CardioMetabolicDisorders.determine_if_will_be_investigated()
```
However it seems the order in which the modules call `do_at_generic_first_appt` does indeed matter, since the final result changes when they run in different orders.
I'm continuing to dig deeper here, since this is the first commit I've encountered where I'm confident the code refactoring is correct but the checksum script is repeatedly saying a difference has emerged.
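The ordering dependence described above can be sketched in isolation: Python dicts iterate in insertion order, and equal-priority ties in a heap-backed queue fall back to an insertion counter, so moving a module between the "refactored" and "unrefactored" groups changes execution order even though the same events are scheduled (module names here are just labels, not the real scheduling code):

```python
import heapq


def execution_order(module_names):
    """Push one equal-priority HSI event per module; ties break on the
    insertion counter, so iteration order leaks into execution order."""
    queue = []
    for counter, name in enumerate(module_names):
        heapq.heappush(queue, (1, counter, f"HSI_{name}"))
    return [heapq.heappop(queue)[-1] for _ in range(len(queue))]


before = execution_order(["Malaria", "CardioMetabolicDisorders"])
after = execution_order(["CardioMetabolicDisorders", "Malaria"])
# Same events, same priority -- but a different execution order.
assert before != after
```

Nothing here is "wrong" in either ordering; the point is only that the result is order-sensitive, which is exactly what the checksum check is detecting.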
Further to the above: commit 24567b9c is passing. Note that on lines 140-180 I have manually excluded `CardioMetabolicDisorders` from being run as part of the module loop, and instead run it afterwards. If I then remove lines 164-180 (the execution of `CardioMetabolicDisorders` outside the loop), and also remove the lines which exclude it from the main loop, the code fails the checksum check.
Some observations: nothing suggests `CardioMetabolicDisorders` itself is editing any shared values either. As such, this leads me to conclude that it must be some kind of artefact of how we are generating random numbers. Still investigating...
Update: I have narrowed it down to something to do with the `Malaria` module; by forcing the loop to run the modules in a set order and moving the position of `CardioMetabolicDisorders`, I have found a conflict when the order of these two modules is swapped.
@marghe-molaro Think I've got to the bottom of this, but it raises a few modelling questions.

The order in which the modules' `do_at_generic_first_appt` methods run changes the order in which `HSI` events are added to the queue. The `HealthSystem` then runs the `HSI` events, which results in a knock-on effect. Unless the old order in which the `do_at_generic_first_appt` methods handled the `Module`s was in fact a design choice, I think the differences we're seeing are just forward propagation of this different order of adding follow-up `HSI` events to the queue.
So I'm not sure what the correct solution here is, as it seems to raise a modelling question as well as a programmatic one. Based on the above, I am confident I could now continue to apply the changes in #1292 commit-by-commit, and so long as I retain this "original ordering" between the modules I can preserve the original behaviour and reconcile it with the refactoring. Merging the result into master would presumably then make the batch results align. But then we need to answer the question of why the model depends on the order in which the `Module`s are considered (which I believe we were dependent on anyway, but which is now quite clearly broadcast).
Hi @willGraham01, thank you so much for this!
The idea that this would be related to an issue with the ordering of HSIs in the queue makes sense because the difference in population-wide health outputs (see top of this PR) is most visible in mode 2 - when being at the top of the queue "pays off" - rather than in mode 1.
What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see the `randomise_queue` parameter) exactly to prevent such artificial effects from emerging. The question is then why the randomisation fails to truly prevent such effects from occurring.
Clarification: these artificial effects would be prevented at a statistical level in a large population; in a small population they might well lead to differences in outcomes for the few individuals considered.
I will have a look and a think.
Sorry @willGraham01, is the "long version" example you are describing running in mode 1 or mode 2?
> What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see the `randomise_queue` parameter) exactly to prevent such artificial effects from emerging. The question is then why the randomisation fails to truly prevent such effects from occurring.
This does seem odd, but I think it could still be consistent with what we're seeing here? Swapping the order in which `CMD` and `Malaria` are added to the queue means `randomise_queue` now assigns `Malaria` a higher "equal" priority than the `HIV` event, which starts the divergence. Since all the RNG is seeded separately, we'd be doing:

"Good code":

1. Queue `HIV` for person 466 (`randomise_queue` gives priority 2, for the sake of argument)
2. Queue `Malaria` for person 865 (`randomise_queue` gives priority 3)
3. Queue `CMD` for person 865 (`randomise_queue` gives priority 1)
4. Run `CMD`, then `HIV`, then `Malaria`, which doesn't run due to resource constraints.

With the same RNG,

"Refactored code":

1. Queue `HIV` for person 466 (priority 2)
2. Queue `CMD` for person 865 (priority 3)
3. Queue `Malaria` for person 865 (priority 1)
4. Run `Malaria`, then `HIV`, which doesn't run due to resource constraints, then `CMD`, which runs because it doesn't need the same resources.

I don't know if this is what's happening, but it's a possibility that's consistent with my understanding of `randomise_queue`. It only works of course because it's the same person (865) being assigned two subsequent events.
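A sketch of this mechanism, assuming the random tiebreaks are drawn from a seeded stream in enqueue order (the event labels and queue structure here are illustrative, not the `HealthSystem` implementation):

```python
import heapq

import numpy as np


def run_order(enqueue_order, seed=0):
    """Tiebreak priorities come from one seeded stream in enqueue order, so
    swapping two enqueues swaps which tiebreak each event receives."""
    rng = np.random.default_rng(seed)
    queue = []
    for event in enqueue_order:
        heapq.heappush(queue, (rng.random(), event))  # randomised 'equal' priority
    return [heapq.heappop(queue)[1] for _ in range(len(queue))]


order_a = run_order(["HIV_466", "Malaria_865", "CMD_865"])
order_b = run_order(["HIV_466", "CMD_865", "Malaria_865"])
# Same events, same seed, but the relative order of the two events for person
# 865 flips -- which matters once capacity runs out partway through the day.
assert order_a != order_b
```

The randomisation makes the queue order unpredictable, but it does not make it invariant to enqueue order, which is the property that the refactoring disturbed.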
> Sorry @willGraham01, is the "long version" example you are describing running in mode 1 or mode 2?
Everything here is running mode 2
> What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see the `randomise_queue` parameter) exactly to prevent such artificial effects from emerging. The question is then why the randomisation fails to truly prevent such effects from occurring.

> This does seem odd, but I think it could still be consistent with what we're seeing here? Swapping the order in which `CMD` and `Malaria` are added to the queue -> `randomise_queue` now assigns `Malaria` a higher "equal" priority than the `HIV` event -> starts the divergence.
Sorry, you're absolutely right. I've now clarified above that what I mean is that at a statistical level/for large populations this shouldn't make a difference - i.e. the outcome for a specific individual or a few individuals may differ (which is what you're seeing), but for large enough populations those differences should cancel out if the ordering of the queue for equal-priority events is truly random. Could this be an indication that we're not considering large enough populations?
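A quick sketch of this intuition (not TLOmodel code): per-pair outcomes of random tiebreaks are arbitrary, but over many independent equal-priority pairs the fraction in which one event beats the other converges towards 1/2:

```python
import numpy as np


def fraction_x_scheduled_first(n_pairs: int, seed: int = 1) -> float:
    """For n_pairs independent equal-priority pairs, the fraction in which
    event X receives a lower (earlier) random tiebreak than event Y."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_pairs)  # tiebreaks for event X in each pair
    y = rng.random(n_pairs)  # tiebreaks for event Y in each pair
    return float(np.mean(x < y))


# For a handful of pairs the imbalance can be substantial; across many pairs
# it washes out towards 0.5, so individual outcomes may differ between runs
# while population-level aggregates should not.
small_population = fraction_x_scheduled_first(10)
large_population = fraction_x_scheduled_first(100_000)
assert abs(large_population - 0.5) < 0.01
```

This only addresses sampling noise in the tiebreaks themselves; if an ordering change systematically shifts which event *type* wins for the same person, the differences need not cancel.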
By the same token as @marghe-molaro says, I am wondering whether the test being used for the `git bisect` is sufficiently capturing the unexpected behaviour that @marghe-molaro notes in her runs. Might it be useful to either (i) do a large run again with the changes made so far and see if we recover the original behaviour, or (ii) write a super-discriminatory test (e.g. in a large population over a short run, compare squeeze factors against a reference value)?
> By the same token as @marghe-molaro says, I am wondering whether the test being used for the `git bisect` is sufficiently capturing the unexpected behaviour that @marghe-molaro notes in her runs. Might it be useful to either (i) do a large run again with the changes made so far and see if we recover the original behaviour, or (ii) write a super-discriminatory test (e.g. in a large population over a short run, compare squeeze factors against a reference value)?
Just to point out @tbhallett that wrt (ii), squeeze factors will remain constant even as the HSI queue changes, because squeeze factors are only meaningful in mode 1, where the refactoring of the queue doesn't lead to any differences anyway.
As for (i), I can merge the changes that @willGraham01 has just made and then resubmit a new batch of runs in mode 2 (with and without rescaling, for curiosity) to check how things look. @willGraham01, have these changes already been merged into master?
And @willGraham01, can I double check first whether in these changes you are now forcing the order of the modules in the generic_first_appts to be the same as it used to be originally?
The branch I'm working on is https://github.com/UCL/TLOmodel/pull/1366; it hasn't been merged into master yet. This commit forces the "old" order of the modules though, and everything seems to be consistent with 012ef6d (the commit before #1292 was merged).
@marghe-molaro feel free to merge #1366 into an experimental branch, but it 100% shouldn't be merged into master yet :sweat_smile: I also haven't had a chance to work through the remaining commits in #1292 (to resolve the merge conflicts with master), so you'll get a lot of merge conflicts too!
I can do this, this afternoon & tomorrow morning, and @marghe-molaro I can notify you when you should be able to merge #1366 into an experimental branch, and set off a batch job like you suggested to see how things look?
Perfect, sounds great @willGraham01! I'll be on standby for #1366.
@marghe-molaro #1366 now has master merged in and passes the script we've been using here. You can see the (now explicit) ordering of the modules in the generic appointments code.
Hopefully you should be able to merge this into your own branch :crossed_fingers: and run some scenarios to see if we've found the cause.
Hi @willGraham01 @matt-graham @tbhallett,
The test run from #1366 has now completed. We do indeed seem to recover the behaviour that we were observing at the start (compare the red line here with the "No growth" case in the left plot at the start of this Issue), i.e. when rescaling in mode 2 there is no dramatic increase in AIDS DALYs after switching modes.
While it is reassuring that we recover the original results, if the true cause of the discrepancy was the order in which the modules are called during the first generic appointment, it is concerning that this would be observable in runs with a population size of 100,000. However @willGraham01, if I understood correctly there were some other minor fixes you brought in with #1366, so maybe/hopefully the blame, for some as-yet-unidentified reason, lies with those?
We could maybe resubmit a run after reshuffling the order in which modules are called in #1366 to check whether this really is the source of the discrepancy? If we confirmed this to be the case, this artificial effect would require some quite careful consideration.
Strong support these sentiments and the desire to understand if/how the ordering of processing in the first appointments has such a big effect.
However @willGraham01 if I understood correctly there were some other minor fixes you brought in with https://github.com/UCL/TLOmodel/pull/1366, so maybe/hopefully the blame, for some for now non-identified reason, was with those?
Will's away this week, but as far as I'm aware, other than the module order change in #1366, the other changes were mainly ones ensuring that calls to the random number generator happen in the same order, to give consistent results (which would hopefully also not have a major effect on the distribution of results with a large population). From quickly scanning the diff, the one change I can see that might affect model behaviour is a change to a conditional statement in the diarrhoea module to add an additional clause (I have just tagged you both in a comment on this line as I couldn't see an easy way to link from here).
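To illustrate why the order of RNG calls matters for reproducibility, here is a minimal sketch (the module names are purely illustrative, and TLOmodel in fact uses per-module RNGs; a single shared stream is assumed here for simplicity):

```python
import numpy as np

def generic_first_appt(modules, seed):
    # One shared RNG stream, consumed by each module in turn as it decides
    # what to do for the patient.
    rng = np.random.default_rng(seed)
    return {name: rng.random() for name in modules}

old_order = generic_first_appt(["hiv", "malaria", "diarrhoea"], seed=0)
new_order = generic_first_appt(["malaria", "hiv", "diarrhoea"], seed=0)

# Reordering reassigns the same draws to different modules, so each module's
# decision for this individual can flip...
print(old_order["hiv"] == new_order["hiv"])                      # False
# ...even though the set of numbers drawn is identical.
print(sorted(old_order.values()) == sorted(new_order.values()))  # True
```

This is why reordering modules changes individual trajectories bit-for-bit even when no module's logic has changed, and why the fixes aimed at restoring the original call order recover the original outputs exactly.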
Thank you @matt-graham! I was hoping you might be able to help me clarify something while @willGraham01 is away: in one of his previous replies he said that:
- The failure cannot be because one of the modules in the main loop is editing any shared dataframe properties, since lines 164-180 occur after the dataframe update from the main loop is applied, but still use the "cached" values that the main loop utilised.
My understanding though is that prior to @willGraham01's revision, modules were not only reading shared dataframe properties during _do_on_generic_first_appt, but also modifying them, hence the need for this loop to exist at the end of the function:
```python
if proposed_patient_details_updates:
    df.loc[
        self.target, proposed_patient_details_updates.keys()
    ] = proposed_patient_details_updates.values()
```
Moving to the use of "cached" values would therefore necessarily lead to differences, as modules would now be accessing the individual's original parameters, rather than those updated by the modules called before inside _do_on_generic_first_appt. Am I missing something?
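A minimal sketch of the difference being described, using a hypothetical is_pregnant property (this is illustrative, not the actual TLOmodel implementation):

```python
import pandas as pd

target = 0

# Old behaviour (sketch): modules read and write the live dataframe in turn,
# so a later module sees changes made by an earlier one.
df_live = pd.DataFrame({"is_pregnant": [False]})
df_live.loc[target, "is_pregnant"] = True             # module A writes
seen_live = bool(df_live.loc[target, "is_pregnant"])  # module B reads: True

# New behaviour (sketch): modules read a snapshot taken before the loop and
# only propose updates, which are applied in bulk afterwards.
df_cached = pd.DataFrame({"is_pregnant": [False]})
snapshot = df_cached.loc[target].to_dict()
proposed_updates = {"is_pregnant": True}     # module A's proposal
seen_cached = snapshot["is_pregnant"]        # module B reads: still False
df_cached.loc[target, list(proposed_updates)] = list(proposed_updates.values())

print(seen_live, seen_cached)  # True False
```

If the old code really did let later modules observe earlier modules' writes within the same appointment, then switching to cached reads is a behavioural change, not just a refactoring.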
It may be a moot point, but I also noticed that this section will lead to the updating not working as expected if more than one module is trying to update the same property. (The updates from the last module will "win".)
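A toy illustration of this "last module wins" behaviour, with hypothetical property names and proposals:

```python
# Proposed updates are pooled in one dict per patient, so when two modules
# propose a value for the same property, the later proposal silently
# overwrites the earlier one.
proposed_patient_details_updates = {}
for module_proposal in (
    {"hv_inf": True, "is_pregnant": False},  # hypothetical module A
    {"hv_inf": False},                       # hypothetical module B
):
    proposed_patient_details_updates.update(module_proposal)

print(proposed_patient_details_updates)
# {'hv_inf': False, 'is_pregnant': False} -- module B's value won
```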
As was mentioned in the catch-up meeting, this is now a blocker on several analyses. I'm at a workshop today, but will have time tomorrow to dig into it. I think if the situation looks difficult to fix quickly we should roll back the refactoring changes so the epi team can continue on with their work.
Hi @tamuri, do you have any suggestions as to how we should proceed? If we do roll back is that going to come in through master? Many thanks in advance!
PR #1308 allows user to rescale HR capabilities to capture effective capabilities when transitioning to mode 2.
Two batches of runs were submitted to Azure from this PR, effect_of_policy-2024-04-09T132919Z (2 runs per draw) and effect_of_capabilities_scaling-2024-04-29T074640Z (10 runs per draw). While the rescaling seems to be occurring in both sets of runs, the evolution of the DALYs recovered is markedly different even though all other factors (cons. availability, HR used, etc...) are identical, as shown in the plots below.
The two branches from which these jobs were submitted produce identical outputs under the test test_rescaling_capabilities_based_on_squeeze_factors, suggesting the difference does not arise from the stylistic changes in how the rescaling is computed in the more recent branch, as suggested by @matt-graham. While the more recent branch includes checks on the rescaling factors not being infinite, which would have been triggered for some mental health care workers, this should not affect the DALYs related to HIV/AIDS shown here, which do not rely on this medical worker class.
Other differences between the branches include changes brought in from master:
@matt-graham any suggestions on things to double check would be very appreciated!