Batch of runs submitted from PR #1308 before/after review&master merge inconsistent

marghe-molaro commented 4 months ago

PR #1308 allows user to rescale HR capabilities to capture effective capabilities when transitioning to mode 2.

Two batch of runs where submitted to Azure from this PR, effect_of_policy-2024-04-09T132919Z (2 runs per draw) and effect_of_capabilities_scaling-2024-04-29T074640Z (10 runs per draw). While the rescaling seems to be occurring in both sets of runs, the evolution of DALYs recovered is much different when all other factors (cons. availability, HR used, etc...) are identical, as shown in plots below. Scaling of capabilities

The two branches from which these jobs were submitted produce identical outputs to the test test_rescaling_capabilities_based_on_squeeze_factors, suggesting difference does not arise from stylistic changes in how rescaling is computed in more recent branch suggested by @matt-graham. While the more recent branch includes checks on refactoring not being infinite, which would have occurred in the case of some mental health care workers, this should not be affecting DALYs related to HIV/AIDS shown here which do not rely on this medical worker class.

Other differences between branches include changes brought over by master including:

Refactoring of generic first appointments
Transition to python 3.11
new scenario switcher class

@matt-graham any suggestions on things to double check would be very appreciated!

matt-graham commented 4 months ago

Hi @marghe-molaro. Nothing immediately comes to mind with regards to things that could have caused such a substantial change in model behaviour. As far as we're aware the changes to HSI events and generic first appointments in #1289, #1292 and #1315 shouldn't have changed the model output behaviour at all, and similarly changing to running on Python 3.11 on Azure in #1323 also should not have effected the model outputs. But obviously something has so maybe I'm assuming wrongly 😬.

Do you have the specific Git commit SHAs for the two sets of runs (I think should be recorded under the key commit in the JSON file in the top-level directory containing the batch job results)? With those we could try running (shorter) runs locally and try to see where they start diverging

marghe-molaro commented 4 months ago

I totally agree @matt-graham that none of these changes should have made a difference, that's why the difference in outputs is so frustrating!

The two Git commit SHAs are are follows: for effect_of_policy-2024-04-09T132919Z "commit": "c61895880abf8e942635839d7d62ed38c83f6f89", "github": "https://github.com/UCL/TLOmodel/tree/c61895880abf8e942635839d7d62ed38c83f6f89" [draw 0]

and for effect_of_capabilities_scaling-2024-04-29T074640Z "commit": "a91d60c518069ed11477f27fb3c81f1ee92302ca", "github": "https://github.com/UCL/TLOmodel/tree/a91d60c518069ed11477f27fb3c81f1ee92302ca" [matched by draw 6]

If you compare the parameter choice in the JSON file for the draws 0 and 6 respectively you can see they are identical. These correspond to the "No growth" scenarios in the plots above, which clearly differ.

I did a quick run locally for a very small population (100 individuals), and the rescaling factors at the moment of switching modes (2019) were different (around doubled in the more recent version of the PR); this seems to suggest that the difference is starting to emerge before the switch to mode 2/rescaling, however would the PRs on the generic first appointments have affected the sequence of random numbers? If so it might be quite hard to pin down whether the diversion is just down to the small population.

Maybe it would be good to submit a run from the latest branch rolling back the "infinity check" commit, to double check that it is not that creating issues. It's the only PR-specific change that was brought in.

matt-graham commented 4 months ago

Maybe it would be good to submit a run from the latest branch rolling back the "infinity check" commit, to double check that it is not that creating issues. It's the only PR-specific change that was brought in.

Agree this is probably worth doing, as even if this is unlikely to be the source of the difference it would be good to exclude it.

To check for consistency in simulations I'm running

python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout

with the two commits c61895880abf8e942635839d7d62ed38c83f6f89 and a91d60c518069ed11477f27fb3c81f1ee92302ca with some small changes to src/tlo/simulation.py to print a hash of the population dataframe after each event has been completed. As calculating the hash takes a little while this is unfortunately quite slow but the outputs after the first 500 events suggest the population dataframes have stayed exactly in sync. If this continues to be the case this suggests it might be something specific to the scenarios rather than the simulation, though it could well be that I just haven't reached the divergence point yet.

marghe-molaro commented 4 months ago

Thank you so much Matt. Do you have any intuition as to how many years in the simulation 500 events is? When running locally I was checking the divergence after 8 years. I was doing the local runs using the same scenario files from which I submitted the batches on Azure (with reduced population).

For completeness let me just note that the "infinity check" would only affect things after a mode switch, so the fact that things were starting to diverge earlier would suggest that it's unlikely to be the problem. But I will check directly.

Many thanks again!

matt-graham commented 4 months ago

The 500 events was less than a month of simulation time, this was very slow to run partly because event rate scales with population (and I was using a relatively large 50 000 initial population) and also because outputting dataframe hash on every event introduces a lot of overhead to the point that computing hashes ends up being the main computational cost. I've re-run with a more sensible configuration of an initial population of 5000 and outputting hashes after only every 100 events. So far after 60 000 events / just over seven years simulation time the population dataframes still appear to be in sync from the reported hashes - I will continue letting this run just to check divergences don't start appearing around the 8 year mark.

Assuming we don't get any later divergences, this suggests the broader changes to the framework in master that were merged in (particularly the refactoring of HSI events / generic first appointments) have not changed the model behaviour, which is what we expected. This is not immediately helpful in trying to identify what is causing the divergence here, but at least means we can probably ignore these changes when looking for the problem (unless there is some weird interaction with running simulations as part of a scenario that has meant these changes cause differences in model behaviour).

marghe-molaro commented 4 months ago

Thank you Matt!

On the "infinity check" side, I can confirm this couldn't have resulted in changes to model behaviour because infinities are already prevented from entering the 'Fraction_Time_Used' column at an earlier stage, namely: summary_by_officer['Fraction_Time_Used'] = ( summary_by_officer['Minutes_Used'] / summary_by_officer['Total_Minutes_Per_Day'] ).replace([np.inf, -np.inf, np.nan], 0.0) which means the added infinity check at the moment of rescaling is redundant.

matt-graham commented 4 months ago

Just as a quick follow up, the simulations continued to remain in sync up to 10 years simulation time, so I think that means we relatively confidently exclude that the differences are arising due to the changes to HSI events / generic first appointments.

marghe-molaro commented 4 months ago

Ok, what's even stranger is that when I run locally from the two branches using the scenario files and enforcing a mode switch after only one year (2011) and assuming 1000 individuals I already see a difference in the rescaling factors:

marghe-molaro commented 4 months ago

marghe-molaro commented 4 months ago

When I run locally with the python src/scripts/profiling/scale_run.py scenario files, this is what I see on the first day in terms of minutes used:

Minutes used at level 2 appear to be consistent between two branches;
Minutes used at level 1a appear to be similar but not identical between the two branches; If you are seeing the event chain as identical, I wonder if the issue could be down to the assumed footprint? I.e. the sequence of events is identical, but the registered request in terms of minutes has changed.

marghe-molaro commented 4 months ago

The commit leading to changes in the "Minutes Used" in this branch is commit d56042fb65a30e258e16d2bcf342bf933963e5bf, which merged master into branch.

matt-graham commented 4 months ago

Thanks for the further 🕵️ work @marghe-molaro!

If you are seeing the event chain as identical, I wonder if the issue could be down to the assumed footprint? I.e. the sequence of events is identical, but the registered request in terms of minutes has changed.

Yeah I have only been checking for consistency in terms of the values in the population dataframe. For any logged outputs which don't 'feedback' into the model in terms of affecting triggerring or effects of events, then this wouldn't be captured by what I was checking. I'll try set something up to see if/when the logged capabilities start diverging.

The commit leading to changes in the "Minutes Used" in this branch is commit https://github.com/UCL/TLOmodel/commit/d56042fb65a30e258e16d2bcf342bf933963e5bf, which merged master into branch.

Thanks for isolating this. Given this doesn't touch healthsystem.py it not obvious to me what would be causing these differences, but I'll see if stepping through simulations reveals anything!

matt-graham commented 4 months ago

Hi @marghe-molaro.

Running

python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000

on a91d60c51 with a line

print(summary_by_officer.loc["Clinical"])

(choosing "Clinical" here just to keep output manageably sized and because there were differences evident for this officer type in output you recorded) added to Healthsystem.log_current_capabilities_and_usage I get output after the 10000th event

10000 2011-04-17 00:00:00 13314295d95889a198c06d870d40a548244f2124
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        380.24            2.962632
2                          270.021738        132.24            0.489738
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        357.06            2.782025
2                          270.021738        259.12            0.959626
3                          113.389117         20.00            0.176384
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        435.56            3.393656
2                          270.021738        352.12            1.304043
3                          113.389117         60.00            0.529151
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        379.12            2.953905
2                          270.021738        341.00            1.262861
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000

and running the equivalent on c61895880 we get

10000 2011-04-17 00:00:00 13314295d95889a198c06d870d40a548244f2124
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        380.24            2.962632
2                          270.021738        132.24            0.489738
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        357.06            2.782025
2                          270.021738        259.12            0.959626
3                          113.389117         20.00            0.176384
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        435.56            3.393656
2                          270.021738        352.12            1.304043
3                          113.389117         60.00            0.529151
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        379.12            2.953905
2                          270.021738        341.00            1.262861
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000

which seems to be identical (and from some quick eyeballing I can't spot any divergences anywhere else in output).

Can I check what commands you were running @marghe-molaro where you got the differences you posted above in the logged capabilities?

marghe-molaro commented 4 months ago

I get the difference when running python src/scripts/profiling/scale_run.py.

Admittedly when running python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000 I get the same result (on the first day of the sim at least)

marghe-molaro commented 4 months ago

Just copying it here to double check I'm not going out of my mind. "first" and "second" component-comparison are just the names of the copies of the two commits I made.

You can see when using python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout I get different results, whereas when using python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000 the output is the same.

matt-graham commented 4 months ago

Thanks @marghe-molaro that's helpful. The default initial population is much larger than 5000 so it's possible it's something to do with this. Currently away from keyboard but will check when back if I see differences running with default arguments. If it's some rare event causing the discrepancy that could explain why only evident at larger populations possibly?

marghe-molaro commented 4 months ago

Thank you so much @matt-graham! Just to add to the general confusion, when I was running locally with a different scenario file and forcing the population to be quite small (<=1000) I was still seeing discrepancy on day one...so I'm really confused as to why it's still not showing up for you with a population of 5000.

marghe-molaro commented 4 months ago

PS before you do anything else, please check if you can recover the discrepancy between the commits on the first day of the sim when running with python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout, so far I'm the only one seeing these discrepancy so I'd like to make sure you see them too!

matt-graham commented 4 months ago

Hmm odd, I don't seem to get any discrepancies running

python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout

On a91d60c51

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

and on c61895880

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

Do you have a clean working tree where you're running these?

matt-graham commented 4 months ago

Interestingly the figures I get seem to be the same as for your molaro/second-component-comparison branch.

marghe-molaro commented 4 months ago

This is my commit log for molaro/first-component-comparison. commit a6cfbd4c91209ace210c3a282d43837358b99cfd is on this branch before the master merge, the two "Test" commits just modify the healthsystem to print the summary log.

matt-graham commented 4 months ago

Just realised most of the checks for consistency I performed previously were bogus as I was running one set of results from an additional working tree, but as I was using the same conda environment for both in which the tlo package was installed in editable mode by pointing to the main working tree, running a script from the additional working tree still used the tlo source code from the main working tree :facepalm:.

After now trying to run

python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout

from the two commits with a line print(summary_by_officer.loc["Clinical"]) added to Healthsystem.log_current_capabilities_and_usage in both cases I get the same differences in output as @marghe-molaro has been reporting

On a91d60c51

                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

and on c61895880

                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000           0.0            0.000000
1a                        1283.453567         618.0            0.481513
2                         2700.217379        1184.0            0.438483
3                         1133.891165           0.0            0.000000
4                           11.589373           0.0            0.000000
5                           26.798780           0.0            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        693.81            0.540581
2                         2700.217379       1023.00            0.378858
3                         1133.891165        797.00            0.702889
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

matt-graham commented 4 months ago

Repeating the above with population dataframe hash also being printed out every 100 events we get output on c61895880

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000           0.0            0.000000
1a                        1283.453567         618.0            0.481513
2                         2700.217379        1184.0            0.438483
3                         1133.891165           0.0            0.000000
4                           11.589373           0.0            0.000000
5                           26.798780           0.0            0.000000
100 2010-01-02 00:00:00 3ff9fc28a196f298b69314ac3c31ae1d95db5623
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        693.81            0.540581
2                         2700.217379       1023.00            0.378858
3                         1133.891165        797.00            0.702889
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

and on a91d60c51

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

So the population dataframes are diverging. This does suggest something in the merge commit d56042f @marghe-molaro previously identified as being where the difference appear has caused a change in model behaviour.

matt-graham commented 4 months ago

To try to isolate which commit to master that was merged in caused the differences, I created a script population_checksum_check_for_bisect.py which checks the population dataframe checksum against a reference value for a short simulation of 1 month with 1000 initial population (after finding this does give different checksums on the relevant commits)

REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7
YEARS=0
MONTHS=1
INITIAL_POPULATION=1000
python src/scripts/profiling/scale_run.py --years $YEARS --months $MONTHS --log-final-population-checksum --initial-population $INITIAL_POPULATION | tail -n 1 | grep -q "Population checksum: ${REFERENCE_CHECKSUM}"

Then running git bisect with

git checkout 82a02e9740f4129cd5301b2be717bab4eab7a717
git bisect start
git bisect bad
git checkout 012ef6d7fd1e5635ce1987abd70cda0f8d5e4ae9
git bisect good
git bisect run ./population_checksum_check_for_bisect.py

which uses 012ef6d7fd1e5635ce1987abd70cda0f8d5e4ae9 as 'good' commit (commit on master #1308 was initially branched off from) and 82a02e9740f4129cd5301b2be717bab4eab7a717 as 'bad' commit (commit on master at point it was merged in to #1308 in d56042fb65a30e258e16d2bcf342bf933963e5bf) with this identifying bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d (merge of #1292) as the relevant first 'bad' commit.

So looks like #1292 introduced some changes to the model behaviour inadvertently. @willGraham01 do you think you might be able to help with tracking down what in #1292 might have led to the differences? We could probably do a git bisect similar to above on the branch #1292 was based on to isolate which commit in that branch the differences first arose.

matt-graham commented 4 months ago

Looks like some of the commits in wgraham/1237-speedup-concept don't allow running the check script as there are errors in the code which means the scale_run.py script fails to run. Might need to adjust the check script to return a special exit code 125 if scale_run.py fails to complete without error, exit code 0 if it completes and gives correct checksum and exit code 1 if it completes but gives wrong checksum.

willGraham01 commented 4 months ago

After some slight edits to @matt-graham's script above to account for the 125 exit codes,

#!/bin/bash

REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7
YEARS=0
MONTHS=1
INITIAL_POPULATION=1000
RUN_CHECKSUM=$( \
    python src/scripts/profiling/scale_run.py \
    --years $YEARS --months $MONTHS --log-final-population-checksum \
    --initial-population $INITIAL_POPULATION | tail -n 1 \
    | grep -oP "(?<=checksum: ).*(?=\"])" \
)

if [ $? -ne 0 ]
then
    echo "Python script failed, commit cannot be tested"
    exit 125
fi

echo "Run checksum was:  $RUN_CHECKSUM"
echo "Checksum to match: $REFERENCE_CHECKSUM"

if [[ "$RUN_CHECKSUM" == "$REFERENCE_CHECKSUM" ]]
then
    echo "That's a match"
else
    echo "No match, fail"
    exit 1
fi

and then bisecting along the wgraham/1237-speedup-concept branch;

git checkout wgraham/1237-speedup-concept
git bisect start
git bisect bad
git checkout 012ef6d
git bisect good
git bisect run <the-script-above>

we get that the first "bad" commit is 3863c0b2c9120c274eea6033b255a694e0f0df9a. I'll take a look at this commit in more detail and report back!

Fixes

3863c0b2c9120c274eea6033b255a694e0f0df9a - some logic translation in check_if_fever_is_caused_by_maleria. Discrepancy fixed but tests need updating.

0fd30ba22 introduces a change in the number of random samples generated by the stunting module, which is also causing a divergence. Fixed this.

marghe-molaro commented 4 months ago

Thank you @willGraham01! Which PR are these fixes coming in from, #1292? It would be good to know when they have been merged in master!

willGraham01 commented 3 months ago

Yep these are fixes coming in from #1292 - not sure why they weren't picked up in the tests but it mostly seems like I inadvertently changed some of the logic surrounding when to generate random numbers by combining if statements.

I'm working through the whole #1292 branch now, cherry-picking each commit in sequence and checking against the test we have above. The fix will be in https://github.com/UCL/TLOmodel/pull/1366, will tag you when it's ready for review.

marghe-molaro commented 3 months ago

Thank you @willGraham01 - so just to clarify was the issue just one of random number generation? Really struggling to understand why that would have led to the significant changes in described at the top of this PR.

matt-graham commented 3 months ago

not sure why they weren't picked up in the tests

This a good point and suggests we might want to look at adding tests that would have flagged this.

One generic option would be to add a test which is similar to the script we've been using with git bisect which checks the population dataframe checksum at the end of a short simulation against a reference value. While in some cases we will expect changes to the model to change the population dataframe, this would just require PRs making such changes to explicitly update the reference checksum, but by default we would get a fail flagged if a PR unexpectedly changes population dataframe. This might get difficult to manage though if we have PRs which change the reference checksum and are merging into master after other PRs which also change the reference checksum (and were merged in after the first PR was branched off). An alternative would be to just have the final population dataframe checksum listed as an additional output on profiling runs summary table as this would at least give us some indication of when model behaviour changes.

willGraham01 commented 3 months ago

Thank you @willGraham01 - so just to clarify was the issue just one of random number generation? Really struggling to understand why that would have led to the significant changes in described at the top of this PR.

I'm still unsure, but working through commit by commit I'm encountering some ideas as to why.

At present, the head of https://github.com/UCL/TLOmodel/pull/1366 (3946eaa) passes the script above that we're using to test. I've just starting trying to apply commit 712f978. This commit refactors CardioMetabolicDisorders into the generic_first_appt framework. Applying this commit results in another divergence from the "checksum check" we are doing. However the changes introduced in this commit are essentially just renaming methods:

CardioMetabolicDisorders.determine_if_will_be_investigated to CardioMetabolicDisorders.do_at_generic_first_appt (plus adding an person_age > 5 check at the start).
CardioMetabolicDisorders.determine_if_will_be_investigated_events to CardioMetabolicDisorders.do_at_generic_first_appt_emergency

As far as I can tell, this refactoring is correct (it just renames the two functions we're already using and changes the call signatures).

More subtly though, it means the order the modules run their generic appointment function changes. The modules run their .do_at_generic_first_appt method in whatever order the simulation object stores the Modules in its dictionary, so before this commit, we had a workflow like this:

for module in sim.modules:
  if <the module has been refactored to have a do_at_generic_first_appt method>:
    module.do_at_generic_first_appt()

# Handle all other modules in no particular order, one of which is

CardioMetabolicDisorders.determine_if_will_be_investigated()

After this commit, we now have a workflow like this:

for module in sim.modules:
  if <the module has been refactored to have a do_at_generic_first_appt method>:
    module.do_at_generic_first_appt()
    # This step now includes CardioMetabolicDisorders, so in particular it runs
    # its method before some of the other modules that also use RNG.

# Handle all other modules in no particular order,
# which no longer includes CardioMetabolicDisorders.determine_if_will_be_investigated()

However it seems the order in which the modules are calling do_at_generic_first_appt does indeed matter since the final result is now changing when they run in different orders.

I was under the impression that every module had its own, independent random number generator, so I doubt it is anything to do with the order of random numbers being generated.
The only other thing that comes to mind is that one of the modules wants to update a property in the dataframe that another module then reads. This would introduce a dependency that we are not accounting for in the new framework (since we apply all DF updates after every module has run this method, and not in between).

I'm continuing to dig deeper here, since this is the first commit I've encountered where I'm confident the code refactoring is correct but the checksum script is repeatedly saying a difference has emerged.

willGraham01 commented 3 months ago

Further to the above; commit 24567b9c is passing. Note that on lines 140-180 I have manually excluded CardioMetabolicDisorders from being run as part of the module loop, and instead run it afterwards.

If I then remove lines 164-180, which is the execution of CardioMetabolicDisorders outside the loop, and also remove the lines which exclude it from the main loop, then the code fails the checksum check.

Some observations:

The failure cannot be because one of the modules in the main loop is editing any shared dataframe properties, since lines 164-180 occur after the dataframe update from the main loop is applied, but still use the "cached" values that the main loop utilised.
The failure is not because CardioMetabolicDisorders itself is editing any shared values either.

As such, this leads me to conclude that it must be some kind of artefact with how we are generating random numbers. Still investigating...

Update: have narrowed it down to something to do with the Malaria module; by forcing the loop to run the modules in a set order, and moving the position of CardioMetabolicDisorders, I have found a conflict when the order of these two modules is swapped.

willGraham01 commented 3 months ago

@marghe-molaro Think I've got to the bottom of this, but it raises a few modelling questions.

Short version

The refactored method in #1292 is correct, and the order the modules run in doesn't effect the events that are created.
However, it does affect the order in which these follow-up HSI events are added to the queue.
This appears to lead to different resolutions to resource conflicts as the HealthSystem runs HSI events - which then results in a knock on effect.

Long Version

Everything I say here is in relation to [24567b9c](https://github.com/UCL/TLOmodel/tree/24567b9c6835854aac6adbab1d3e17b4f3897261), which is attempting to refactor the `CardioMetabolicDisorders` into the generic framework. From my logging, we can see that person `865`'s `Malaria` treatment appears to switch places with person `466`'s `HIV` treatment. After 1 month I'm seeing ~3 rows with differences in the population dataframe for an initial pop of 1000, so I can believe that for the longer simulations this has a butterfly effect. - The first difference (in pop dataframe and `HSI` queue) occurs on `2010-01-26`. This is when patient `865` seeks a generic appointment, and is scheduled both a `Malaria` follow up _and_ `CardioMetabolicDisorders` (`CMD`) follow-up. - Patient `865`'s `CMD` follow-up runs the same day in both simulations, however their `Malaria` follow up conflicts with person `466`'s `HIV` follow-up. - In the original code (scheduling the `Malaria` follow up before the `CMD`), person `466` has received their `HIV` treatment before the end of `2010-01-26`, and `865` is still waiting on their `Malaria` follow-up. In the refactored code (`CMD` before `Malaria`) person `466`'s `HIV` treatment has _not_ been able to run, and `865` has received their `Malaria` follow-up. - All subsequent days then have differences in the `HSI` queue and dataframe, which are caused by these changes having a knock-on effect. This is likely due to how the `HSI` events are scheduled based on the available resources at the beginning of each day. - From the logs, we can see that until `2010-01-28`, most of the changes are just the two follow-ups I mentioned above switching places. However on `2010-01-28`, 466's `HIV` treatment then requires a `TB` appointment, which then starts to conflict with the resources allocated to patients `984`, `344`, `325`. From here, more differences start to emerge. I'll also point out that we're only seeing this effect at such a small population size once - I imagine in the larger simulation this competition for resources is far more common.

Questions

Unless the old order that the do_at_generic_first_appt methods handled the Modules was in fact a design choice, I think the differences that we're seeing are just forward propagation of this different order of adding follow-up HSI events to the queue.

The refactoring in #1292 has inadvertently introduced a change in the order HSI events, stemming from the same generic-first-appointment, are added to the queue.
When a person first seeks a generic appointment, they're checked by all the disease modules. The order of this checking isn't part of the model assumptions, as far as I'm aware.
In order to retain the old behaviour, we have to enforce this "set execution order" for the modules when they look at a person seeking a generic first appointment. As far as I can tell, this order was implemented arbitrarily in the first place. But us having this order seems to be particularly impactful on long simulations.

So I'm not sure what the correct solution here is, as it seems to raise a modelling question as well as a programmatic one. Based on the above, I am confident I could now continue to apply the changes in #1292 commit-by-commit, and so long as I retain this "original ordering" between the modules I can preserve the original behaviour and reconcile this with the refactoring. Merging the result into master would presumably then make the batch results align. But then we need to answer the question as to why the model depends on the order in which the Modules are considered (which I believe we were dependent on anyway, but now it is quite clearly broadcast).

Logs

The event queue at the end of each day, running the same simulation with the old code ("good") and refactored code ("bad"). The difference is the output of `diff -u bad.log good.log` which shows the lines which are different between the two simulations. [queue-diff.log](https://github.com/UCL/TLOmodel/files/15415193/queue-diff.log) [good-hsi-queue-log.log](https://github.com/UCL/TLOmodel/files/15415194/good-hsi-queue-log.log) [bad-hsi-queue-log.log](https://github.com/UCL/TLOmodel/files/15415195/bad-hsi-queue-log.log) The (Bash) script being run at the repository root to match the checksum of the final population dataframe, which will pass when run on the "good" code but fail on the "bad" code: ```bash #!/bin/bash # Good commit: 012ef6d, checksum should be dc273b2e0ef27a3d61afb954ede48b8ba038a9a7 REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7 YEARS=0 MONTHS=1 INITIAL_POPULATION=1000 RUN_CHECKSUM=$( \ python src/scripts/profiling/scale_run.py \ --years $YEARS --months $MONTHS --log-final-population-checksum \ --initial-population $INITIAL_POPULATION | tail -n 1 \ | grep -oP "(?<=checksum: ).*(?=\"])" \ ) if [ $? -ne 0 ] then echo "Python script failed" exit 125 fi echo "Run checksum was: $RUN_CHECKSUM" echo "Checksum to match: $REFERENCE_CHECKSUM" if [[ "$RUN_CHECKSUM" == "$REFERENCE_CHECKSUM" ]] then echo "That's a match" else echo "No match, fail" exit 1 fi ```

marghe-molaro commented 3 months ago

Hi @willGraham01, thank you so much for this!

The idea that this would be related to an issue with the ordering of HSIs in the queue makes sense because the difference in population-wide health outputs (see top of this PR) is most visible in mode 2 - when being at the top of the queue "pays off" - rather than in mode 1.

What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see randomise_queue parameter) exactly to prevent such artificial effects from emerging. The question is then why does the randomisation fail to truly prevent such effects from occurring.

Clarification: these artificial effects would be prevented at a statistical level in a large population; in a small population they might well lead to differences in outcomes for the few individuals considered.

I will have a look and a think.

marghe-molaro commented 3 months ago

Sorry @willGraham01, is the "long version" example you are describing running in mode 1 or mode 2?

willGraham01 commented 3 months ago

What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see randomise_queue parameter) exactly to prevent such artificial effects from emerging. The question is then why does the randomisation fail to truly prevent such effects from occurring.

This does seem odd, but I think this it could still be consistent with what we're seeing here? Swapping the order CMD and Malaria are added to the queue -> randomise_queue now assigns Malaria a higher "equal" priority than the HIV event -> starts the divergence. Since all the RNG is seeded separately we'd be doing:

"Good code":

Schedule HIV for person 466 (randomise_queue gives priority 2, for sake or argument)
Schedule Malaria for person 865 (randomise_queue gives priority 3)
Schedule CMD for person 865 (randomise_queue gives priority 1)
Resulting order: CMD, HIV, Malaria which doesn't run due to resource constraints.

With the same RNG,

"Refactored code":

Schedule HIV for person 466 (priority 2)
Schedule CMD for person 865 (priority 3)
Schedule Malaria for person 865 (priority 1)
Resulting order: Malaria, HIV which doesn't run due to resource constraints, CMD which runs because it doesn't need the same resources

I don't know if this is what's happening, but its a possibility that's consistent with my understanding of randomise_queue. It only works of course because it's the same person (865) being assigned two subsequent events.

willGraham01 commented 3 months ago

Sorry @willGraham01, is the "long version" example you are describing running in mode 1 or mode 2?

Everything here is running mode 2

marghe-molaro commented 3 months ago

What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see randomise_queue parameter) exactly to prevent such artificial effects from emerging. The question is then why does the randomisation fail to truly prevent such effects from occurring.

This does seem odd, but I think this it could still be consistent with what we're seeing here? Swapping the order CMD and Malaria are added to the queue -> randomise_queue now assigns Malaria a higher "equal" priority than the HIV event -> starts the divergence.

Sorry, you're absolutely right, I've now clarified above that what I mean is that at a statistical level/for large populations this shouldn't make a difference - i.e. outcome of a specific/few individual(s) may be different (which is what you're seeing) but if considering large enough populations those differences should cancel out if the ordering of the queue for equal-priority events is truly random. Could this be an indication that we're not considering large enough populations ?

tbhallett commented 3 months ago

By the same token as @marghe-molaro says, I am wondering whether the tests being used for the git bisect is sufficiently capturing the unexpected behaviour that @marghe-molaro notes in her runs. Might it be useful to either (i) do a large run again with the changes made so far and see if we recover the original behaviour; (ii) write a super-discriminatory test (e.g. in a large population over a short-run, compare squeeze factors against a reference value). ...?

marghe-molaro commented 3 months ago

By the same token as @marghe-molaro says, I am wondering whether the tests being used for the git bisect is sufficiently capturing the unexpected behaviour that @marghe-molaro notes in her runs. Might it be useful to either (i) do a large run again with the changes made so far and see if we recover the original behaviour; (ii) write a super-discriminatory test (e.g. in a large population over a short-run, compare squeeze factors against a reference value). ...?

Just to point out @tbhallett that wrt (ii), squeeze factors will remain constant even as the HSI queue is changed, because squeeze factors are only meaningful in mode=1, where the refactoring of the queue doesn't lead to any differences anyway.

As for (i) I can merge the changes that @willGraham01 has just made and then resubmit a new batch of runs in mode 2 (with and without rescaling, for curiosity) to check how things look. @willGraham01, have these changes already been merged in master?

And @willGraham01, can I double check first whether in these changes you are now forcing the order of the modules in the generic_first_appts to be the same as it used to be originally?

willGraham01 commented 3 months ago

As for (i) I can merge the changes that @willGraham01 has just made and then resubmit a new batch of runs in mode 2 (with and without rescaling, for curiosity) to check how things look. @willGraham01, have these changes already been merged in master?

And @willGraham01, can I double check first whether in these changes you are now forcing the order of the modules in the generic_first_appts to be the same as it used to be originally?

The branch I'm working on is https://github.com/UCL/TLOmodel/pull/1366, it hasn't been merged into master yet. This commit forces the "old" order of the modules though, and everything seems to be consistent with 012ef6d (commit before #1292 was merged).

@marghe-molaro feel free to merge #1366 into an experimental branch, but it 100% shouldn't be merged into master yet :sweat_smile: I also haven't had a chance to work through the remaining commits in #1292 (to resolve the merge conflicts with master) and thus, you'll get a lot of merge conflicts too!

I can do this, this afternoon & tomorrow morning, and @marghe-molaro I can notify you when you should be able to merge #1366 into an experimental branch, and set off a batch job like you suggested to see how things look?

marghe-molaro commented 3 months ago

Perfect, sounds great @willGraham01! I'll be on standby for #1366.

willGraham01 commented 3 months ago

@marghe-molaro #1366 now has master merged in and passes the script we've been using here. You can see the (now explicit) ordering of the modules in the generic appointments code.

Hopefully you should be able to merge this into your own branch :crossed_fingers: and run some scenarios to see if we've found the cause.

marghe-molaro commented 3 months ago

Hi @willGraham01 @matt-graham @tbhallett,

The test run from #1366 has now completed. We seem indeed to recover the behaviour that we were observing at the start (compare red line here with "No growth" case in left plot at the start of this Issue), i.e. when rescaling mode 2 there is no dramatic increase in AIDS DALYs after switching modes .

Yearly_DALYs_AIDS_After_Fix

While it is reassuring that we recover the original results, if the true cause of the discrepancy was the order in which the modules are called during the first generic appointment it is concerning that this would be observed on 100,000 population sized runs. However @willGraham01 if I understood correctly there were some other minor fixes you brought in with #1366, so maybe/hopefully the blame, for some for now non-identified reason, was with those?

We could maybe resubmit a run after reshuffling the order in which modules are called in #1366 to check whether this really is the source of the discrepancy? If we confirmed this to be the case this artificial effect would require some quite careful consideration.

tbhallett commented 3 months ago

While it is reassuring that we recover the original results, if the true cause of the discrepancy was the order in which the modules are called during the first generic appointment it is concerning that this would be observed on 100,000 population sized runs. However @willGraham01 if I understood correctly there were some other minor fixes you brought in with #1366, so maybe/hopefully the blame, for some for now non-identified reason, was with those?

We could maybe resubmit a run after reshuffling the order in which modules are called in #1366 to check whether this really is the source of the discrepancy? If we confirmed this to be the case this artificial effect would require some quite careful consideration.

Strong support these sentiments and the desire to understand if/how the ordering of processing in the first appointments has such a big effect.

matt-graham commented 3 months ago

However @willGraham01 if I understood correctly there were some other minor fixes you brought in with https://github.com/UCL/TLOmodel/pull/1366, so maybe/hopefully the blame, for some for now non-identified reason, was with those?

Will's away this week, but as far as I'm aware other than the module order change in #1366, the other changes were mainly changes to ensure calls to random number generator happened in same order to give consistent results (which would hopefully also not have a major affect on distribution of results with a large population). From quickly scanning the diff, the one change I can see that might affect model behaviour is a change to a conditional statement in the diarrhoea module to add an additional clause (have just tagged you both in a comment on this line as couldn't see an easy way to link from here).

marghe-molaro commented 3 months ago

Thank you @matt-graham! I was hoping you might be able to help me clarify something while @willGraham01 is away: in one of his previous replies he said that:

The failure cannot be because one of the modules in the main loop is editing any shared dataframe properties, since lines 164-180 occur after the dataframe update from the main loop is applied, but still use the "cached" values that the main loop utilised.

My understanding though is that prior @willGraham01's revision, modules were not only using shared dataframe properties during _do_on_generic_first_appt, but also modifying them, hence the need for this loop to exist at the end of the function:

        if proposed_patient_details_updates:
            df.loc[
                self.target, proposed_patient_details_updates.keys()
            ] = proposed_patient_details_updates.values()

Moving to the use of "cached" values would therefore necessarily lead to differences, as modules would now be accessing the individual's original parameters, rather than those updated by the modules called before inside _do_on_generic_first_appt. Am I missing something?

tbhallett commented 3 months ago

It may be a moot point, but I also noticed that this section, will lead to the updating not working as expected if more than module is trying to update the same property. (The updates from the last module will "win".)

https://github.com/UCL/TLOmodel/blob/6494277376ad3c66d27ddb547a6a61ac3f3012d1/src/tlo/methods/hsi_generic_first_appts.py#L93-L97

tamuri commented 3 months ago

As was mentioned in the catch-up meeting, this is now a blocker on several analyses. I'm at a workshop today, but will have time tomorrow to dig into it. I think if the situation looks difficult to fix quickly we should rollback the refactoring changes so epi team can continue on with their work.

marghe-molaro commented 3 months ago

Hi @tamuri, do you have any suggestions as to how we should proceed? If we do roll back is that going to come in through master? Many thanks in advance!

UCL / TLOmodel