Saving to file simulations in a suspended state and resuming

tbhallett commented 4 years ago

We have a common use case as follows:

We want to run a simulation up to a certain point (e.g. to before some policy changes)
Then we want to run the simulation from that point under multiple sets of assumptions ('forward projections')
The state from which the simulation starts should be the same in each 'forward projection'
This could be accomplished by controlling the random seed so that the first part of the simulation is identical each time, but it's wasteful to repeat the first part of the simulation verbatim.

It would seem that a solution to this would be to be able to save the simulation at a certain point to a file, then load the file and resume the simulation under the same or different parametric conditions.

I thought this might be relatively straightforward using pickle (i.e. pickle the sim, which contains sim.population.props, the event_queue, and all the modules and their internal contents). Then unpickle the sim, manipulate any parameters in the modules, and restart the sim using sim.simulate(end_date=end_of_part_two_date) (see script below).

However, I tried this and the unpickling failed with a RecursionError. Stack Overflow suggested this is a common error when pickling complex classes and recommended increasing the recursion limit, but this led to the console crashing for me.
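
For context, one common way to hit a RecursionError specifically on unpickling is a class whose `__getattr__` reads an instance attribute. pickle rebuilds the object without calling `__init__`, so that attribute does not exist yet when pickle probes for `__setstate__`, and the lookup recurses into itself. Whether this is what happens in the simulation is an assumption, not something established here; the `Fragile` and `Robust` classes below are hypothetical and are not part of tlo.

```python
import pickle


class Fragile:
    """Hypothetical wrapper that forwards unknown attributes to a dict."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called when normal lookup fails. During unpickling `_data`
        # is not set yet, so `self._data` calls __getattr__ again, forever.
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


class Robust(Fragile):
    """Same behaviour, but never triggers __getattr__ from inside itself."""

    def __getattr__(self, name):
        # Reading __dict__ directly avoids re-entering __getattr__.
        data = self.__dict__.get("_data", {})
        try:
            return data[name]
        except KeyError:
            raise AttributeError(name) from None


def unpickle_raises_recursion_error(obj):
    """Return True if a pickle round trip of obj fails with RecursionError."""
    blob = pickle.dumps(obj)
    try:
        pickle.loads(blob)
    except RecursionError:
        return True
    return False
```

If the cause is of this kind, raising the recursion limit cannot help, because the recursion never terminates; the fix has to be in the class itself.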

Do you have any thoughts on this?

Short-term:

Having a basic form of this functionality (even if hacky) would really help Iwona's MRes project. Any ideas would be very welcome!?

Medium-term:

Will this be a part (or could it be?) of the run management system?

from pathlib import Path

from tlo import Date, Simulation
from tlo.methods import contraception, demography

outputpath = Path("./outputs")
resourcefilepath = Path("./resources")

start_date = Date(2010, 1, 1)
end_date_part_one = Date(2011, 1, 2)
popsize = 1000

sim = Simulation(start_date=start_date)
sim.register(demography.Demography(resourcefilepath=resourcefilepath))
sim.register(contraception.Contraception(resourcefilepath=resourcefilepath))
sim.seed_rngs(1)
sim.make_initial_population(n=popsize)
sim.simulate(end_date=end_date_part_one)

import pickle

with open(outputpath / 'pickled_basic_object', 'wb') as f:
    pickle.dump({'1': 1, '2': 2}, f)

with open(outputpath / 'pickled_sim', 'wb') as f:
    pickle.dump(sim, f)

with open(outputpath / 'pickled_event_queue', 'wb') as f:
    pickle.dump(sim.event_queue, f)

with open(outputpath / 'pickled_basic_object', 'rb') as f:
    x = pickle.load(f)

with open(outputpath / 'pickled_sim', 'rb') as f:
    x = pickle.load(f)   # fails

with open(outputpath / 'pickled_event_queue', 'rb') as f:
    x = pickle.load(f)   # fails

# # Increasing recursion limits -- didn't help!
# # https://stackoverflow.com/questions/3323001/what-is-the-maximum-recursion-depth-in-python-and-how-to-increase-it
# import sys
# sys.getrecursionlimit()
# sys.setrecursionlimit(90000)

tamuri commented 4 years ago

Sadly, this is a very thorny problem without any straightforward solutions. Out-of-the-box pickle (or any of the other libraries) is ~not going~ unlikely to work because we do have circular references.

If you want to be able to do something today, it's going to be very laborious, but you can set up a VM and keep rerunning from a given snapshot.

We'll give it some thought.

tbhallett commented 4 years ago

Thanks very much for this.

Ok, I don't think it's soooo urgent that we need to do something today with a VM... and my guess is that this would be more cumbersome and painful than just waiting for the simulation to repeat itself many times. [@ihawryluk - what do you think?]

ihawryluk commented 4 years ago

Absolutely don't need anything done today, and it's OK, I can repeat the simulation; worst case I'll just test fewer scenarios, but that's fine.

tamuri commented 4 years ago

~Sadly, this is a very thorny problem without any straightforward solutions~

Having said that, I think I found the source of the recursion in our code and fixed it. Need to test it more (and on bigger simulations) but...fingers crossed!

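import pickle

def test_pickling(obj):
    """Round-trip obj through a pickle file and return the restored copy."""
    filename = '/Users/tamuri/Desktop/testpick.pk'
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)
    with open(filename, 'rb') as f:
        return pickle.load(f)

restored = test_pickling(sim)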

>>> print(id(sim.population.props), len(sim.population.props))
4390187472 1014

>>> print(id(restored), len(restored.population.props))
4359269904 1014

>>> print(sim.population.props.equals(restored.population.props))
True

>>> print(len(sim.event_queue), len(restored.event_queue))
46 46

>>> print(sim.event_queue.next_event(), restored.event_queue.next_event())
(<tlo.methods.contraception.DelayedBirthEvent object at 0x1073715c0>, Timestamp('2011-01-04 01:20:26.263715'))
(<tlo.methods.contraception.DelayedBirthEvent object at 0x105ab8438>, Timestamp('2011-01-04 01:20:26.263715'))

>>> print(sim.date, restored.date)
2011-01-02 00:00:00 2011-01-02 00:00:00

tbhallett commented 4 years ago

Wow. That would be fantastic!

matt-graham commented 1 year ago

A possible alternative, which avoids the need to make the simulation picklable, may be to use os.fork to fork a new child process when an intervention is being applied. os.fork is only available on Unix platforms, but it relies directly on a system call in which a process creates a copy of itself, and so avoids needing to serialize the current state of the process using pickle. While it has fewer requirements in terms of picklability, it is less flexible, as it wouldn't allow saving a checkpoint to file and continuing as described above. As both processes would be immediately active, a key issue would be avoiding problems when accessing shared external resources such as file handles. For example, it would probably be necessary to flush any pending writes to the log file in the parent process, close the file, and then open new log files in each of the child processes (and have some utilities for combining them).
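
A minimal sketch of the forking idea, assuming a Unix platform and using a plain integer in place of the simulation state. The function names, the scenario dictionary and the log file names are all hypothetical; nothing here is tlo API.

```python
import os
import tempfile
from pathlib import Path


def run_branches(population, scenarios, out_dir):
    """Fork one child per scenario; each continues from the same in-memory state."""
    out_dir = Path(out_dir)

    # Write, flush and close the shared log BEFORE forking, so that no
    # buffered output is duplicated into the children.
    with open(out_dir / "shared.log", "w") as log:
        log.write(f"population at fork: {population}\n")

    pids = {}
    for name, multiplier in scenarios.items():
        pid = os.fork()
        if pid == 0:
            # Child process: it has its own copy of `population`.
            status = 1
            try:
                # Each child opens its own, separate log file.
                with open(out_dir / f"{name}.log", "w") as log:
                    log.write(f"{population * multiplier}\n")
                status = 0
            finally:
                # Exit immediately so the child never runs the parent's code.
                os._exit(status)
        pids[name] = pid

    # Parent process: wait for every child and check that it succeeded.
    for name, pid in pids.items():
        _, status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            raise RuntimeError(f"scenario {name!r} failed")

    # Combine the per-child logs into one result.
    return {name: int((out_dir / f"{name}.log").read_text()) for name in scenarios}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        return run_branches(1000, {"baseline": 1, "intervention": 2}, tmp)
```

The children cannot hand Python objects back to the parent directly, which is why the sketch combines results through files.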

tamuri commented 1 year ago

Revisiting this (at least the most obvious solution - pickling). Out-of-the-box, pickling doesn't work. However, dill seems to do the right thing. Need to do plenty more checks, but an avenue to explore.

A small, one month, 25k pop sim:

from pathlib import Path
import pandas as pd
from tlo import Date, Simulation, logging
from tlo.analysis.utils import parse_log_file
from tlo.methods.fullmodel import fullmodel
from tlo.util import hash_dataframe

start_date = Date(2010, 1, 1)
end_date = start_date + pd.DateOffset(years=0, months=1)
resourcefilepath = Path("./resources")
sim=Simulation(start_date=start_date, seed=1)
sim.register(
    *fullmodel(
        resourcefilepath=resourcefilepath,
        use_simplified_births=False,
        module_kwargs={
            "HealthSystem": {
                "disable": True,
                "mode_appt_constraints": 2,
                "capabilities_coefficient": None,
                "hsi_event_count_log_period": None
            },
            "SymptomManager": {"spurious_symptoms": False},
        }
    )
)

sim.make_initial_population(n=25000)
sim.simulate(end_date=end_date)

Pickling it errors:

import pickle
pickle.dump(sim, open('pickle-sim.pkl', 'wb'))
# ---------------------------------------------------------------------------
# AttributeError                            Traceback (most recent call last)
# Input In [5], in <cell line: 1>()
# ----> 1 pickle.dump(sim, open('pickle-sim.pkl', 'wb'))
# 
# AttributeError: Can't pickle local object 'Models.make_lm_prob_becomes_stunted.<locals>.<lambda>'

"Dilling" it works:

import dill
dill.dump(sim, open('dill-sim.pkl', 'wb'))

Look at some key data structures:

In [16]: print(hash_dataframe(sim.population.props))
    ...: print(len(sim.event_queue.queue))
    ...: print(sim.event_queue.queue[0])
    ...: print(len(sim.modules['PregnancySupervisor'].mother_and_newborn_info))
    ...: (k, v), *_ = sim.modules['PregnancySupervisor'].mother_and_newborn_info.items()
    ...: print(k)

a1407ae5383681e54240b7c52e381f5b625c84e8
8498
(Timestamp('2010-02-01 00:00:00'), <Priority.FIRST_HALF_OF_DAY: 25>, 132, <tlo.methods.hiv.Hiv_DecisionToContinueTreatment object at 0x7fad96489b50>)
120
56

In a new Python session:

In [12]: import dill
    ...: sim = dill.load(open('dill-sim.pkl', 'rb'))

In [13]: print(hash_dataframe(sim.population.props))
    ...: print(len(sim.event_queue.queue))
    ...: print(sim.event_queue.queue[0])
    ...: print(len(sim.modules['PregnancySupervisor'].mother_and_newborn_info))
    ...: (k, v), *_ = sim.modules['PregnancySupervisor'].mother_and_newborn_info.items()
    ...: print(k)

a1407ae5383681e54240b7c52e381f5b625c84e8
8498
(Timestamp('2010-02-01 00:00:00'), <Priority.FIRST_HALF_OF_DAY: 25>, 132, <tlo.methods.hiv.Hiv_DecisionToContinueTreatment object at 0x7fa003eab0a0>)
120
56
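
The pickle error above can be reproduced without tlo. pickle stores functions by name, so a lambda created inside another function has no importable name and is rejected. The helper names below are hypothetical. The `functools.partial` variant is only a standard-library illustration of a picklable callable and is not how the model builds its linear models.

```python
import functools
import operator
import pickle


def make_model_with_lambda(coefficient):
    # The lambda is a local object: pickle cannot look it up by name.
    return lambda x: coefficient * x


def make_model_with_partial(coefficient):
    # A partial of a module-level function is picklable with plain pickle.
    return functools.partial(operator.mul, coefficient)


def can_pickle(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError):
        # The exception type for local objects varies between Python versions.
        return False
    return True
```

dill avoids the problem differently, by serializing the function itself rather than a reference to it.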

tamuri commented 1 year ago

Adding my notes from the programming meeting yesterday.

1. Save the state of the simulation at any point but, most importantly, snapshot run from 2010-2023.
   - Resume sim with different interventions from this point
2. Need to change existing scenario code so each numbered run within different draws has the same simulation seed
   - this is not what it does currently; every run has its own seed
3. How to ensure the saved state is still valid? i.e. some change is made to the model, can the saved state still be used
   - Not a concern for us; we'd rerun the initial period and checkpoint for every change to the model
     - Still worth doing because we may have 5 runs for 60 interventions.
   - We could save state for v1 of the model, and offer that to others
4. How to ensure that parameters for different draws do not invalidate the checkpoint?
   - Perhaps label certain parameters and only allow those to be used with checkpointing
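
The seed point in these notes can be sketched as two seeding schemes. Both functions and their formulas are assumptions made for illustration and are not taken from the scenario code.

```python
def seed_per_draw_and_run(base_seed, draw_number, run_number, runs_per_draw):
    """Every (draw, run) pair gets its own seed (hypothetical formula)."""
    return base_seed + draw_number * runs_per_draw + run_number


def seed_per_run(base_seed, draw_number, run_number):
    """The seed depends on the run number only, so run k starts from the
    same random state in every draw (hypothetical formula)."""
    return base_seed + run_number
```

With the second scheme, run 3 of draw 0 and run 3 of draw 7 share a seed, so they stay identical until the point where their parameters start to differ.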

A challenging bit, in my opinion, was how to trigger/apply the intervention in, say, 2023.

At the moment, if we want an intervention, we add an event when the sim is initialised and schedule it for 2023.
But this means the state of the model 2010-2023 will be different for each draw, because the intervention event in the event queue is different.
Even if we set a single parameter at initialisation indicating a different intervention, the state of the model 2010-2023 is different because the parameter is different.
This means interventions need to be applied in a different way:
- Possible solutions include an on_checkpoint_load() to override specific parameters only when the state is restored from a checkpoint file.
- However, this means that we have two separate mechanisms for testing interventions. One when running a single simulation run 2010-2050; another when running with a checkpoint.
  - How do we reconcile these?

First step is to check whether using pickle/dill to save the state works reliably. Suggestion to do some quick tests: run a full simulation, checkpoint in the middle. Use the checkpoint in a new run to see if we get the same result.
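
The shape of that quick test, sketched on a toy model rather than the real simulation. The state dictionary, the daily mortality step and all names are hypothetical. The point is only the comparison between an uninterrupted run and a run resumed from a pickled midpoint.

```python
import pickle
import random


def new_state(seed, popsize):
    """Toy simulation state: a day counter, an RNG and an alive flag per person."""
    return {"day": 0, "rng": random.Random(seed), "alive": [True] * popsize}


def run_until(state, end_day, death_prob=0.01):
    """Advance the toy model one day at a time until end_day."""
    while state["day"] < end_day:
        rng = state["rng"]
        # Each person still alive survives the day with probability 1 - death_prob.
        state["alive"] = [a and rng.random() >= death_prob for a in state["alive"]]
        state["day"] += 1
    return state


def checkpoint_matches_uninterrupted(seed=1, popsize=200, mid=30, end=60):
    """Compare a full run with a run that is checkpointed at `mid` and resumed."""
    full = run_until(new_state(seed, popsize), end)

    part = run_until(new_state(seed, popsize), mid)
    restored = pickle.loads(pickle.dumps(part))  # stands in for save to file and load
    resumed = run_until(restored, end)

    return (
        full["alive"] == resumed["alive"]
        and full["rng"].getstate() == resumed["rng"].getstate()
    )
```

The comparison includes the RNG state as well as the population, since a checkpoint that restores the data but not the random state would diverge later.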

marghe-molaro commented 1 year ago

Hi @tamuri, when do you think point 2 ("change existing scenario code so each numbered run within different draws has the same simulation seed") could be implemented? This would benefit us right away without even getting to the checkpointing part

tamuri commented 1 year ago

Should be reasonably quick - I'll try to get it in today.

tamuri commented 11 months ago

Thinking about how this would work in light of Matt's work on #1227

User story: As an epidemiologist using TLOmodel, I want to run simulations testing a number of interventions without having to repeatedly run the first part of the simulation where there are no interventions, to reduce costs.

Steps:

1. Run scenario A, a scenario without interventions, with the desired end date and a "run to" option (is there a better name?) that makes the simulation exit and write its saved state. One draw, n runs. It is submitted to Batch, gets a job ID, and results are written to the usual place.
2. Run scenario B, a scenario with interventions, with an end date and a "restore from" option taking the job ID from above. Multiple draws, same n runs. When the simulation is restored, the parameters are taken from the scenario's draw_parameters() method, and a [new] method, say set_parameter(name, value), is called on the relevant module in the restored simulation. The base Module class implements set_parameter() to simply overwrite the key in self.PARAMETERS. Subclasses can override it to handle the change in some other way (e.g. refresh a linear model). Once done, the simulation can continue running. Batch gives the run a job ID.
3. Once the runs are complete, the results can be aggregated by collecting the log files from the job running scenario A (2010-2023) and the job running scenario B (say, 2023-2040).
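
A rough sketch of the proposed set_parameter() mechanism, using toy classes that are not the real tlo Module. The `parameters` dictionary stands in for the parameter store mentioned above, and `StuntingModule`, `base_prob` and `apply_draw_parameters` are hypothetical names.

```python
class Module:
    """Toy base class: set_parameter simply overwrites the stored value."""

    def __init__(self, **parameters):
        self.parameters = dict(parameters)

    def set_parameter(self, name, value):
        if name not in self.parameters:
            raise KeyError(f"unknown parameter: {name}")
        self.parameters[name] = value


class StuntingModule(Module):
    """Toy subclass holding a model derived from its parameters."""

    def __init__(self, **parameters):
        super().__init__(**parameters)
        self._rebuild_model()

    def _rebuild_model(self):
        base = self.parameters["base_prob"]
        self.prob_model = lambda age: min(1.0, base / (1 + age))

    def set_parameter(self, name, value):
        # Overwrite the value, then refresh anything derived from it.
        super().set_parameter(name, value)
        self._rebuild_model()


def apply_draw_parameters(modules, draw_parameters):
    """Apply {module name: {parameter name: value}} to restored modules."""
    for module_name, params in draw_parameters.items():
        for name, value in params.items():
            modules[module_name].set_parameter(name, value)
```

Routing every change through set_parameter() gives a module the chance to rebuild derived objects, which a direct dictionary assignment on a restored simulation would skip.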

UCL / TLOmodel

Saving to file simulations in a suspended state and resuming #86