tbhallett closed 1 month ago
Sadly, this is a very thorny problem without any straightforward solutions. Out-of-the-box pickle (or any of the other libraries) is ~not going~ unlikely to work because we do have circular references.
If you want to be able to do something today, it's going to be very laborious, but you can set up a VM and keep rerunning from a given snapshot.
We'll give it some thought.
Thanks very much for this.
Ok, I don't think it's soooo urgent that we need to do something today with a VM... and my guess is that this would be more cumbersome and painful than just waiting for the simulation to repeat itself many times. [@ihawryluk - what do you think?]
Absolutely don't need anything done today, and it's OK, I can repeat the simulation. Worst case, I'll just test fewer scenarios, but that's fine.
~Sadly, this is a very thorny problem without any straightforward solutions~
Having said that, I think I found the source of the recursion in our code and fixed it. Need to test it more (and on bigger simulations) but...fingers crossed!
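import pickle

def test_pickling(obj):
    # Round-trip the object through a pickle file and return the restored copy
    filename = '/Users/tamuri/Desktop/testpick.pk'
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)
    with open(filename, 'rb') as f:
        return pickle.load(f)

restored = test_pickling(sim)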
>>> print(id(sim.population.props), len(sim.population.props))
4390187472 1014
>>> print(id(restored), len(restored.population.props))
4359269904 1014
>>> print(sim.population.props.equals(restored.population.props))
True
>>> print(len(sim.event_queue), len(restored.event_queue))
46 46
>>> print(sim.event_queue.next_event(), restored.event_queue.next_event())
(<tlo.methods.contraception.DelayedBirthEvent object at 0x1073715c0>, Timestamp('2011-01-04 01:20:26.263715'))
(<tlo.methods.contraception.DelayedBirthEvent object at 0x105ab8438>, Timestamp('2011-01-04 01:20:26.263715'))
>>> print(sim.date, restored.date)
2011-01-02 00:00:00 2011-01-02 00:00:00
Wow. That would be fantastic!
A possible alternative which avoids the need to make the simulation pickleable may be to use os.fork to fork a new child process when an intervention is being applied. os.fork is only available on Unix platforms, but relies directly on a system call where a process creates a copy of itself, and so avoids needing to serialize the current state of the process using pickle. While it has fewer requirements in terms of pickleability, it is conversely less flexible, as it wouldn't allow saving a checkpoint to file and continuing as described above. Also, as both processes would be immediately active, a key issue would be avoiding problems when accessing any shared external resources such as file handles. For example, it would probably be necessary to flush any pending writes to the log file in the parent process, close the file, and then re-open new log files in each of the child processes (and have some utilities for combining them).
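A minimal sketch of the forking idea, using a plain dict as a stand-in for the simulation state. The names fork_branches, state and branches are hypothetical and not part of TLOmodel; the sketch only illustrates that each child continues from its own copy of the parent's in-memory state and writes to its own log file. It assumes a Unix platform.

```python
import os
import tempfile
from pathlib import Path

def fork_branches(state, branches, outdir):
    """Fork one child per branch (Unix only); returns each branch's log text."""
    outdir = Path(outdir)
    pids = []
    for name, multiplier in branches.items():
        pid = os.fork()
        if pid == 0:
            # Child process: works on its own copy of `state`
            status = 1
            try:
                state["value"] *= multiplier  # stand-in for applying an intervention
                # Each child writes to its own log file rather than a shared handle
                (outdir / f"{name}.log").write_text(str(state["value"]))
                status = 0
            finally:
                os._exit(status)  # exit immediately, skipping the parent's cleanup
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)  # wait for every child to finish
    return {name: (outdir / f"{name}.log").read_text() for name in branches}

with tempfile.TemporaryDirectory() as tmp:
    state = {"value": 10}
    parent_log = Path(tmp) / "parent.log"
    parent_log.write_text("ran to checkpoint\n")  # written and closed before forking
    results = fork_branches(state, {"baseline": 1, "intervention": 3}, tmp)

print(results)
print(state["value"])  # the parent's copy is untouched by the children
```

Nothing is serialized here except the small per-branch log files, which is the appeal of the approach; the cost is that the branches exist only while the parent process is alive.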
Revisiting this (at least the most obvious solution - pickling). Out-of-the-box, pickling doesn't work. However, dill seems to do the right thing. Need to do plenty more checks, but an avenue to explore.
A small, one month, 25k pop sim:
from pathlib import Path

import pandas as pd

from tlo import Date, Simulation, logging
from tlo.analysis.utils import parse_log_file
from tlo.methods.fullmodel import fullmodel
from tlo.util import hash_dataframe

start_date = Date(2010, 1, 1)
end_date = start_date + pd.DateOffset(years=0, months=1)
resourcefilepath = Path("./resources")

sim = Simulation(start_date=start_date, seed=1)
sim.register(
    *fullmodel(
        resourcefilepath=resourcefilepath,
        use_simplified_births=False,
        module_kwargs={
            "HealthSystem": {
                "disable": True,
                "mode_appt_constraints": 2,
                "capabilities_coefficient": None,
                "hsi_event_count_log_period": None
            },
            "SymptomManager": {"spurious_symptoms": False},
        }
    )
)
sim.make_initial_population(n=25000)
sim.simulate(end_date=end_date)
Pickling it errors:
import pickle
pickle.dump(sim, open('pickle-sim.pkl', 'wb'))
# ---------------------------------------------------------------------------
# AttributeError Traceback (most recent call last)
# Input In [5], in <cell line: 1>()
# ----> 1 pickle.dump(sim, open('pickle-sim.pkl', 'wb'))
#
# AttributeError: Can't pickle local object 'Models.make_lm_prob_becomes_stunted.<locals>.<lambda>'
"Dilling" it works:
import dill
dill.dump(sim, open('dill-sim.pkl', 'wb'))
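For context on the AttributeError above: pickle stores functions by reference to an importable name, and a lambda created inside another function has no such name, whereas dill serializes the function itself. A small stdlib-only reproduction follows; make_linear_model and module_state are hypothetical names for illustration, not TLOmodel code. It also shows one possible workaround that stays within pickle, which is to save only plain data and rebuild the closure after loading.

```python
import pickle

def make_linear_model(coef):
    # Returns a lambda defined inside a function: a "local object" to pickle
    return lambda x: coef * x

# Hypothetical stand-in for a module holding a model built from a closure
module_state = {"coef": 2.5, "predict": make_linear_model(2.5)}

try:
    pickle.dumps(module_state)
    pickled_ok = True
except (AttributeError, pickle.PicklingError) as err:
    pickled_ok = False
    print(f"pickle failed: {err}")

# One workaround: pickle only the plain data and rebuild the closure on load
plain = {k: v for k, v in module_state.items() if k != "predict"}
restored = pickle.loads(pickle.dumps(plain))
restored["predict"] = make_linear_model(restored["coef"])
print(restored["predict"](4))
```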
Look at some key data structures:
In [16]: print(hash_dataframe(sim.population.props))
...: print(len(sim.event_queue.queue))
...: print(sim.event_queue.queue[0])
...: print(len(sim.modules['PregnancySupervisor'].mother_and_newborn_info))
...: (k, v), *_ = sim.modules['PregnancySupervisor'].mother_and_newborn_info.items()
...: print(k)
a1407ae5383681e54240b7c52e381f5b625c84e8
8498
(Timestamp('2010-02-01 00:00:00'), <Priority.FIRST_HALF_OF_DAY: 25>, 132, <tlo.methods.hiv.Hiv_DecisionToContinueTreatment object at 0x7fad96489b50>)
120
56
In a new Python session:
In [12]: import dill
...: sim = dill.load(open('dill-sim.pkl', 'rb'))
In [13]: print(hash_dataframe(sim.population.props))
...: print(len(sim.event_queue.queue))
...: print(sim.event_queue.queue[0])
...: print(len(sim.modules['PregnancySupervisor'].mother_and_newborn_info))
...: (k, v), *_ = sim.modules['PregnancySupervisor'].mother_and_newborn_info.items()
...: print(k)
a1407ae5383681e54240b7c52e381f5b625c84e8
8498
(Timestamp('2010-02-01 00:00:00'), <Priority.FIRST_HALF_OF_DAY: 25>, 132, <tlo.methods.hiv.Hiv_DecisionToContinueTreatment object at 0x7fa003eab0a0>)
120
56
Adding my notes from the programming meeting yesterday.
Save the state of the simulation at any point but, most importantly, snapshot run from 2010-2023.
Need to change existing scenario code so each numbered run within different draws has the same simulation seed
How to ensure the saved state is still valid? i.e. some change is made to the model, can the saved state still be used
How to ensure that parameters for different draws do not invalidate the checkpoint?
A challenging bit, in my opinion, was how to trigger/apply the intervention in, say, 2023. Possibly a hook such as on_checkpoint_load() to override specific parameters only when the state is restored from a checkpoint file.
First step is to check whether using pickle/dill to save the state works reliably. Suggestion to do some quick tests: run a full simulation, checkpoint in the middle. Use the checkpoint in a new run to see if we get the same result.
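The quick test suggested in these notes can be sketched on a toy model. This is only an illustration under stated assumptions: new_state and run are hypothetical, the "simulation" is a dict holding a seeded random.Random, and stdlib pickle is used as the serializer (dill exposes the same dumps/loads calls). The point is the shape of the check: an uninterrupted run and a checkpoint-then-resume run from the same seed should end in the same state.

```python
import pickle
import random

def new_state(seed):
    # Hypothetical minimal "simulation": an RNG, a day counter and a running total
    return {"rng": random.Random(seed), "day": 0, "total": 0.0}

def run(state, until_day):
    # Advance one day at a time, drawing from the state's own RNG
    while state["day"] < until_day:
        state["total"] += state["rng"].random()
        state["day"] += 1
    return state

# Reference: one uninterrupted run
reference = run(new_state(seed=1), until_day=100)

# Checkpointed: run half way, serialize, restore, then continue
half = run(new_state(seed=1), until_day=50)
checkpoint = pickle.dumps(half)
resumed = run(pickle.loads(checkpoint), until_day=100)

print(resumed["total"] == reference["total"])
```

The comparison only works because the RNG state travels with the checkpoint; any randomness held outside the saved state would make the two runs diverge.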
Hi @tamuri, when do you think point 2 ("change existing scenario code so each numbered run within different draws has the same simulation seed") could be implemented? This would benefit us right away without even getting to the checkpointing part
Should be reasonably quick - I'll try to get it in today.
Thinking about how this would work in light of Matt's work on #1227
User story: As an epidemiologist using TLOmodel, I want to run simulations testing a number of interventions without having to repeatedly run the first part of the simulation where there are no interventions, to reduce costs.
Steps:
Run scenario A, a scenario without interventions, with the desired end date and a "run to" option (is there a better name?) to exit and write the saved state of the simulation. One draw, n runs. Submitted to Batch, gets job ID, results are written to usual place.
Run scenario B, a scenario with interventions with end date and "restore from" option, taking the job ID from above. Multiple draws, same n runs. When the simulation is restored, the parameters are taken from the scenario's draw_parameters() method and a [new] method, say, set_parameter(name, value) in relevant module in the restored simulation is called. The base Module class implements set_parameter() to simply overwrite the key in self.PARAMETERS. Subclasses can override it to handle in some other way (e.g. refresh a linear model). Once done, the simulation can continue running. Batch gives the run a job ID.
Once the runs are complete, the results can be aggregated by collecting the log files from job running scenario A (2010-2023) and job running scenario B (say, 2023-2040).
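A rough sketch of how the proposed set_parameter() hook might look. Everything here is hypothetical: the Module and Stunting classes are stand-ins rather than TLOmodel's actual classes, parameter values are kept in a plain dict named parameters, and apply_draw assumes the draw's parameters arrive as a mapping of module name to {parameter name: value}.

```python
class Module:
    """Hypothetical stand-in for the base module class, not TLOmodel's own."""
    def __init__(self, **parameters):
        self.parameters = dict(parameters)

    def set_parameter(self, name, value):
        # Default behaviour: simply overwrite the stored value
        self.parameters[name] = value


class Stunting(Module):
    """Hypothetical subclass holding a derived object that must be refreshed."""
    def __init__(self, **parameters):
        super().__init__(**parameters)
        self._build_model()

    def _build_model(self):
        base = self.parameters["base_prob"]
        self.model = lambda risk: min(1.0, base * risk)

    def set_parameter(self, name, value):
        super().set_parameter(name, value)
        self._build_model()  # rebuild so the model sees the new value


def apply_draw(modules, draw_parameters):
    # Called once after restoring, before the simulation continues
    for module_name, params in draw_parameters.items():
        for name, value in params.items():
            modules[module_name].set_parameter(name, value)


modules = {"Stunting": Stunting(base_prob=0.1)}
apply_draw(modules, {"Stunting": {"base_prob": 0.2}})
print(modules["Stunting"].model(2.0))
```

Routing every override through one method gives subclasses a single place to refresh anything derived from a parameter, which a direct dictionary write would silently skip.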
We have a common use case as follows:
It would seem like a solution to this would be to be able to save the simulation at a certain point to a file, then load up the file and resume the simulation under the same or different parametric conditions.
I thought this might be relatively straightforward using pickle (i.e. pickle the sim, which contains the sim.population.props and the event_queue, and all the modules and their internal contents). Then, unpickle the sim, manipulate any parameters in the modules, and restart the sim using sim.simulate(end_date = end_of_part_two_date). (see script below)
However, I tried this and the unpickling failed with a RecursionError. Stack Overflow suggested this is a common error when pickling complex classes and suggested increasing the recursion limit, but this led to the console crashing for me.
Do you have any thoughts on this?
Short-term:
Medium-term: