UCL / TLOmodel

Epidemiology modelling framework for the Thanzi la Onse project
https://www.tlomodel.org/
MIT License

Saving to file simulations in a suspended state and resuming #86

Closed tbhallett closed 1 month ago

tbhallett commented 4 years ago

We have a common use case as follows:

It would seem that a solution to this would be to be able to save the simulation to a file at a certain point, then load the file and resume the simulation under the same or different parametric conditions.

I thought this might be relatively straightforward using pickle (i.e. pickle the sim, which contains sim.population.props, the event_queue, and all the modules and their internal contents). Then unpickle the sim, manipulate any parameters in the modules, and restart the sim using sim.simulate(end_date=end_of_part_two_date) (see script below).

However, I tried this and the unpickling failed with a RecursionError. Stack Overflow suggested this is a common error when pickling complex classes and recommended increasing the recursion limit, but this led to the console crashing for me.

Do you have any thoughts on this?

Short-term:

Medium-term:

from pathlib import Path

from tlo import Date, Simulation
from tlo.methods import contraception, demography

outputpath = Path("./outputs")
resourcefilepath = Path("./resources")

start_date = Date(2010, 1, 1)
end_date_part_one = Date(2011, 1, 2)
popsize = 1000

sim = Simulation(start_date=start_date)
sim.register(demography.Demography(resourcefilepath=resourcefilepath))
sim.register(contraception.Contraception(resourcefilepath=resourcefilepath))
sim.seed_rngs(1)
sim.make_initial_population(n=popsize)
sim.simulate(end_date=end_date_part_one)

import pickle

with open(outputpath / 'pickled_basic_object', 'wb') as f:
    pickle.dump({'1': 1, '2': 2}, f)

with open(outputpath / 'pickled_sim', 'wb') as f:
    pickle.dump(sim, f)

with open(outputpath / 'pickled_event_queue', 'wb') as f:
    pickle.dump(sim.event_queue, f)

with open(outputpath / 'pickled_basic_object', 'rb') as f:
    x = pickle.load(f)

with open(outputpath / 'pickled_sim', 'rb') as f:
    x = pickle.load(f)   # fails

with open(outputpath / 'pickled_event_queue', 'rb') as f:
    x = pickle.load(f)   # fails

# # Increasing recursion limits -- didn't help!
# # https://stackoverflow.com/questions/3323001/what-is-the-maximum-recursion-depth-in-python-and-how-to-increase-it
# import sys
# sys.getrecursionlimit()
# sys.setrecursionlimit(90000)
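
For illustration, here is a minimal sketch of the intended save, modify and resume workflow on a toy object. `ToySim` and its fields (`growth_rate`, `population`, `schedule`) are hypothetical stand-ins, not part of tlo; the real `Simulation` object is far more complex, which is where the failures above come from.

```python
import heapq
import pickle
import random
from dataclasses import dataclass, field


@dataclass
class ToySim:
    """Hypothetical stand-in for a simulation: a clock, an event queue and an RNG."""

    day: int = 0
    growth_rate: float = 0.1  # a 'parameter' that may be changed on resume
    population: float = 1000.0
    rng: random.Random = field(default_factory=lambda: random.Random(1))
    event_queue: list = field(default_factory=list)  # heap of (day, label)

    def schedule(self, day, label):
        heapq.heappush(self.event_queue, (day, label))

    def simulate(self, end_day):
        """Advance one day at a time until end_day, discarding events as they fall due."""
        while self.day < end_day:
            self.day += 1
            while self.event_queue and self.event_queue[0][0] <= self.day:
                heapq.heappop(self.event_queue)
            noise = self.rng.uniform(-0.01, 0.01)
            self.population *= 1 + (self.growth_rate + noise) / 365


sim = ToySim()
sim.schedule(400, "birth")
sim.simulate(end_day=365)        # part one
checkpoint = pickle.dumps(sim)   # pickle.dump(sim, f) would write to a file instead

resumed = pickle.loads(checkpoint)
resumed.growth_rate = 0.05       # change a parameter before part two
resumed.simulate(end_day=730)
print(sim.day, resumed.day, resumed.event_queue)
```

The restored object carries its own copy of the clock, event queue and RNG state, so several differently parameterised continuations can be started from one checkpoint.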

tamuri commented 4 years ago

Sadly, this is a very thorny problem without any straightforward solutions. Out-of-the-box pickle (or any of the other libraries) is ~not going~ unlikely to work because we do have circular references.

If you want to be able to do something today, it's going to be very laborious, but you can set up a VM and keep rerunning from a given snapshot.

We'll give it some thought.

tbhallett commented 4 years ago

Thanks very much for this.

Ok, I don't think it's soooo urgent that we need to do something today with a VM... and my guess is that this would be more cumbersome and painful than just waiting for the simulation to repeat itself many times. [@ihawryluk - what do you think?]

ihawryluk commented 4 years ago

Absolutely don't need anything done today, and it's OK, I can repeat the simulation; worst case I'll just test fewer scenarios, but that's fine.

tamuri commented 4 years ago

~Sadly, this is a very thorny problem without any straightforward solutions~

Having said that, I think I found the source of the recursion in our code and fixed it. Need to test it more (and on bigger simulations) but...fingers crossed!

import pickle

def test_pickling(obj):
    filename = '/Users/tamuri/Desktop/testpick.pk'
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)
    with open(filename, 'rb') as f:
        return pickle.load(f)

restored = test_pickling(sim)
>>> print(id(sim.population.props), len(sim.population.props))
4390187472 1014

>>> print(id(restored), len(restored.population.props))
4359269904 1014

>>> print(sim.population.props.equals(restored.population.props))
True

>>> print(len(sim.event_queue), len(restored.event_queue))
46 46

>>> print(sim.event_queue.next_event(), restored.event_queue.next_event())
(<tlo.methods.contraception.DelayedBirthEvent object at 0x1073715c0>, Timestamp('2011-01-04 01:20:26.263715'))
(<tlo.methods.contraception.DelayedBirthEvent object at 0x105ab8438>, Timestamp('2011-01-04 01:20:26.263715'))

>>> print(sim.date, restored.date)
2011-01-02 00:00:00 2011-01-02 00:00:00

tbhallett commented 4 years ago

Wow. That would be fantastic!

matt-graham commented 1 year ago

A possible alternative, which avoids the need to make the simulation picklable, may be to use os.fork to fork a new child process when an intervention is being applied. os.fork is only available on Unix platforms, but it relies directly on a system call in which a process creates a copy of itself, and so avoids needing to serialize the current state of the process using pickle. While it has fewer requirements in terms of picklability, it is less flexible: it wouldn't allow saving a checkpoint to file and continuing as described above. Also, as both processes would be immediately active, a key issue would be avoiding problems when accessing shared external resources such as file handles. For example, it would probably be necessary to flush any pending writes to the log file in the parent process, close the file, and then re-open new log files in each of the child processes (and have some utilities for combining them).
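
A rough sketch of the forking idea, Unix only and with a plain dict standing in for the simulation state. The function name `run_branches`, the scenario names and the per-scenario log layout are assumptions for illustration, not an existing tlo utility.

```python
import os
import sys
import tempfile
from pathlib import Path


def run_branches(state, scenarios, outdir):
    """Fork one child per scenario; each child continues from its own copy of `state`.

    Returns the per-scenario log files written by the children.
    """
    outdir = Path(outdir)
    # Flush buffered output in the parent so children do not write it again.
    sys.stdout.flush()
    sys.stderr.flush()
    pids, logs = [], []
    for name, multiplier in scenarios.items():
        log_path = outdir / f"{name}.log"
        logs.append(log_path)
        pid = os.fork()
        if pid == 0:
            # Child process: works on a copy of `state` and opens its own log file.
            status = 1
            try:
                with open(log_path, "w") as log:
                    log.write(f"{name} {state['population'] * multiplier}\n")
                status = 0
            finally:
                os._exit(status)  # leave without running the parent's cleanup code
        pids.append(pid)
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0):
            raise RuntimeError(f"child {pid} failed")
    return logs


state = {"population": 1000}
with tempfile.TemporaryDirectory() as tmp:
    logs = run_branches(state, {"baseline": 1.0, "intervention": 0.5}, tmp)
    results = [path.read_text().strip() for path in logs]
print(results)
```

Each child writes to a separate file, so combining results afterwards is a matter of reading the per-scenario logs; nothing is shared between children after the fork except what was already open in the parent.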

tamuri commented 1 year ago

Revisiting this (at least the most obvious solution - pickling). Out-of-the-box, pickling doesn't work. However, dill seems to do the right thing. Need to do plenty more checks, but an avenue to explore.

A small, one month, 25k pop sim:

from pathlib import Path
import pandas as pd
from tlo import Date, Simulation, logging
from tlo.analysis.utils import parse_log_file
from tlo.methods.fullmodel import fullmodel
from tlo.util import hash_dataframe

start_date = Date(2010, 1, 1)
end_date = start_date + pd.DateOffset(years=0, months=1)
resourcefilepath = Path("./resources")
sim = Simulation(start_date=start_date, seed=1)
sim.register(
    *fullmodel(
        resourcefilepath=resourcefilepath,
        use_simplified_births=False,
        module_kwargs={
            "HealthSystem": {
                "disable": True,
                "mode_appt_constraints": 2,
                "capabilities_coefficient": None,
                "hsi_event_count_log_period": None
            },
            "SymptomManager": {"spurious_symptoms": False},
        }
    )
)

sim.make_initial_population(n=25000)
sim.simulate(end_date=end_date)

Pickling it errors:

import pickle
pickle.dump(sim, open('pickle-sim.pkl', 'wb'))
# ---------------------------------------------------------------------------
# AttributeError                            Traceback (most recent call last)
# Input In [5], in <cell line: 1>()
# ----> 1 pickle.dump(sim, open('pickle-sim.pkl', 'wb'))
# 
# AttributeError: Can't pickle local object 'Models.make_lm_prob_becomes_stunted.<locals>.<lambda>'

"Dilling" it works:

import dill
dill.dump(sim, open('dill-sim.pkl', 'wb'))
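
The pickle failure above can be reproduced with the standard library alone. pickle serializes functions by reference (module plus qualified name), so a lambda created inside another function cannot be looked up again at load time. The names `make_model` and `threshold` below are illustrative only; depending on the Python version the exception is an AttributeError or a pickle.PicklingError.

```python
import pickle


def make_model(threshold):
    # A function defined inside another function has no importable name,
    # so pickle refuses to serialize it.
    return lambda x: x > threshold


model = make_model(0.5)
try:
    pickle.dumps(model)
    picklable = True
except (AttributeError, pickle.PicklingError) as exc:
    picklable = False
    print(type(exc).__name__, exc)
```

dill extends pickle to serialize such functions by value, which is consistent with it succeeding here where plain pickle does not.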

Look at some key data structures:

In [16]: print(hash_dataframe(sim.population.props))
    ...: print(len(sim.event_queue.queue))
    ...: print(sim.event_queue.queue[0])
    ...: print(len(sim.modules['PregnancySupervisor'].mother_and_newborn_info))
    ...: (k, v), *_ = sim.modules['PregnancySupervisor'].mother_and_newborn_info.items()
    ...: print(k)

a1407ae5383681e54240b7c52e381f5b625c84e8
8498
(Timestamp('2010-02-01 00:00:00'), <Priority.FIRST_HALF_OF_DAY: 25>, 132, <tlo.methods.hiv.Hiv_DecisionToContinueTreatment object at 0x7fad96489b50>)
120
56

In a new Python session:

In [12]: import dill
    ...: sim = dill.load(open('dill-sim.pkl', 'rb'))

In [13]: print(hash_dataframe(sim.population.props))
    ...: print(len(sim.event_queue.queue))
    ...: print(sim.event_queue.queue[0])
    ...: print(len(sim.modules['PregnancySupervisor'].mother_and_newborn_info))
    ...: (k, v), *_ = sim.modules['PregnancySupervisor'].mother_and_newborn_info.items()
    ...: print(k)

a1407ae5383681e54240b7c52e381f5b625c84e8
8498
(Timestamp('2010-02-01 00:00:00'), <Priority.FIRST_HALF_OF_DAY: 25>, 132, <tlo.methods.hiv.Hiv_DecisionToContinueTreatment object at 0x7fa003eab0a0>)
120
56

tamuri commented 1 year ago

Adding my notes from the programming meeting yesterday.

A challenging bit, in my opinion, was how to trigger/apply the intervention in, say, 2023.

First step is to check whether using pickle/dill to save the state works reliably. Suggestion to do some quick tests: run a full simulation, checkpoint in the middle. Use the checkpoint in a new run to see if we get the same result.
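
The suggested check can be sketched on a toy process as follows. This assumes the random number generator is saved together with the rest of the state; `advance` is a hypothetical stand-in for running the simulation, and the step counts are arbitrary.

```python
import pickle
import random


def advance(state, rng, steps):
    """Toy 'simulation': append `steps` random draws to the state list."""
    for _ in range(steps):
        state.append(rng.random())
    return state


# Reference: one uninterrupted run of 100 steps.
full = advance([], random.Random(1), 100)

# Checkpointed: run 40 steps, then save the state and the RNG together.
rng = random.Random(1)
state = advance([], rng, 40)
checkpoint = pickle.dumps((state, rng))

# "New run": restore from the checkpoint and do the remaining 60 steps.
restored_state, restored_rng = pickle.loads(checkpoint)
resumed = advance(restored_state, restored_rng, 60)

print(resumed == full)
```

If the comparison fails for the real simulation, some piece of state (RNG streams, event queue ordering, module-internal caches) is not surviving the round trip.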

marghe-molaro commented 1 year ago

Hi @tamuri, when do you think point 2 ("change existing scenario code so each numbered run within different draws has the same simulation seed") could be implemented? This would benefit us right away, without even getting to the checkpointing part.
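
One way to picture the seeding rule being asked for: the seed depends on the run number only, never on the draw. The function `simulation_seed` and the base seed value are hypothetical and do not reflect how the scenario code actually derives seeds.

```python
def simulation_seed(base_seed, run_number):
    """Hypothetical rule: the seed is a function of the run number, not the draw."""
    return base_seed + run_number


# Three draws with two runs each: run r gets the same seed in every draw.
seeds = {
    (draw, run): simulation_seed(12, run)
    for draw in range(3)
    for run in range(2)
}
print(seeds)
```

With seeds shared in this way, run r of every draw follows the same trajectory up to the point where the draws' parameters first take effect.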

tamuri commented 1 year ago

Should be reasonably quick - I'll try to get it in today.

tamuri commented 11 months ago

Thinking about how this would work in light of Matt's work on #1227

User story: As an epidemiologist using TLOmodel, I want to run simulations testing a number of interventions without having to repeatedly run the first part of the simulation where there are no interventions, to reduce costs.

Steps:

  1. Run scenario A, a scenario without interventions, with the desired end date and a "run to" option (is there a better name?) at which it exits and writes the saved state of the simulation. One draw, n runs. It is submitted to Batch, gets a job ID, and results are written to the usual place.

  2. Run scenario B, a scenario with interventions, with an end date and a "restore from" option that takes the job ID from above. Multiple draws, same n runs. When the simulation is restored, the parameters are taken from the scenario's draw_parameters() method, and a [new] method, say, set_parameter(name, value), is called in the relevant module of the restored simulation. The base Module class implements set_parameter() to simply overwrite the key in self.PARAMETERS. Subclasses can override it to handle the change in some other way (e.g. refresh a linear model). Once done, the simulation can continue running. Batch gives the run a job ID.

  3. Once the runs are complete, the results can be aggregated by collecting the log files from the job running scenario A (2010-2023) and the job running scenario B (say, 2023-2040).
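
The set_parameter() idea in step 2 might look roughly like this. The `Module` class here is a toy stand-in rather than tlo's actual base class, and `Stunting`, `base_prob`, `prob_by_age` and `apply_draw_parameters` are hypothetical names used only for illustration.

```python
class Module:
    """Toy stand-in for the base module class (not tlo's actual Module)."""

    def __init__(self, **parameters):
        self.PARAMETERS = dict(parameters)

    def set_parameter(self, name, value):
        # Default behaviour described in step 2: overwrite the stored value.
        self.PARAMETERS[name] = value


class Stunting(Module):
    """Hypothetical subclass that caches values derived from its parameters."""

    def __init__(self, **parameters):
        super().__init__(**parameters)
        self._refresh_model()

    def _refresh_model(self):
        base = self.PARAMETERS["base_prob"]
        self.prob_by_age = {age: min(1.0, base * (1 + age)) for age in range(5)}

    def set_parameter(self, name, value):
        super().set_parameter(name, value)
        self._refresh_model()  # derived values must track the new parameter


def apply_draw_parameters(modules, draw_parameters):
    """Apply {module_name: {parameter_name: value}} to the restored modules."""
    for module_name, overrides in draw_parameters.items():
        for name, value in overrides.items():
            modules[module_name].set_parameter(name, value)


modules = {"Stunting": Stunting(base_prob=0.1), "Demography": Module(max_age=120)}
apply_draw_parameters(modules, {"Stunting": {"base_prob": 0.2}})
print(modules["Stunting"].prob_by_age[1])
```

Routing every change through set_parameter() gives each module a single place to rebuild anything it computed from the old value before the restored simulation continues.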