BradGreig / Hybrid21CM


Memory Leak #29

Closed: steven-murray closed this issue 5 years ago

steven-murray commented 5 years ago

There seems to be a memory leak when using multiprocessing in the MCMC run.

Clearly, each iteration is run by a different process, which has to pickle/unpickle the entire LikelihoodComputationChain. However, it seems that the memory used in this procedure is not relinquished and builds up as more and more iterations are performed.
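
To illustrate the mechanism with a toy sketch (nothing 21CMMC-specific, names are mine): every task argument sent through multiprocessing is pickled in the parent and unpickled in the worker, so a chain-like object carrying large data gets duplicated on every call.

import multiprocessing as mp

def evaluate(task):
    chain, params = task  # the "chain" is unpickled afresh in the worker
    return sum(params)    # stand-in for the real likelihood evaluation

if __name__ == "__main__":
    chain = {"cached_boxes": bytes(8 * 2**20)}  # stand-in ~8 MB payload
    tasks = [(chain, [p]) for p in range(8)]
    with mp.Pool(2) as pool:
        print(pool.map(evaluate, tasks))  # each task re-pickles the payload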

This is particularly onerous if one saves a lot of data in the Core/Likelihood classes, and it effectively prohibits storing the init/perturb boxes in memory. One can "get around" it by not saving any large datasets in the class and instead reading them from disk on each iteration, but that is annoying and less efficient.

I'm not even sure where to start in tracking this down and fixing it.

BradGreig commented 5 years ago

Can you at least identify if the problem is on the C side or the Python side? Also, are you able to estimate the size of the leak?

steven-murray commented 5 years ago

It's to do with the Python pickling, so it's almost certainly on the Python side. The leak just keeps growing. Bella noticed it first with our custom extension, which saves the covariance matrix of the data in the class (it's reasonably large, several MB at least). Somehow, pickling/unpickling an object massively increases the amount of memory used, so essentially every iteration/walker would generate another ~100 MB. She was getting an OOM error, with memory passing 1 TB, after some number of iterations.

I will try to create an MWE.

BradGreig commented 5 years ago

A quick scan of the internet suggests that this is a known issue. One possible solution is updating numpy.

If that fails, is there a way to manually enforce garbage collection? It seems this was an issue in some cases.
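
Something along these lines might be worth a go (a sketch only; likelihood_with_gc is a hypothetical wrapper, and I haven't tried it inside the chain):

import gc

def likelihood_with_gc(chain, params):
    lnl = chain(params)  # evaluate the likelihood as usual
    gc.collect()         # then force a full collection of all generations
    return lnl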

steven-murray commented 5 years ago

Yeah, I'm not sure how to do that for the queues.py module in multiprocessing. I will keep looking around. Meanwhile, here is a reasonably small MWE:

from py21cmmc import mcmc
import numpy as np

# ======== USER ADJUSTABLE VARIABLES
MODEL_NAME = "power_only"
CONT = False
THREADS = 1
WALK_RATIO = 2
ITER = 1
BURN = 0
# ===================================

# Stuff to track memory usage.
import tracemalloc
tracemalloc.start()
snapshot = tracemalloc.take_snapshot()

def trace_print():
    # Print the ten largest allocation increases since the previous snapshot.
    global snapshot
    snapshot2 = tracemalloc.take_snapshot()
    snapshot2 = snapshot2.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
        tracemalloc.Filter(False, tracemalloc.__file__)
    ))

    if snapshot is not None:
        print("================================== Begin Trace:")
        top_stats = snapshot2.compare_to(snapshot, 'lineno', cumulative=True)
        for stat in top_stats[:10]:
            print(stat)
    snapshot = snapshot2

class MyPrinterCore(mcmc.CoreCoevalModule):
    def setup(self):
        super().setup()
        self.big_array = np.zeros(10000)  # extra data held on the core, to make the pickling overhead visible

    def build_model_data(self, ctx):
        trace_print()
        super().build_model_data(ctx)

core = MyPrinterCore(
    redshift=[9],
    user_params=dict(HII_DIM=50, BOX_LEN=125.0),
    flag_options=dict(USE_MASS_DEPENDENT_ZETA=False),
    do_spin_temp=False,
    z_step_factor=1.2,
    regenerate=True,   # ensure each run is started fresh
    initial_conditions_seed=1234  # ensure each run is exactly the same.
)

# Now the likelihood...
datafiles = ["data/power_mcmc_data_%s.npz" % z for z in core.redshift]
power_spec = mcmc.Likelihood1DPowerCoeval(
    datafile=datafiles,
    noisefile=None,
    logk=False, min_k=0.1, max_k=1.0,
    simulate=True
)

params = dict(
    HII_EFF_FACTOR=[30.0, 10.0, 50.0, 3.0],
    ION_Tvir_MIN=[4.7, 4, 6, 0.1])

chain = mcmc.run_mcmc(
    core, power_spec, datadir='data', model_name=MODEL_NAME,
    params=params,
    log_level_21CMMC='WARNING',
    walkersRatio=WALK_RATIO, burninIterations=BURN,
    sampleIterations=ITER, threadCount=THREADS, continue_sampling=CONT
)

Running it prints the largest memory additions on every iteration, and you can see the same amount being added at line 113 of queues.py each time. Increasing the amount of data stored in the Core increases the leak.

BradGreig commented 5 years ago

Yeah, I looked at the MWE and can see the issue. Reading some stuff here and one of the links within here, there are some possible explanations/solutions. They seem to imply that it's the referencing of the Python/numpy arrays being copied, and that multiprocessing arrays/shared arrays might be able to solve the issue.
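
Something like this is the shared-array idea (just a sketch; whether it can be threaded through the Core classes is untested): the big buffer is allocated once in shared memory before the workers fork, so they inherit a handle instead of receiving a pickled copy.

import multiprocessing as mp
import numpy as np

# Allocate once, before any workers are forked. lock=False returns a raw
# ctypes array, which numpy can view without copying.
shared = mp.Array('d', 10_000, lock=False)
big_array = np.frombuffer(shared, dtype=np.float64)
big_array[:] = 0.0  # workers forked after this point see the same buffer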

I don't have the time today to look into this, but thought I'd leave it here for reference.

steven-murray commented 5 years ago

I think this is fixed in 450d0966cd8b92beae0052f41466dd605da313ae. It turns out that it's not the numpy array in particular that's causing the leak; it's the circular reference between the LikelihoodComputationChain and the various Cores and Likelihoods. Adding an explicit level-2 garbage collection on every call to the likelihood fixes the problem, at least in the MWE above. The garbage collection should be of minimal computational cost compared to everything else going on.
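
To make the mechanism concrete, here is a self-contained sketch with stand-in classes (not the actual py21cmmc code): because the chain and its cores reference each other, reference counting alone never frees the copies that each pickle/unpickle round creates; only the cycle collector reclaims them.

import gc
import pickle

class Core:
    def __init__(self):
        self.big = bytearray(100 * 2**20)  # stand-in for cached boxes

class Chain:
    def __init__(self, core):
        self.core = core
        core.chain = self  # back-reference: creates a reference cycle

for _ in range(5):
    copy = pickle.loads(pickle.dumps(Chain(Core())))
    del copy       # refcounts never hit zero, so ~100 MB stays stranded...
    gc.collect(2)  # ...until an explicit full (level-2) collection runs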