JQIamo / pylcp

Python tools for laser cooling physics

pathos implementation in examples doesn't work without kernel restart #42

Open SteveEckel opened 3 years ago

SteveEckel commented 3 years ago

We have been implementing parallel processing using pathos, with code like this:

tmax = 1e4
args = ([0, tmax], )
kwargs = {'t_eval':np.linspace(0, tmax, 5001),
          'random_recoil':True,
          'progress_bar':False,
          'max_scatter_probability':0.5,}

def generate_random_solution(x):
    # We need to generate random numbers to prevent solutions from being seeded
    # with the same random number.
    np.random.rand(256*x)
    eqn.evolve_motion(*args, **kwargs)

    return eqn.sol

Natoms = 8
chunksize = 4
sols = []
progress = progressBar()
for jj in range(int(Natoms/chunksize)):
    with pathos.pools.ProcessPool(nodes=4) as pool:
        sols += pool.map(generate_random_solution, range(chunksize))
    progress.update((jj+1)/int(Natoms/chunksize))

The problem is that the eqn being evolved inside generate_random_solution is not actually updated when eqn is changed in the main program.

SteveEckel commented 3 years ago

If instead we execute using the built-in map function:

sols = list(map(generate_random_solution, range(2)))

then things work properly and changing the eqn variable works as expected.

So it would appear that the problem is in pathos: it is as if it does not re-pickle things when you generate a new pool.

SteveEckel commented 3 years ago

Some additional information: it would appear that pathos.pools.ProcessPool is persistent, so it does not shut down and delete itself outside of the with statement. Trying to close it and restart it causes serious problems, as in the code below:

tmax = 1e3
args = ([0, tmax], )
kwargs = {'t_eval':np.linspace(0, tmax, 5001),
          'random_recoil':True,
          'progress_bar':False,
          'max_scatter_probability':0.5,}

def generate_random_solution(x):
    # We need to generate random numbers to prevent solutions from being seeded
    # with the same random number.
    np.random.rand(256*x)
    eqn.evolve_motion(*args, **kwargs)

    return eqn.sol

if 'pool' in locals():
    pool.restart()
else:
    pool = pathos.pools.ProcessPool(nodes=4)

Natoms = 4
chunksize = 4
sols = []
progress = progressBar()
for jj in range(int(Natoms/chunksize)):
    sols += pool.map(generate_random_solution, range(chunksize))
    progress.update((jj+1)/int(Natoms/chunksize))

pool.close()

This code executes the first time without problems. The second time it also runs, but I get the following error in the terminal running Jupyter: `libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn't close file`. After the second run it does apparently re-pickle all the local variables and execute with the appropriate changes to eqn, but tick marks no longer appear on matplotlib graphs, which is an interesting side effect. Finally, the third execution fails with repeated `libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn't close file` errors.

SteveEckel commented 3 years ago

So the problem is clearly the use of global variables in generate_random_solution. All variables should be passed in through function arguments, i.e. as local variables. Only globally-imported packages should be used inside generate_random_solution, as these always remain unchanged until a kernel restart. The following code seems to fix all the problems:

tmax = 1e3
args = ([0, tmax], )
kwargs = {'t_eval':np.linspace(0, tmax, 5001),
          'random_recoil':True,
          'progress_bar':False,
          'max_scatter_probability':0.5,}

def generate_random_solution(mapargs):
    """
    A function for each subprocess to execute.  

    Parameters
    ----------
        mapargs : tuple
            tuple of (index, eqn, args, kwargs): the index of the
            simulation, the eqn object on which to call evolve_motion,
            and the args and kwargs passed to evolve_motion.

    Notes
    -----
    All variables inside this function are defined locally; no
    global variables can be used, as these are not overwritten
    once the function is defined and dill-pickled by pathos.
    This gets around the persistence problem of the ProcessPool.
    """
    # Unpack the unique arguments for this function call:
    x, eqn, args, kwargs = mapargs

    # Use the index to define a unique starting random number
    np.random.rand(256*x)

    # Evolve motion:
    eqn.evolve_motion(*args, **kwargs)

    return eqn.sol

# Start the parallel process now, after defining generate_random_solution:
pool = pathos.pools.ProcessPool(nodes=4)

# Make a list of arguments for the generate_random_solution function 
# that we will map onto the parallel pool.  We copy the eqn object 
# so that each process executes on a different memory subspace, 
# otherwise we will have fun problems.
Natoms = 96
mapargs = []
for jj in range(Natoms):
    mapargs.append((jj, copy.copy(eqn), args, kwargs))

# Map this list of tuples onto the parallel pool, in chunks so we 
# can monitor progress.
chunksize = 4
sols = []
progress = progressBar()
for jj in range(int(Natoms/chunksize)):
    sols += pool.map(generate_random_solution, mapargs[jj*chunksize:(jj+1)*chunksize])
    progress.update((jj+1)/int(Natoms/chunksize))

SteveEckel commented 3 years ago

This implementation needs to be updated in the F=2->F'=3 1D optical molasses example. The 3D MOT temperature example with the OBEs uses the old run_single_sim.py version of parallel processing; that example can be updated as well.

Lufter commented 9 months ago

Hi Steve, I want to follow up on this issue to see if there is an updated solution. I am looking at the recoil-limited MOT example, 10_recoil_limited_MOT.ipynb, and noticed that generate_random_solution always returns the same eqn.sol for all 1024 runs of the simulation, as follows:

def generate_random_solution(x, eqn=eqn, tmax=tmax):
    # We need to generate random numbers to prevent solutions from being seeded
    # with the same random number.
    import numpy as np
    np.random.rand(256*x)

    eqn.evolve_motion(
        [0, tmax],
        t_eval=np.linspace(0, tmax, 1001),
        random_recoil=True,
        progress_bar=False,
        max_step=1.
    )
    return eqn.sol

So, I am wondering if there is an updated version of this example. Would you recommend using another multiprocessing package instead? Thank you!

SteveEckel commented 9 months ago

Hi @Lufter. What you bring up is a slightly different issue, but nonetheless an important one. I just tested 10_recoil_limited_MOT.ipynb on my Mac with python and noticed that every solution within a given chunk was indeed identical (but not every solution, as you report). This problem is related to how we are seeding the random number generators in the subprocesses.

Numpy has documentation for seeding subprocess random number generators. The preferred method is to pass each subprocess a uniquely seeded random number generator. Indeed, evolve_motion has an rng keyword argument to allow one to be passed in, but we never updated the example to show the implementation. So, adding that to the fix for the issue described above, we get the following code:

import pathos
import copy

def generate_random_solution(mapargs):
    """
    A function for each subprocess to execute.  

    Parameters
    ----------
        mapargs : tuple
            tuple of (eqn, args, kwargs): the eqn object on which
            to call evolve_motion, and the args and kwargs passed
            to evolve_motion.  Note that `rng` must be defined
            in kwargs.

    Notes
    -----
    All variables inside this function are defined locally; no
    global variables can be used, as these are not overwritten
    once the function is defined and dill-pickled by pathos.
    This gets around the persistence problem of the ProcessPool.
    """
    # Unpack the unique arguments for this function call:
    eqn, args, kwargs = mapargs

    # Evolve motion:
    eqn.evolve_motion(*args, **kwargs)

    return eqn.sol

# Start the parallel process now, after defining generate_random_solution:
pool = pathos.pools.ProcessPool(nodes=4)

# Specify the number of atoms that we are simulating here:
Natoms = 8

# Make a random number generator and use that to seed the random
# number generators for all the subprocesses through spawn.  These
# child rngs will then be passed to evolve_motion through the `rng`
# keyword argument.
rng = np.random.default_rng()
child_rngs = rng.spawn(Natoms)

# Define the common arguments and keyword arguments that will be 
# passed to the equations:
args = ([0, tmax], )
kwargs = {'t_eval':np.linspace(0, tmax, 5001),
          'random_recoil':True,
          'progress_bar':False,
          'max_scatter_probability':0.5,}

# Make a list of arguments for the generate_random_solution function 
# that we will map onto the parallel pool.  We copy the eqn object 
# so that each process executes on a different memory subspace, 
# otherwise we will have fun problems.
mapargs = []
for jj in range(Natoms):
    mapargs.append((copy.copy(eqn), args, {**kwargs, 'rng':child_rngs[jj]}))

# Map this list of tuples onto the parallel pool, in chunks so we 
# can monitor progress.
chunksize = 4
sols = []
progress = progressBar()
for jj in range(int(Natoms/chunksize)):
    sols += pool.map(generate_random_solution, mapargs[jj*chunksize:(jj+1)*chunksize])
    progress.update((jj+1)/int(Natoms/chunksize))

If you copy and paste this code into the appropriate cell in the example (replacing the initial, problematic code), it should work and give you eight different solutions. If it doesn't, let us know and we'll try to diagnose the problem further.