slower predict calls in parallel with george and emcee

Hi, I'm running into an issue with george and emcee. My likelihood function involves a call to gp.predict. This is fast in serial, but when I try to run emcee with a multiprocessing Pool, each gp.predict call takes a longer time. (Not sure whether I should have opened an issue here or at emcee!)

Below is a minimum failing example. When I run this on my 24-core machine, this is the output:

Serial took 5.1 seconds
Parallel multiprocessing took 39.5 seconds

The parallel one is much slower in total. Individual gp.predict calls are ~10x slower in parallel (serial: ~0.002s, parallel: ~0.02s).

I tried adding:

import os
os.environ["OMP_NUM_THREADS"] = "1"

Then I get this output:

Serial took 5.1 seconds
Parallel multiprocessing took 1.6 seconds

This makes the parallel job faster than the serial job, by about 3x. However, the individual predict calls still take 2-3x longer (serial: ~0.002s, parallel: ~0.004-0.006s), so I'm not getting the speedup I expect.

This holds up for varying N, nwalkers, and ndim. I am having this issue in my full-scale code with a large training set and high-dimensional statistics. There the predict calls take up to 6x longer in parallel (with OMP_NUM_THREADS="1").

Could this have to do with the same gp object not being able to be called by multiple threads at the same time? Is there a way around this / am I missing something with the parallelization? Thank you!

Minimum failing example:

import time
import numpy as np

import emcee
import george
from george import kernels

def log_prob(theta):
    xval = np.random.random()*10 # to be in same xrange
    s = time.time()
    pred, pred_var = gp.predict(y, xval, return_var=True)
    e = time.time()
    print("GP predict time:", e-s) #This will print many lines, may want to comment out
    return pred

## Set up MCMC parameters
nwalkers = 24
ndim = 2
np.random.seed(42)
initial = np.random.randn(nwalkers, ndim)
nsteps = 100

## Build GP (from george tutorial)
N = 1500
np.random.seed(1234)
x = 10 * np.sort(np.random.rand(N))
yerr = 0.2 * np.ones_like(x)
y = np.sin(x) + yerr * np.random.randn(len(x))
kernel = np.var(y) * kernels.ExpSquaredKernel(0.5)
gp = george.GP(kernel)
gp.compute(x, yerr)

## Serial
from multiprocessing import cpu_count
ncpu = cpu_count()
print("{0} CPUs".format(ncpu))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
start = time.time()
sampler.run_mcmc(initial, nsteps)
end = time.time()
serial_time = end - start
print("Serial took {0:.1f} seconds".format(serial_time))

## Parallel
from multiprocessing import Pool
with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
    start = time.time()
    sampler.run_mcmc(initial, nsteps)
    end = time.time()
    multi_time = end - start
    print("Parallel multiprocessing took {0:.1f} seconds".format(multi_time))

dfm / george

slower predict calls in parallel with george and emcee #133