Closed dflemin3 closed 4 years ago
Specifically, I really need to avoid copying the full GP. What would likely be better is to pass everything the GP needs, e.g. theta, y, and the current hyperparameter vector, and instantiate the GP within each function call for each process. Even for GPs with thousands of data points, initialization, including the compute call, should be of order 1 second (see the george docs), so re-initializing a GP in each process should be cheaper than serializing the full GP object.
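The pattern above can be sketched as follows. This is a minimal illustration, not approxposterior's actual code: a hand-rolled squared-exponential kernel stands in for george's GP so the example is self-contained, and `rebuild_and_score` is a hypothetical worker function. The point is that each process receives only plain ndarrays (cheap to pickle) and reconstructs the GP locally.

```python
import numpy as np

def rebuild_and_score(theta, y, hyperparams):
    """Re-initialize a GP from raw arrays and return its log-likelihood.

    In the real code this would build a george GP and call gp.compute();
    here a squared-exponential kernel stands in so the sketch runs as-is.
    """
    amp, scale = hyperparams
    # Covariance matrix: the expensive "compute" step, with a small
    # jitter term on the diagonal for numerical stability.
    d2 = (theta[:, None] - theta[None, :]) ** 2
    K = amp * np.exp(-0.5 * d2 / scale**2) + 1e-8 * np.eye(len(theta))
    # Cholesky-based Gaussian log-likelihood.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

# Each multiprocessing worker would call this with plain arrays, e.g.
# via pool.starmap(rebuild_and_score, tasks), instead of receiving a
# pickled GP object.
theta = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * theta)
ll = rebuild_and_score(theta, y, hyperparams=(1.0, 0.3))
```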
I've removed multiprocessing for now as its current overhead is prohibitively slow and fixing it will require a substantial rewrite.
Currently, the multiprocessing implementation for parallelizing GP optimizations and new design point selection is slow, presumably because spinning up new processes is expensive: each one requires pickling the GP and sending it to the process, which is costly given the GP's non-trivial structure and large-ish covariance matrix.
Potential fixes include sharing the data with each process using a scheme like the one documented here.
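One such scheme, sketched under the assumption that the training arrays are what need sharing, uses Python's `multiprocessing.shared_memory` (3.8+) so the data is copied once into a named block and workers attach by name rather than receiving a pickled copy. The array names here are illustrative, not approxposterior's.

```python
import numpy as np
from multiprocessing import shared_memory

theta = np.random.rand(1000, 3)  # stand-in for the GP's design points

# Parent process: place the array in a named shared-memory block.
shm = shared_memory.SharedMemory(create=True, size=theta.nbytes)
shared = np.ndarray(theta.shape, dtype=theta.dtype, buffer=shm.buf)
shared[:] = theta  # one copy, visible to all attaching processes

# Worker side: attach by name (only the short name string would be
# pickled and sent), then wrap the buffer with the known shape/dtype.
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(theta.shape, dtype=theta.dtype, buffer=attached.buf)
same = np.array_equal(view, theta)

# Cleanup: drop array views before closing, then every attachment
# closes and the creator unlinks the block.
del view, shared
attached.close()
shm.close()
shm.unlink()
```

In a real worker the attach/wrap step would run inside the worker function, with the block name, shape, and dtype passed as cheap arguments.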