XanaduAI / thewalrus

A library for the calculation of hafnians, Hermite polynomials and Gaussian boson sampling.
https://the-walrus.readthedocs.io
Apache License 2.0

Parallelise probabilities #161

Closed thisac closed 4 years ago

thisac commented 4 years ago

Context: The probabilities function could be easily parallelised, potentially providing nice speed-ups.

Description of the Change: Adds a parallelisation option to the probabilities function so that the probabilities can be calculated simultaneously. Tests are also adapted to include running the probabilities function with parallel=True and parallel=False.

Benefits: The probability calculations can be parallelised and thus potentially completed much more quickly.

Possible Drawbacks: Since OpenMP already parallelises in the background, the implementation in this PR does not necessarily provide any speedup, but rather gives the option to parallelise differently.

Related GitHub Issues: N/A
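A minimal usage sketch of the new option (the vacuum-state inputs below are only illustrative; probabilities, cutoff and parallel are the names used in this PR):

import numpy as np
from thewalrus.quantum import probabilities

n = 2
mu = np.zeros(2 * n)        # zero displacement
cov = np.identity(2 * n)    # vacuum covariance (hbar = 2 convention)

# Compute the Fock-basis probabilities, dispatching the calculation
# through Dask when parallel=True.
probs = probabilities(mu, cov, cutoff=3, parallel=True)
print(probs.shape)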

josh146 commented 4 years ago

@thisac, @nquesada: It looks like the build is failing because llvmlite no longer provides wheels for Python 3.5 on the latest version.

You could try pinning llvmlite and numba to the last versions that supported Python 3.5; however, it probably makes sense to deprecate Python 3.5 support instead.

Do you want to make another PR that officially removes Python 3.5 support? This would involve updating setup.py, the README, the installation instructions, and the Travis/Circle/Appveyor configs.
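For reference, a hypothetical sketch of what the setup.py part of such a change could look like (the actual fields in thewalrus's setup.py may differ):

from setuptools import setup

setup(
    name="thewalrus",
    python_requires=">=3.6",  # drop Python 3.5
    classifiers=[
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
    ],
)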

nquesada commented 4 years ago

I think we should stop supporting Python 3.5, as @josh146 suggests.

codecov[bot] commented 4 years ago

Codecov Report

Merging #161 into master will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #161   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1002      1080   +78     
=========================================
+ Hits          1002      1080   +78     
Impacted Files                 Coverage Δ
thewalrus/quantum.py           100.00% <100.00%> (ø)
thewalrus/samples.py           100.00% <0.00%> (ø)
thewalrus/csamples.py          100.00% <0.00%> (ø)
thewalrus/symplectic.py        100.00% <0.00%> (ø)
thewalrus/fock_gradients.py    100.00% <0.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update cf7a19e...8a519ad.

nquesada commented 4 years ago

Hey @thisac: Could you add some screenshots of the improvements you found when tweaking OMP_NUM_THREADS here? It may also be worth mentioning that in the docstring, for example explaining to the user when it is worth setting parallel=True and how to do so by exporting environment variables. Other than that, I think this is ready to be merged.

thisac commented 4 years ago

Output from some simple benchmarks

Below are the results from some simple benchmarks comparing how the use of parallelisation can speed things up (or slow things down).

OpenMP parallelises in the background; this can be turned off by setting the environment variable OMP_NUM_THREADS=1, so that only a single thread is used. By default all available threads are used (8 in this case).

Dask can use either the "threads" scheduler (multiple threads in the same process) or the "processes" scheduler (data is sent to separate processes). "threads" is bound by the GIL and is thus best used with non-Python objects, while "processes" works best with pure Python code (at the cost of a slight overhead). See https://docs.dask.org/en/latest/setup/single-machine.html for more information.
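As a small illustration of the two schedulers (the squaring task below is just a placeholder; dask.delayed and dask.compute are standard Dask API):

import dask

@dask.delayed
def task(x):
    # Stand-in for one probability calculation
    return x ** 2

# "threads" shares one process (GIL-bound); "processes" spawns worker processes
results = dask.compute(*[task(i) for i in range(8)], scheduler="processes")
print(results)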

As seen below, the fastest run is obtained by turning off the OpenMP parallelisation and using only the Dask parallelisation in probabilities with the "processes" scheduler.

from thewalrus.quantum import probabilities as p
import numpy as np

n = 4
mu = np.random.random(2*n)            # random means vector
cov = np.random.random((2*n, 2*n))
cov += cov.conj().T                   # symmetrise the random matrix used as the covariance

With OpenMP parallelisation and with the "threads" scheduler in Dask

print("\nNo parallel excecution with Dask")
%timeit p(mu, cov, cutoff=4, parallel=False)

print("\nWith parallel excecution with Dask")
%timeit p(mu, cov, cutoff=4, parallel=True)
No parallel excecution with Dask
419 ms ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With parallel excecution with Dask
2.72 s ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Without OpenMP parallelisation and with "threads" sheduler in Dask

%set_env OMP_NUM_THREADS=1

print("\nNo parallel excecution with Dask")
%timeit p(mu, cov, cutoff=4, parallel=False)

print("\nWith parallel excecution with Dask")
%timeit p(mu, cov, cutoff=4, parallel=True)
env: OMP_NUM_THREADS=1

No parallel excecution with Dask
632 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With parallel excecution with Dask
748 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Without OpenMP parallelisation and with "processes" sheduler in Dask

%set_env OMP_NUM_THREADS=1

print("\nNo parallel excecution with Dask")
%timeit p(mu, cov, cutoff=4, parallel=False)

print("\nWith parallel excecution with Dask")
%timeit p(mu, cov, cutoff=4, parallel=True)
env: OMP_NUM_THREADS=1

No parallel execution with Dask
605 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With parallel execution with Dask
302 ms ± 6.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

josh146 commented 4 years ago

That's great @thisac! It might also be interesting to see how this scales as the number of modes/cutoff increases.
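A rough sketch of how such a scaling check could look (the cutoff values and test state are only illustrative):

import time
import numpy as np
from thewalrus.quantum import probabilities

n = 4
mu = np.zeros(2 * n)
cov = np.identity(2 * n)

for cutoff in (2, 4, 6):
    for parallel in (False, True):
        start = time.perf_counter()
        probabilities(mu, cov, cutoff=cutoff, parallel=parallel)
        elapsed = time.perf_counter() - start
        print(f"cutoff={cutoff}, parallel={parallel}: {elapsed:.2f} s")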

thisac commented 4 years ago

I've moved the environment variable changes to the tests (so no changes are being made in the probabilities function itself), utilising monkeypatch.setenv (thanks @josh146 :1st_place_medal:).

I've also checked locally that OMP_NUM_THREADS=1 whenever parallel=True during testing. The docstring has also been slightly updated.
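For context, a minimal sketch of the monkeypatch approach (the test body below is illustrative, not the exact test in the PR):

import numpy as np
from thewalrus.quantum import probabilities

def test_probabilities_parallel(monkeypatch):
    # Limit OpenMP to a single thread so it does not compete with Dask
    monkeypatch.setenv("OMP_NUM_THREADS", "1")
    n = 2
    mu = np.zeros(2 * n)
    cov = np.identity(2 * n)
    probs = probabilities(mu, cov, cutoff=3, parallel=True)
    assert np.all(probs >= 0)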

nquesada commented 4 years ago

One can set the number of OMP threads by doing os.environ["OMP_NUM_THREADS"] = "1".
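For example (setting the variable early in the script, before the OpenMP runtime is initialised by the heavy computation):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # limit OpenMP to a single thread

from thewalrus.quantum import probabilities  # imported after setting the variable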