Bad Performance of Parallelization with On-the-fly Transformation

yuxuanzhuang commented 3 years ago

Expected behaviour

Analysis of a Universe with on-the-fly transformation scales good (reasonable).

Actual behaviour

The scaling performance is really bad even with two cores.

Code

import MDAnalysis as mda
from MDAnalysis import transformations as trans
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['LONG_TRAJ']) #  9000 frames

fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

n_jobs = [1, 2, 4, 8, 16, 32, 64]

rmsd = parallel_rmsd(u.atoms, u.atoms)
rmsd.run(n_blocks=nj,
               n_jobs=nj) #  timeit

Reason

In some Transformations includes numpy.dot which itself is multi-threaded. So the cores are oversubscribed.

Possible solution

define NUM_THREADS=1 for numpy (https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads). which is surprisingly faster even for serial (single-core) performance.
use cupy(https://cupy.dev/) to leverage the GPU power. (only replacing the numpy.dot operation of the Transformation)

Benchmarking result

Linear Scaling RMSD with Transformation Comparison

Benchmarking system:
- AMD EPYC 7551 32-Core Processor
- RTX 2080 Ti
- cephfs file system

Currently version of MDAnalysis:

(run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0 dev (run python -c "import pmda; print(pmda.__version__)") (run python -c "import dask; print(dask.__version__)")

orbeckst commented 3 years ago

Your results are interesting.

What is the single blue bar "RMSD" result? Is this running the code in serial but allow multi threaded operations? I don't understand why it improves with n_cores/

Interestingly, cupy does not do better than single threaded numpy. This reminds me of results from a REU where Robert Delgado found the same http://doi.org/10.6084/m9.figshare.3823293.v1 because the data transfer to/from GPU was expensive.

yuxuanzhuang commented 3 years ago

The blue bar is the RMSD without any Transformation, just as a reference.

yuxuanzhuang commented 3 years ago

So ideally we would like to only use one thread for numpy here. But since it's a global setting, I am not sure it can be set locally.

orbeckst commented 3 years ago

If you set it as an environment variable from Python, does it work?

Is there a with block (context manager) available to do it?

yuxuanzhuang commented 3 years ago

Actually there is one. threadpoolctl. I am even able to use it as a decorator. The issue is, ideally, it should be able to decorate _single_frame (so everything inside would be limit to one thread), but I am not sure if there's a way to only do it once (e.g. inheritance from the Base class)

For the Transformation case, I guess it shouldn't be limited to only the parallel code, as it also improves (a lot) for serial code. So I am starting a PR (https://github.com/MDAnalysis/mdanalysis/pull/2950) for that.

Also, I should double-check the benchmark with another benchmarking system.

orbeckst commented 3 years ago

@yuxuanzhuang have you tried PMDA again since PR MDAnalysis/mdanalysis#2950 got merged?

yuxuanzhuang commented 3 years ago

EDIT: Okay, just ignore the following benchmark. I realize I was using the current pmda master branch, which does not support on-the-fly transformations (and just ignore it when reconstructing the Universe). Maybe we should push #132 forward since we will soon have a beta 2.0.0 MDA version release.

Sorry, I need to find some time to look into it. It is really weird to see parallel RMSD performs better with a single core... maybe I messed up something.

from MDAnalysis.analysis.rms import RMSD as serial_rmsd
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['SHORT_TRAJ'])
fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

rmsd = serial_rmsd(u.atoms, u.atoms)
%time rmsd.run()

CPU times: user 20.6 s, sys: 961 ms, total: 21.5 s Wall time: 19.7 s

rmsd = parallel_rmsd(u.atoms, u.atoms)
%time rmsd.run(n_blocks=1, n_jobs=1)

CPU times: user 11.4 s, sys: 0 ns, total: 11.4 s Wall time: 11.4 s

yuxuanzhuang commented 3 years ago

The performance in our cluster is pretty bumpy recently so I am using my NUC for benchmarking instead (with the code above), here is the result. The bottleneck from transformations no longer exists.

orbeckst commented 3 years ago

Which versions of PMDA and MDA did you use for the above benchmark that shows similar scaling for RMSD and RMSD+fit+rot trans?

yuxuanzhuang commented 3 years ago

Which versions of PMDA and MDA did you use for the above benchmark that shows similar scaling for RMSD and RMSD+fit+rot trans?

The develop branch of MDA and PR #132 branch of PMDA.

MDAnalysis / pmda