PositionAverager Transformation ends up with wrong results with parallel analysis

yuxuanzhuang commented 3 years ago

Expected behavior

The results are the same as serial analysis

Actual behavior

Due to the splitting approach, a new PositionAverager will be created for each block; no previous memory (self.coord_array) is saved.

Code to reproduce the behavior

import MDAnalysis as mda
from MDAnalysisData.adk_equilibrium import fetch_adk_equilibrium
adk = fetch_adk_equilibrium()
import matplotlib.pyplot as plt

from MDAnalysis.transformations.positionaveraging import PositionAverager

from MDAnalysis.analysis.rms import RMSD as serial_RMSD
from pmda.rms import RMSD as parallel_RMSD

u = mda.Universe(adk.topology, adk.trajectory)

pos_avg_trans = PositionAverager(1000)
u.trajectory.add_transformations(pos_avg_trans)

rmsd = serial_RMSD(u.atoms, u.atoms).run()
rmsd = parallel_RMSD(u.atoms, u.atoms).run(n_blocks=8)

# plot rmsd

Serial RMSD
Parallel RMSD

Current version of MDAnalysis

Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0-dev
Which version of Python (python -V)? 3.8
Which operating system? Ubuntu 20

orbeckst commented 3 years ago

There are going to be some transformations that are not parallel-safe. It's great when we can make it work but that might not always be easy and require different algorithms.

Can we just add a boolean attribute to the TransformationBase class parallelizable = False and then PMDA and friends can check? If we know that the Transformation can be run in parallel, we set it explicitly to True.

mnmelo commented 3 years ago

Yes, I think this is precisely the case here. Position averaging is intrinsically history-dependent, and as such it'll not play nice with block parallelization.

orbeckst commented 3 years ago

@yuxuanzhuang let's add parallelizable = False as an attribute to TransformationBase and have derived classes change it if they can be parallelized with split-apply-combine/block parallelization.

mnmelo commented 3 years ago

I am not sure I agree with the way this was implemented. parallelizable is now used as a kwarg to the Analysis __init__ to indicate parallelization compatibility. I think it'd have been much more pythonic to instead have parallelizable be a class attribute, since it should be a general characteristic of each Analysis, and not dependent on each instantiation.

Later, if the user wants to control parallelization from the instantiation/run of an Analysis, PMDA and friends will/should provide ways to force serial behavior.

What do you think? If you agree with a change, we're still in time to correct the API before 2.0.0.

yuxuanzhuang commented 3 years ago

I agree it is more pythonic to have it as a class attribute but given we don't yet have a definite API for parallel analysis nor is this parallelizable checked anywhere yet, it still feels less defined whether it is an internal indicator or something differs from instance to instance. For example, this parallelizable only indicates the ability to use this Transformation in block analysis, but might be different in other parallel conditions, e.g. parallel analysis among ensemble simulations. How should we deal with that?

mnmelo commented 3 years ago

First of all apologies that I mistakenly exemplified with the Analysis case, not the Transformations.

Regarding parallelizable I was assuming we were interpreting this as frame-wise 'split-apply-combine' parallelizability. It's perhaps best not to overload this single attr with other meanings.

Are there examples where the same transformation might be parallelizable or not, depending on intialization? I mean here the framewise parallelizability, but I guess we could discriminate multiple parallel possibilities if instead of a single attr we have this as a dict; i.e.: {'split-apply-combine': True, 'ensemble':True}.

MDAnalysis / mdanalysis