[Discussion] All Pairs/ All-All Analysis

jdetle commented 8 years ago

Issue

I was talking with @orbeckst and @sseyler at Scipy and I brought up the fact that it seems for many analysis cases require the creation of a matrix that calculates some value for all possible i,j frames in a trajectory. This is a doubly nested loop. IIRC, In the case of ~MDAnalysis.analysis.psa it is a loop over all these doubly nested loops. These are problems that don't make much sense to abstract into a _single_frame call. Tools presented at scipy like numba, dask, and loopy more easily optimize a doubly nested loop and other things that can be easily interpreted into C code. Although the single_frame idea easily corresponds to the idea of a map reduce operation, I contend that it is a poor abstraction in the use case of PCA, PSA, and Diffusion Maps (and probably hurts speedup).

Proposal

This is something that I want @MDAnalysis/coredevs to discuss so we can figure out the best method. Some abstraction like:

def _call(traj_index, ts):
   # call either single_frame() and/or all_pairs() here 
   pass

def _single_frame(traj_index, ts):
   pass

def _all_pairs(traj_index, ts):
   pass

def run(self):
        """Perform the calculation"""
        logger.info("Starting preparation")
        self._prepare()
        for i, ts in enumerate(
                self._trajectory[self.start:self.stop:self.step]):
            self._frame_index = i
            self._ts = ts
            self._call(i, ts)
            # logger.info("--> Doing frame {} of {}".format(i+1, self.n_frames))
            self._pm.echo(self._frame_index)
        logger.info("Finishing up")
        self._conclude()
        return self

richardjgowers commented 8 years ago

If I'm understanding right, the current analysis base only handles single frame by frame. So this is for a new pattern that does pairs of frames?

On Fri, 15 Jul 2016, 21:40 John Detlefs, notifications@github.com wrote:

Issue

I was talking with @orbeckst https://github.com/orbeckst and @sseyler https://github.com/sseyler at Scipy and I brought up the fact that it seems for many analysis cases require the creation of a matrix that calculates some value for all possible i,j frames in a trajectory. This is a doubly nested loop. IIRC, In the case of ~MDAnalysis.analysis.psa it is a loop over all these doubly nested loops. These are problems that don't make much sense to abstract into a _single_frame call. Tools presented at scipy like numba, dask, and loopy more easily optimize a doubly nested loop and other things that can be easily interpreted into C code. Although the single_frame idea easily corresponds to the idea of a map reduce operation, I contend that it is a poor abstraction in the use case of PCA, PSA, and Diffusion Maps (and probably hurts speedup). Proposal

This is something that I want @MDAnalysis/coredevs https://github.com/orgs/MDAnalysis/teams/coredevs to discuss so we can figure out the best method. Some abstraction like:

def call(traj_index, ts):

call either single_frame() and/or all_pairs() here

pass def single_frame(traj_index, ts): pass def all_pairs(traj_index, ts): pass

ef run(self): """Perform the calculation""" logger.info("Starting preparation") self._prepare() for i, ts in enumerate( self._trajectory[self.start:self.stop:self.step]): self._frame_index = i self._ts = ts self._call

logger.info("--> Doing frame {} of {}".format(i+1, self.n_frames))
        self._pm.echo(self._frame_index)
    logger.info("Finishing up")
    self._conclude()
    return self
— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/905, or mute the thread https://github.com/notifications/unsubscribe-auth/AI0jBwSY_7DCPY5gaBrhD8teCNG621znks5qV_A5gaJpZM4JNw_3 .

jdetle commented 8 years ago

I think I screwed up the pseudocode a little bit. Sorry I'm at scipy and a little distracted. _all_pairs should only be called once in theory, returning a matrix of all possible pairwise values over a trajectory.

orbeckst commented 8 years ago

“All pairs” is a common pattern, and it would be very useful to think about how to do this, or at least, if it would be useful to have a common API.

On 15 Jul, 2016, at 15:51, John Detlefs notifications@github.com wrote:

I think I screwed up the pseudocode a little bit. Sorry I'm at scipy and a little distracted. _all_pairs should only be called once in theory.

Oliver Beckstein * orbeckst@gmx.net skype: orbeckst * orbeckst@gmail.com

jdetle commented 8 years ago

Many spectral methods require the creation of a matrix that is essentially data x data, scikit learn has the module ~sklearn.metrics.pairwise in which it has the function def pairwise_distances https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L1129. This data x data matrix is very common.

PCA (in one of two algorithms) requires the generation of an all pairs matrix in which its all pairs of possible covariances instead of metrics, which further reinforces the convenience of this abstraction.

As an aside, scikit-learn uses SVD of a data matrix to generate the principal components. The proof for why SVD creates the principal components is fairly elegant!

jdetle commented 8 years ago

@richardjgowers I want to address this before moving forward in PCA. I can update my schedule accordingly, but this will facilitate tremendous speedups for plenty of analysis methods.

jbarnoud commented 8 years ago

Just some thought. I would call the method _all_frame_pairs to be more explicit. Also, would it be possible to have analyses defining a _single_frame_pair method, and let a generic method take care of running all the pairs? In the same idea as for _single_frame, it would remove some responsibility to the analysis dev, and let us deal with mapping/reducing the matrix latter on. There always is the possibility to overwrite the method that build the whole matrix for strange cases.

jdetle commented 8 years ago

There should be a significant speedup simply by writing an all_frame_pairs function in cython that would not happen in a contrived single_frame function. This should be the first goal of a speedup from this abstraction. Map-reduce parallelization is a more daunting task. In my opinion, first we should figure out how to implement a working parallelization and then figure out any appropriate abstractions to help out an analysis dev second. I think this is a situation where it is beneficial to play around with things and figure out what works well first and then clean things up later.

For the purpose of getting some work done, I could make the distance matrix in diffusion maps significantly faster right now simply by refactoring the code to use a cythonized all_frame_pairs function. This is a significantly more straightforward task than parallelization. It also applies to quickly calculating a covariance matrix in PCA.

@richardjgowers This is what I would like to work on right now. Our PCA expert is on vacation and the speedups from cython and an all_frame_pairs function should be noticeable. What do you think?

jbarnoud commented 8 years ago

In my opinion, first we should figure out how to implement a working parallelization and then figure out any appropriate abstractions to help out an analysis dev second. I think this is a situation where it is beneficial to play around with things and figure out what works well first and then clean things up second.

Sounds reasonable.

dotsdl commented 8 years ago

@sseyler had some opinions on this last we spoke. Thoughts?

kain88-de commented 8 years ago

About the PCA. This proposal doesn't affect it. See my last comments in your PR.

For DMaps this could be useful. Do I assume correct the idea is to iterate over all frame pairs and create a matrix (n_frames, n_frames)? This would indeed be a benefitial addition as it makes the pairwise-frame RMSD matrix calculation needed for the DMaps easier.

I would say we rather have a frame_pair method that will be called for all frame_pairs and the BaseAnalysis class handles the iteration. This would look something like this.

def __init__(self, ..., framepair=False):
    self._framepair = framepair

def _frame_pair(self, i, ag_i, j, ag_j):
    # about the API here. This is something I would like for the normal `_single_frame` method
    # as well. But we can also just create more private variables for the iteration but I find that
    # a bit ugly (it's a matter of taste)
    raise NotImplementedError("what should I do with frame 'i' and 'j'?")

def run(self):
    if self._framepair:
        self._iterate_all_frame_pairs()
    else:
        self._iterate_frames()

def _iterate_all_frame_pairs(self):
    # not sure how this should look exactly.
    atoms = self.atoms
    atoms_copy = # copy atoms
    trajectory_copy = atoms_copy.universe.trajectory
    for i, ts_i in enumerate(self._trajectrory[...]):
        # here the iteration can either go really over all pairs or we have some parameter
        # that the result is symmetric and we only iterate over the upper triangle matrix.
        for j, ts_j in enumerate(trajectory_copy[...]):
            self._frame_pair(i, atoms, j, atoms_copy) 

def _iterate_frames(self):
    # current content of run

This example won't be the final API but it goes along the lines what I would think we should do when iterating over frame pairs in an analysis. I have to think about it a little bit more though.

richardjgowers commented 8 years ago

@jdetle the only thing I'd say if you want to chase performance is you need to benchmark properly first using cprofile or something. Then once you've proved we're spending time in a particular area we can look at cythonising it.

MDAnalysis / mdanalysis

[Discussion] All Pairs/ All-All Analysis #905

Issue

Proposal

call either single_frame() and/or all_pairs() here

logger.info("--> Doing frame {} of {}".format(i+1, self.n_frames))