Open jdetle opened 8 years ago
If I'm understanding right, the current analysis base only handles single frame by frame. So this is for a new pattern that does pairs of frames?
On Fri, 15 Jul 2016, 21:40 John Detlefs, notifications@github.com wrote:
Issue
I was talking with @orbeckst https://github.com/orbeckst and @sseyler https://github.com/sseyler at Scipy and I brought up the fact that it seems for many analysis cases require the creation of a matrix that calculates some value for all possible i,j frames in a trajectory. This is a doubly nested loop. IIRC, In the case of ~MDAnalysis.analysis.psa it is a loop over all these doubly nested loops. These are problems that don't make much sense to abstract into a _single_frame call. Tools presented at scipy like numba, dask, and loopy more easily optimize a doubly nested loop and other things that can be easily interpreted into C code. Although the single_frame idea easily corresponds to the idea of a map reduce operation, I contend that it is a poor abstraction in the use case of PCA, PSA, and Diffusion Maps (and probably hurts speedup). Proposal
This is something that I want @MDAnalysis/coredevs https://github.com/orgs/MDAnalysis/teams/coredevs to discuss so we can figure out the best method. Some abstraction like:
def call(traj_index, ts):
call either single_frame() and/or all_pairs() here
pass def single_frame(traj_index, ts): pass def all_pairs(traj_index, ts): pass
ef run(self): """Perform the calculation""" logger.info("Starting preparation") self._prepare() for i, ts in enumerate( self._trajectory[self.start:self.stop:self.step]): self._frame_index = i self._ts = ts self._call
logger.info("--> Doing frame {} of {}".format(i+1, self.n_frames))
self._pm.echo(self._frame_index) logger.info("Finishing up") self._conclude() return self
— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/905, or mute the thread https://github.com/notifications/unsubscribe-auth/AI0jBwSY_7DCPY5gaBrhD8teCNG621znks5qV_A5gaJpZM4JNw_3 .
I think I screwed up the pseudocode a little bit. Sorry I'm at scipy and a little distracted. _all_pairs
should only be called once in theory, returning a matrix of all possible pairwise values over a trajectory.
“All pairs” is a common pattern, and it would be very useful to think about how to do this, or at least, if it would be useful to have a common API.
On 15 Jul, 2016, at 15:51, John Detlefs notifications@github.com wrote:
I think I screwed up the pseudocode a little bit. Sorry I'm at scipy and a little distracted. _all_pairs should only be called once in theory.
Oliver Beckstein * orbeckst@gmx.net skype: orbeckst * orbeckst@gmail.com
Many spectral methods require the creation of a matrix that is essentially data x data, scikit learn has the module ~sklearn.metrics.pairwise
in which it has the function def pairwise_distances
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L1129. This data x data matrix is very common.
PCA (in one of two algorithms) requires the generation of an all pairs matrix in which its all pairs of possible covariances instead of metrics, which further reinforces the convenience of this abstraction.
As an aside, scikit-learn uses SVD of a data matrix to generate the principal components. The proof for why SVD creates the principal components is fairly elegant!
@richardjgowers I want to address this before moving forward in PCA. I can update my schedule accordingly, but this will facilitate tremendous speedups for plenty of analysis methods.
Just some thought. I would call the method _all_frame_pairs
to be more explicit. Also, would it be possible to have analyses defining a _single_frame_pair
method, and let a generic method take care of running all the pairs? In the same idea as for _single_frame
, it would remove some responsibility to the analysis dev, and let us deal with mapping/reducing the matrix latter on. There always is the possibility to overwrite the method that build the whole matrix for strange cases.
There should be a significant speedup simply by writing an all_frame_pairs
function in cython that would not happen in a contrived single_frame
function. This should be the first goal of a speedup from this abstraction. Map-reduce parallelization is a more daunting task. In my opinion, first we should figure out how to implement a working parallelization and then figure out any appropriate abstractions to help out an analysis dev second. I think this is a situation where it is beneficial to play around with things and figure out what works well first and then clean things up later.
For the purpose of getting some work done, I could make the distance matrix in diffusion maps significantly faster right now simply by refactoring the code to use a cythonized all_frame_pairs
function. This is a significantly more straightforward task than parallelization. It also applies to quickly calculating a covariance matrix in PCA.
@richardjgowers This is what I would like to work on right now. Our PCA expert is on vacation and the speedups from cython and an all_frame_pairs
function should be noticeable. What do you think?
In my opinion, first we should figure out how to implement a working parallelization and then figure out any appropriate abstractions to help out an analysis dev second. I think this is a situation where it is beneficial to play around with things and figure out what works well first and then clean things up second.
Sounds reasonable.
@sseyler had some opinions on this last we spoke. Thoughts?
About the PCA. This proposal doesn't affect it. See my last comments in your PR.
For DMaps this could be useful. Do I assume correct the idea is to iterate over all frame pairs and create a matrix (n_frames, n_frames)? This would indeed be a benefitial addition as it makes the pairwise-frame RMSD matrix calculation needed for the DMaps easier.
I would say we rather have a frame_pair
method that will be called for all frame_pairs and the BaseAnalysis
class handles the iteration. This would look something like this.
def __init__(self, ..., framepair=False):
self._framepair = framepair
def _frame_pair(self, i, ag_i, j, ag_j):
# about the API here. This is something I would like for the normal `_single_frame` method
# as well. But we can also just create more private variables for the iteration but I find that
# a bit ugly (it's a matter of taste)
raise NotImplementedError("what should I do with frame 'i' and 'j'?")
def run(self):
if self._framepair:
self._iterate_all_frame_pairs()
else:
self._iterate_frames()
def _iterate_all_frame_pairs(self):
# not sure how this should look exactly.
atoms = self.atoms
atoms_copy = # copy atoms
trajectory_copy = atoms_copy.universe.trajectory
for i, ts_i in enumerate(self._trajectrory[...]):
# here the iteration can either go really over all pairs or we have some parameter
# that the result is symmetric and we only iterate over the upper triangle matrix.
for j, ts_j in enumerate(trajectory_copy[...]):
self._frame_pair(i, atoms, j, atoms_copy)
def _iterate_frames(self):
# current content of run
This example won't be the final API but it goes along the lines what I would think we should do when iterating over frame pairs in an analysis. I have to think about it a little bit more though.
@jdetle the only thing I'd say if you want to chase performance is you need to benchmark properly first using cprofile or something. Then once you've proved we're spending time in a particular area we can look at cythonising it.
Issue
I was talking with @orbeckst and @sseyler at Scipy and I brought up the fact that it seems for many analysis cases require the creation of a matrix that calculates some value for all possible i,j frames in a trajectory. This is a doubly nested loop. IIRC, In the case of
~MDAnalysis.analysis.psa
it is a loop over all these doubly nested loops. These are problems that don't make much sense to abstract into a_single_frame
call. Tools presented at scipy like numba, dask, and loopy more easily optimize a doubly nested loop and other things that can be easily interpreted into C code. Although thesingle_frame
idea easily corresponds to the idea of a map reduce operation, I contend that it is a poor abstraction in the use case of PCA, PSA, and Diffusion Maps (and probably hurts speedup).Proposal
This is something that I want @MDAnalysis/coredevs to discuss so we can figure out the best method. Some abstraction like: