Closed nawtrey closed 5 years ago
The current version works by splitting up the input atom group into subgroups, calculating the RMSF for the atoms in the subgroups in parallel, then combining the results of the subgroups to produce one list of RMSF values for the original input atom group. After benchmarking the current version, there appears to be a positive correlation between the number of blocks (subgroups of atoms) and the time it takes to calculate the RMSF of all the atoms. This was confirmed by running the MDAnalysis RMSF function over a 90ns trajectory for an atom group size of 1, 10, 20, 40, 80, 160, 320, 500, 1000, 1500, and 2000 atoms. The time difference between calculating the RMSF for 1 versus 2000 atoms (of the same length trajectory) is negligible. More specifically, when ran on my workstation, there is only a 2.76% increase in the time required to calculate the RMSF for 2000 atoms compared to 1, utilizing only 1 core. This illustrates that splitting the given (ts x N x 3) array up over N atoms will not give any significant increase in performance due to the small quantity of time spent doing the calculations. The current version of pRMSF actually takes longer than MDA RMSF (especially for higher core numbers) because Dask has to setup the graph and scheduler, which require additional time.
Moving forward, this needs to be adapted over to a timestep-splitting parallelization. For the record this has been the goal all along, but the logistics of calculating an RMSF value for a time-split trajectory are a bit hairy. The original idea was to use the current single-timestep function in MDA (written via the algorithm from Welford et al.) and simply apply it to M sub-trajectories in parallel, and find a method to combine these partial-RMSF’s. Finding a method for combining these has proven difficult. Mathematically speaking, there is no difference between calculating the RMSF over the same trajectory for different atoms, and calculating the RMSF over different partial-trajectories for the same atom; these values are independent of each other.
With that said, if we make the assumption that a relationship exists, executing that relationship will be a challenge. The current model uses a recursive single-frame calculation that is iterated through a trajectory. This recursive calculation relies on having access to the previously calculated partial-mean and corrected sum of squares, and it isn’t until the final time step that the actual time-averaged position (and RMSF) is calculated. If you split a trajectory into sub-trajectories the first sub-trajectory will be calculated exactly the same, but all subsequent sub-trajectories will not account for the previous ones. The MDA RMSF algorithm implements an index-weighted partial-mean, where the subsequent sub-trajectories would either need access to the previous sub-trajectory’s final partial-mean, or to the positions of the atom for all previous time steps. The first option relies on the sub-trajectories finishing in order (the opposite of what we want) while the second option relies on loading more than one time step (up to the last time step), which would be very time-intensive considering I/O processes are so expensive already.
Instead of looking directly at RMSF, I tried to find a relationship between the corrected sums of squares for sub-trajectories. If a relationship could be developed for these partial-corrected sums of squares, one could calculate the total RMSF for a given trajectory from a group of sub-trajectories by putting the last few operations in the conclude function. The corrected sum of squares is the sum (over all time steps) of the square of the difference between the position of the atom (at each time step) and its time-averaged position. It cannot be calculated in pieces because of this difference. If sub-trajectories are analyzed in parallel, the first point at which the time-averaged position will be available is in the conclude function. This means all but one of the sub-trajectories would be subtracting out an inaccurate partial-mean through all of the corrected sum of squares calculations. Thus, this method suffers the same pitfalls as mentioned above.
I am currently looking at splitting the trajectory such that each block loads one frame each. This way the function would load a single frame into each block, calculate the squares of the position, send the outputs of all blocks to the conclude function, calculate the partial-mean and partial-corrected sum of squares, use the recursive method from MDAnalysis to iterate through all sets of blocks (where the partial-mean would be updated every time the conclude function was called), and lastly calculate the RMSF. This seems to be the only viable option at this time, though it has yet to be fully written and tested.
As discussed, try to make the parallel algorithm from one of the following work:
I would close this PR and open a fresh one.
Fixes #90
Changes made in this Pull Request:
PR Checklist