MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Expose Dask "lazy timeseries" from compatible readers for full parallelism in analysis #4713

Open ljwoods2 opened 1 month ago

ljwoods2 commented 1 month ago

Is your feature request related to a problem?

This idea follows up on @orbeckst's suggestion from a few months ago and a discussion with @hmacdope about making full use of dask in mda.

Current parallelism development allows splitting a trajectory into a number of parts and then combining intermediate results. However, letting analysis classes operate on dask arrays of positions, velocities, and forces across the entire trajectory could handle cases that the split-apply-combine paradigm doesn't cover (like RMSF, AFAIK) and potentially lead to greater speedup.

Describe the solution you'd like

A DaskTimeSeriesAnalysisBase class which accepts a dasktimeseries as an argument. A dask timeseries is the same as a reader's timeseries except that it is a dask.array rather than a numpy.ndarray, so the data is loaded lazily: operations on it build up a dask task graph, which dask optimizes and executes automatically when .compute() is called.
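To illustrate the idea, here is a minimal sketch of an RMSF computed on a dask array. It does not use the proposed DaskTimeSeriesAnalysisBase or any reader: the `positions` array is synthetic data standing in for a reader's timeseries of shape (n_frames, n_atoms, 3), and the chunk size is an arbitrary assumption.

```python
import numpy as np
import dask.array as da

# Synthetic stand-in for a reader's timeseries: (n_frames, n_atoms, 3).
# In the proposal this would come lazily from the trajectory reader.
rng = np.random.default_rng(0)
n_frames, n_atoms = 1000, 50
positions = rng.normal(size=(n_frames, n_atoms, 3))

# Chunk along the frame axis so each task handles a block of frames.
lazy_positions = da.from_array(positions, chunks=(100, n_atoms, 3))

# Building these expressions only records tasks; nothing is evaluated yet.
mean_positions = lazy_positions.mean(axis=0)
sq_dev = ((lazy_positions - mean_positions) ** 2).sum(axis=2)
lazy_rmsf = da.sqrt(sq_dev.mean(axis=0))

# .compute() optimizes the task graph and executes it.
rmsf = lazy_rmsf.compute()

# Eager NumPy reference for comparison.
reference = np.sqrt(
    ((positions - positions.mean(axis=0)) ** 2).sum(axis=2).mean(axis=0)
)
```

Because the whole calculation is expressed on the lazy array, dask decides how to schedule the per-chunk work; the analysis code itself never loops over frames.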

Describe alternatives you've considered

Do nothing.

Additional context

I provide an extremely minimal example in PR #4714. There, using dask to compute the RMSF rather than running in serial leads to a speedup of ~15x.

Sample notebook available here: https://github.com/ljwoods2/mdanalysis/blob/dask-timeseries/tmp/lazyts.ipynb

orbeckst commented 1 month ago

Good idea, although RMSF (and anything that computes higher-order moments) can be made to work with split-apply-combine; see PMDA RMSF and Nik's report referenced therein.
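As a rough sketch of how second moments can fit split-apply-combine (this is not the PMDA code, just the standard pairwise update for a mean and a sum of squared deviations): each block reports its frame count, mean positions, and summed squared deviations, and two such summaries merge exactly. The function names `block_summary`, `combine`, and `rmsf_from_blocks` are hypothetical, and the trajectory is synthetic data.

```python
import functools
import numpy as np

def block_summary(block):
    """Summarize one block of shape (n_frames, n_atoms, 3) as (n, mean, M2)."""
    n = block.shape[0]
    mean = block.mean(axis=0)
    m2 = ((block - mean) ** 2).sum(axis=0)  # summed squared deviations
    return n, mean, m2

def combine(a, b):
    """Merge two block summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

def rmsf_from_blocks(blocks):
    """Apply per block, combine pairwise, then finish with the RMSF per atom."""
    n, _, m2 = functools.reduce(combine, (block_summary(b) for b in blocks))
    return np.sqrt(m2.sum(axis=1) / n)

# Synthetic trajectory split into unequal blocks of frames.
rng = np.random.default_rng(1)
trajectory = rng.normal(size=(103, 20, 3))
blocks = np.array_split(trajectory, 4, axis=0)
rmsf_blocks = rmsf_from_blocks(blocks)

# Single-pass reference over the whole trajectory.
rmsf_full = np.sqrt(
    ((trajectory - trajectory.mean(axis=0)) ** 2).sum(axis=2).mean(axis=0)
)
```

The block summaries are independent of each other, so the "apply" step can run in parallel and only the small (n, mean, M2) tuples need to be combined.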

ljwoods2 commented 1 month ago

Oh, thank you, I didn't know that existed. That would be a far better comparison.