Open ljwoods2 opened 1 month ago
Good idea (although RMSF (and anything that computes higher order moments) can be made to work with split-apply-combine, see PMDA RMSF and Nik's report referenced therein).
Oh thank you, I didn't know that existed. That would be a far better comparison.
Is your feature request related to a problem?
This idea follows up on @orbeckst's suggestion from a few months ago and a discussion with @hmacdope about making full use of dask in mda.
Current parallelism development allows splitting a trajectory into a number of parts and then combining intermediate results. However, allowing analysis classes to use dask arrays for positions, velocities, and forces across the entire trajectory can cover cases that the split-apply-combine paradigm doesn't cover (like RMSF, AFAIK) and potentially lead to greater speedup.
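For reference, the split-apply-combine pattern mentioned here can be sketched with plain NumPy. As the first comment in this thread points out, RMSF can be made to fit this pattern if each block returns its frame count, mean, and summed squared deviations. The sketch below uses the pairwise update of Chan et al. for the combine step; the function names and the synthetic `positions` array are hypothetical, and this is not necessarily how PMDA or MDAnalysis implement it.

```python
import numpy as np


def partial_moments(block):
    """Apply step: frame count, mean, and summed squared deviations for one block.

    `block` has shape (n_frames, n_atoms, 3).
    """
    n = block.shape[0]
    mean = block.mean(axis=0)
    m2 = ((block - mean) ** 2).sum(axis=0)
    return n, mean, m2


def combine(a, b):
    """Combine step: merge two partial results with the pairwise update."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def rmsf_split_apply_combine(positions, n_blocks):
    """Split the frames into blocks, reduce each block, then merge the results."""
    blocks = np.array_split(positions, n_blocks, axis=0)
    total = partial_moments(blocks[0])
    for block in blocks[1:]:
        total = combine(total, partial_moments(block))
    n, _, m2 = total
    # Per-atom RMSF: squared deviations summed over x, y, z and averaged over frames
    return np.sqrt(m2.sum(axis=1) / n)


# Synthetic stand-in for a trajectory: 100 frames, 5 atoms
rng = np.random.default_rng(0)
positions = rng.normal(size=(100, 5, 3))

rmsf_blocks = rmsf_split_apply_combine(positions, n_blocks=4)
rmsf_serial = np.sqrt(
    ((positions - positions.mean(axis=0)) ** 2).sum(axis=2).mean(axis=0)
)
```

In a real parallel run each call to `partial_moments` would go to a separate worker; here the blocks are processed in a loop only to keep the example short.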
Describe the solution you'd like
A `DaskTimeSeriesAnalysisBase` which accepts a dask timeseries as an argument. A dask timeseries is exactly the same as a reader's `timeseries` except that it is a `dask.array` rather than a `numpy.ndarray`, so it is loaded lazily into memory and a dask task graph is created and optimized by dask automatically before `.compute()` is called.
Describe alternatives you've considered
Do nothing.
Additional context
I provide an extremely minimal example in PR #4714. Here, computing RMSF with dask rather than in serial gives a speedup of ~15x.
Sample notebook available here: https://github.com/ljwoods2/mdanalysis/blob/dask-timeseries/tmp/lazyts.ipynb
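For readers without the notebook at hand, the idea of running an analysis on a lazily evaluated dask timeseries can be sketched roughly as follows. This is not the code from the PR: a synthetic NumPy array stands in for a reader's timeseries, the chunk size is an arbitrary assumption, and no `DaskTimeSeriesAnalysisBase` class is used since its interface is not specified here.

```python
import dask.array as da
import numpy as np

# Synthetic stand-in for a reader timeseries: (n_frames, n_atoms, 3)
rng = np.random.default_rng(0)
positions_np = rng.normal(size=(200, 10, 3))

# Wrap it as a dask array chunked along the frame axis (assumed chunking)
positions = da.from_array(positions_np, chunks=(50, 10, 3))

# Building the expression only constructs a task graph; nothing is computed yet
mean = positions.mean(axis=0)
rmsf_lazy = da.sqrt(((positions - mean) ** 2).sum(axis=2).mean(axis=0))

# .compute() runs the graph and returns a NumPy array of per-atom RMSF values
rmsf = rmsf_lazy.compute()
```

The analysis is written as if the whole trajectory were one array, and dask decides how to schedule the per-chunk work when `.compute()` is called.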