OceanParcels / Parcels

Main code for Parcels (Probably A Really Computationally Efficient Lagrangian Simulator)
https://www.oceanparcels.org
MIT License

Investigate using Dask for NetCDF handling #216

Closed: erikvansebille closed this issue 6 years ago

erikvansebille commented 7 years ago

Reading of hydrodynamic fields is currently handled through the Unidata netCDF Python library.

However, we may be able to gain significant speed-ups by using Dask (https://dask.pydata.org/en/latest/). This is certainly worth exploring for Parcels v1.0
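For context, a minimal sketch (not Parcels' actual code; the file and variable names are placeholders) of the eager reading pattern that the netCDF library gives:

# Illustrative sketch only: eager reading with the Unidata netCDF library.
# "hydro_fields.nc" and "uo" are placeholder names, not Parcels internals.
from netCDF4 import Dataset

with Dataset("hydro_fields.nc") as nc:
    u = nc.variables["uo"][:]  # the whole variable is read into memory at once

print(u.shape)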

guidov commented 7 years ago

Possibly consider integrating xarray (https://github.com/pydata/xarray) into parcels.

See: http://xarray.pydata.org/en/stable/dask.html
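A minimal sketch of what that integration enables, assuming placeholder file and variable names: opening a dataset with chunks yields dask-backed arrays that are only read when a computation is actually triggered.

# Illustrative sketch only: lazy, chunked reading with xarray + dask.
# "hydro_fields.nc" and "uo" are placeholder names.
import xarray as xr

ds = xr.open_dataset("hydro_fields.nc", chunks={"time": 1})  # dask-backed, nothing read yet
u = ds["uo"]

mean_u0 = u.isel(time=0).mean()  # still lazy: only builds a dask task graph
print(mean_u0.compute())         # the required chunks are read and reduced here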

erikvansebille commented 7 years ago

Fully agree, see also the very good write-up of the advantages of xarray and dask at https://pangeo-data.github.io/guidelines/

This is high on the priority list for v1.0

guidov commented 7 years ago

Wasn't aware of pangeo, thanks!

delandmeterp commented 6 years ago

Fields are now loaded using xarray. Dask should improve the way we compute specific operations on those data, but we do not currently perform such operations: at the Python level we simply load the data, which are then used for interpolation at the C level. We therefore do not need dask for the moment. However, we are working on a smarter (and cheaper) way of loading the data into memory.
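To illustrate the point with a hedged sketch (placeholder names, not Parcels internals): as long as a field has to end up as a plain numpy buffer for the C-level interpolation, materialising the values discards the benefit of the lazy dask-backed arrays.

# Illustrative sketch only (placeholder names): the lazy arrays are materialised
# as soon as a plain numpy buffer is needed, e.g. for C-level interpolation.
import numpy as np
import xarray as xr

ds = xr.open_dataset("hydro_fields.nc", chunks={"time": 1})  # lazy, dask-backed
u_lazy = ds["uo"]                               # no data read yet

u_buffer = np.ascontiguousarray(u_lazy.values)  # .values loads the full array into memory
                                                # so it can be handed to the C kernel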

willirath commented 6 years ago

On a related note, it would be interesting to try using dask.bag to call pset.execute on sub-sets of a particle set. This would allow for running multiple instances of Parcels on different threads, processes, or even on workers on different machines. If the JIT code were then to use OpenMP, Parcels could be scaled to a massively parallel / distributed system without much logistical overhead for the user.
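As an illustration of that idea, here is a rough, untested sketch of partitioning start positions over a dask bag and running pset.execute per partition; the FieldSet path, partition size, and run parameters are placeholders, and output handling is omitted.

# Rough, untested sketch of the idea above. The FieldSet path and partition size
# are placeholders; output handling is omitted.
import dask.bag as db
import numpy as np
from datetime import timedelta as delta
from parcels import AdvectionRK4, FieldSet, JITParticle, ParticleSet

def advect_subset(start_positions):
    """Build and run a small ParticleSet for one partition of start positions."""
    lons, lats = zip(*start_positions)
    fieldset = FieldSet.from_parcels("path/to/hydro_fields")  # placeholder path
    pset = ParticleSet(fieldset=fieldset, pclass=JITParticle, lon=lons, lat=lats)
    pset.execute(AdvectionRK4, runtime=delta(days=1), dt=delta(minutes=5))
    return [(p.lon, p.lat) for p in pset]

# 100 particles, split into partitions that each run as an independent Parcels instance
starts = list(zip(np.linspace(0.0, 10.0, 100), np.full(100, 45.0)))
bag = db.from_sequence(starts, partition_size=25)
final_positions = bag.map_partitions(advect_subset).compute()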

As far as I can see (after a brief test I did a while ago), there are a few things that currently prevent this from just working:

willirath commented 6 years ago

The idea is as follows:

import dask.bag as db
import numpy as np

# a simple time step
time_step = lambda p, v: p + 0.1 * v

# initial positions and a velocity vector
initial_positions = np.linspace(0, 1, 10)
velocities = np.random.randn(*initial_positions.shape)

# apply time step
next_positions = time_step(initial_positions, velocities)

# create dask bags of positions and velocities
initial_positions_bag = db.from_sequence(initial_positions, partition_size=2)
velocities_bag = db.from_sequence(velocities, partition_size=2)

# apply time step using dask bag (no computation yet)
next_positions_bag = db.map(time_step, initial_positions_bag, velocities_bag)

# do the actual computation (returns a plain list)
next_positions_from_bag = next_positions_bag.compute()

for npos, nposb in zip(next_positions, next_positions_from_bag):
    print(npos, nposb)

Output:

0.012643000434015473 0.012643000434015473
0.26248329743689197 0.26248329743689197
0.5195525904069955 0.5195525904069955
0.38539757889911525 0.38539757889911525
0.45978965277591133 0.45978965277591133
0.630505628317729 0.630505628317729
0.7183645314996056 0.7183645314996056
0.7159878796528008 0.7159878796528008
0.9169828337328265 0.9169828337328265
0.7585102369250991 0.7585102369250991

The next_positions_bag.compute() call will use whichever dask scheduler is currently configured: a single thread, multiple threads, multiple processes, or a distributed cluster.
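For example, with recent dask versions the scheduler can be chosen explicitly (reusing next_positions_bag from the snippet above; the cluster address is a placeholder):

# Choosing the scheduler that .compute() will use (recent dask versions).
import dask

with dask.config.set(scheduler="processes"):  # multi-process scheduler for this computation
    results = next_positions_bag.compute()

# or connect to a distributed cluster (address is a placeholder):
# from dask.distributed import Client
# client = Client("tcp://scheduler-address:8786")
# results = next_positions_bag.compute()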

willirath commented 6 years ago

See https://gist.github.com/willirath/fbc643866bf7f5f77d5becd3b13a01b1 for a first working version of the Brownian motion example using dask.