Improving Parcels efficiency

delandmeterp commented 5 years ago

The next main goal in Parcels development is to obtain an efficient parallel version. To do so, let's present here what is the code structure, what are its strengths and weaknesses, how we aim to tackle the different challenges. The point of this issue is to open the discussion such that we use the latest available libraries and best Python practice in Parcels development. This discussion follows on previous parcels developments (PR #265) and discussions (issue #17).

particleSet.execute() structure

Main loop is in Python

while t < tend:
tmp_end = fieldset.computeTimeChunk()  # loading data for the current time
c_loop(tmp_end)  
particleSet.updating()  # adding particles
particleSet.write()

c_loop is implemented in C (here in Python pseudo code)
```
for p in pset:
while p.t < tmp_end:
    kernel(p)
```
Kernels are run in C. They consist of:
- interpolating fields
- updating particle attributes (lon, lat, depth, time, ...)

Potential bottlenecks

pset exporting (see PR #497)
fieldset loading:
- Passing fields to C requires setting all xarray to np.array, preventing most of lazy loading features of xarray, which are xarray's strength
field interpolation: (see details in GMDD paper https://www.geosci-model-dev-discuss.net/gmd-2018-339/)
- index search
- scheme computation
C/Python communication

Parcels profile

To understand better Parcels general profile, different applications will be profiled. This can be done using cprofile from Python:

python -m cProfile -o log.profile run.py
gprof2dot -f pstats log.profile -o profile.dot
dot -T pdf -O profile.dot

The first profile is a 2-month simulation of floating MP in the North Sea:

Particle set: 242 particles released every two days for a month (but all 7502 particles are already created since the beginning)
Field set: NEMO global 1/12 degree, only surface data. Time resolution of 5 days. No indices used
No particle set export

The results are available here: nemo_log0083_singlePrec.pdf

Questions ?

Would it be easy to get the interpolation schemes (in C), to ask from Python's xarray only the few values they need? Basically, to avoid xarray conversion to np.array

Can we have something like:

ds = xr.open_dataset(filename)  # open the xarray dataset
npArray = ds.u.computeChunk(ids)  # load a np.array for some indices (ids)
npArray = ds.u.computeChunk(ids2)  # load a np.array for indices ids2, but avoiding re-loading data from previous line

But this still implies to load contiguous data. So if a particle is located at 170° and another one at -170°, the full domain -170:170 should be loaded? This would be annoying.

Parallel development

OPEN-MP option

This option was presented in PR #265. In the particle loop, the particles are distributed on different threads, sharing the same memory. Fieldset is shared by all threads.

dask option

The Python way to automatically distributes the work between the cores and nodes. See this example from @willirath. https://github.com/willirath/pangeo_parcels

But how to do it efficiently for large input data applications? Can we use dask with some control on the particle distribution? If so, is it a good idea? How to efficiently load the input data?

MPI option

This option requires the more development, but it enables to have a large control over the load balance, how to distribute the particles on the differents processors (see simple proof of concept here https://github.com/OceanParcels/bare-bones-parcels) The idea is the following:

We split the particle set into subsets such that particles are contiguously located in each subset. Each processor processes one subset.
We load the data only in the neigbourhood of the particles (how to do it efficiently? We want a fieldset with dynamic size, taking into account Parcels structure: fieldset are transformed to np.array before passed to C. This problem needs to be solved independently the option chosen for the // implementation.)
Every X time steps, we rebalance the subsets to keep particles efficiently distributed

willirath commented 5 years ago

A note on repeatability and pseudo-random numbers

(I've brought this up in #265 as well.): Usually, it is recommended to conduct all experiments that include pseudo-randomnes in a way that ensures repeatability, by specifying the initial state of the random number generator that is used. (See, e.g., https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285#s7.) To be fully deterministic, both, the initial state of the pseudo-random-number generator and the sequence of computation, need to be fully specified. The latter is often sacrificed for optimizing occupancy of the allocated resources. OpenMP's dynamic scheduling does this. And dask.array.random also is not deterministic (at least if the cluster differs between executions).

I'm (slowly) giving up my resistance against this trend. But I do think, that as soon as one enters the terrain of non-determinism, this has to be communicated very clearly.

Load balancing without tiling

Disclaimer: I really dislike the idea of adding the complexity of MPI, because the cost on the development side might be enormous, while there would only be benefit for users that have access to a (diminishing) subset of the existing scientific-computing infrastructure. Wall time from idea to final result is hardly ever determined by raw computational efficiency.

There's constraints on the Physics and Math side of Parcels that we can leverage to balance load without falling back to explicit domain decomposition: For any given distribution of particles, we can estimate the largest possible deformation after a given number of time steps. When parallelizing Parcels by distributing .execute() for segments of a ParticleSet, this can be used to construct heuristics for how to segment (or cluser) the particles rather than the physical domain to balance IO etc.

OpenMP

This is a low-hanging fruit. #265 looks very promising. And it also is largely orthogonal to all other efforts towards distribution of Parcels experiments. When distributed with Dask, we can still use intenal parallelization of Parcels experiments with OpenMP. The same is true for MPI, where it would be possible to have hybrid jobs with MPI across nodes and OpenMP on shared memory portions of the cluster.

Handling Scipy mode as first-class citizen

(related: #556)

I think it is very important to make it as easy as possible to not use the built-in JIT functionality of Parcels. There's two reasons for this:

Having a fully equivalent Python-only implementation helps developing and debugging.
Separation of concerns: The task of optimizing numerical Python for different environments is tackled in many different places. There's Numpy swap-ins like bottleneck or bohrium, and other existing JIT solutions like numba that are very promising for efficiently using new architectures like GPUs.

Dask Make it easy to distribute Parcels experiments to arbitrary resources

Using Dask for distributing Parcels to arbitray resources is very promising. I would, however, strongly argue against adding any explicity Dask dependency into Parcels and instead prefer considering Dask support as something that will also pave the way to other approaches to parallelization.

delandmeterp commented 5 years ago

Field chunking

Field should be cut into chunks, such that don't load the fully the data.

How to do that? Here comes one possibility (to be discussed)

We get rid of Field.data. We still have the full Field.grid such that search_indices find the indices (xi, yi, zi) Then according to the indices, we load the chunk ci which is located in Field.data_chunk, which is a list filled with None and np.arrays. ci can be determined quite easily, without interpolating a full field that provides ci for every single indices set. => This requires that the coordinates are fully loaded. Is that a problem? Should we implement something cheaper?

Is there some better solution? Saying that, I let it here for some more discussion and start with OpenMP

willirath commented 5 years ago

I think at that point, before implementing anything just for parcels, we should investigate how to leverage xarrays lazyness and chunking (via Dask).

delandmeterp commented 5 years ago

About OpenMP and randomness:

I started from #265 without all the random development, that works (except for the random issues highlighted by @Simnator101 and @willirath.

If I understood correctly: What we want is (1) a code that is reproducible, (2) not the same random sequences for each thread?

I was thinking that we can simply seed the random generator for each particle in the loop, as srand(init_seed + p.id + pset.exec_call * len(pset)

But I couldn't try this, since actually, I'm still struggling with point (2). I do not have the same random sequence for each thread as I was expecting following @Simnator101 comment (and also here)

In this simple code, I would like to see that every thread generates the same sequence, but they don't. The sequence is global, and randomness comes from which thread calls the sequence first. Wasn't it supposed to be the opposite?

#include <stdlib.h>
#include <time.h>
#include <omp.h>

static inline float parcels_random()
{
  return (float)rand()/(float)(RAND_MAX);
}

static void parcels_seed(int seed)
{
  srand(seed);
}

int main()
{
  int p, npart = 10;
  float* pset = malloc(sizeof(float)*npart);
  int seed = 1234;
  parcels_seed(seed);

  #pragma omp parallel for private(p)
  for(p=0; p<npart; ++p){
    int local_seed = seed+p;
    //srand(local_seed);
    pset[p] = parcels_random();
    int pid = omp_get_thread_num();
    int pnum = omp_get_num_threads();
    printf("p[%d] random generator %1.8f (seed:%d) [proc %d/%d]\n", p, pset[p], local_seed, pid, pnum);
  }

  printf("\nPSET print:\n");
  for(p=0; p<npart; ++p){
    printf("p[%d]: %1.8f\n", p, pset[p]);
  }

  return 0;
}

delandmeterp commented 5 years ago

Also, this link https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285#s7 asks to have reproducible results. Is it worth to work to have reproducible result independently of OMP_NUM_THREADS or simply reproducible results depending on OMP_NUM_THREADS?

willirath commented 5 years ago

Ideally ...

Ideally, we'd have deterministic results irrespective of the details of the parallelization. All that would matter in this ideal case would be velocity fields, initial positions, length of time step, and the initial state of the random number generator.

There are ways of doing this, but they might be expensive and / or complicated and / or inefficient:

Have one RNG per particle. This is the simplest (as opposed to complex) way of doing it, but it's also very expensive. The Mersenne Twister PRNG has a state worth 624 32-bit integers (which is 2.5 kB). So an experiment with 10e6 particles would carry around an overhead of 2.5 GB just for keeping track of the random states.
Have a global RNG, ensure deterministic order when updating particles. I can see how this would work, but doing this right would be a great (and long) project to learn more about async programming, concurrency, et al.
When paralellizing the for p in pset loop across M threads, we could maintain a separate RNG state per thread and ensure static scheduling. This should be easy to do™. The drawbacks I see are (a) the fact that this introduces a dependency of the runtime details, and (b), the fact that static scheduling might give away some performance when the workloads for the particles are different. (I think that, e.g., particles in a more turbulent region are more expensive to evolve with adaptive time stepping?)

Note that only 1. and 2. above would ensure deterministic results irrespective of the details of the parallelization.

Realistically ...

Realistically, I'd go for 3. above. My expectation as a user would be that an immediate re-run of an experiment should yield the same results.

Further down the road

Note again that thinking hard about 2 and its implications for the design of Parcels (how to achieve real concurrency?) would be a very interesting exercise. :)

willirath commented 5 years ago

In this simple code, I would like to see that every thread generates the same sequence, but they don't. The sequence is global, and randomness comes from which thread calls the sequence first. Wasn't it supposed to be the opposite?

No, I think this code does what it should (at least with the commented //srand(local_seed);): It has a global RNG state and the order in which particles draw numbers from there is not deterministic. Having the same sequence for each thread would be achieved by using a private RNG and seeding it with the same number.

delandmeterp commented 5 years ago

If I seed srand(1234); before each call, I observe such output (not always, but sometimes):

PSET print:
p[0]: 0.00965774
p[1]: 0.00965774
p[2]: 0.00965774
p[3]: 0.00965774
p[4]: 0.00965774
p[5]: 0.00965774
p[6]: 0.31763056
p[7]: 0.00965774
p[8]: 0.00965774
p[9]: 0.00965774

So the random generator does not seem to be thread safe

willirath commented 5 years ago

There's rand_r() that keeps track of its own state: See https://linux.die.net/man/3/srand

Here's an example that seems to do what we want: https://stackoverflow.com/a/46763252

delandmeterp commented 5 years ago

Yes, I've seen it. Basically, rand_r(*seed) returns the random number and updates the seed. But it was annoying, since we don't want to carry this seed everywhere.

delandmeterp commented 5 years ago

Or we keep it global, but private in the openmp loop? I'll give it a try

willirath commented 5 years ago

In the stackoverflow example they unsigned int seed = 42 + omp_get_thread_num(); and rand_r(&seed) With a private seed this should work.

delandmeterp commented 5 years ago

Yes, it would work, but this would mean that we need to pass the seed in the kernels then. But it can as well be done. Other problem, I've read somewhere yesterday that rand_r was less good (like shorter sequence), but there are other rand_r types in case we are not happy with it.

delandmeterp commented 5 years ago

https://linux.die.net/man/3/rand_r

The value pointed to by the seedp argument of rand_r() provides only a very small amount of state, so this function will be a weak pseudo-random generator. Try drand48_r(3) instead.

willirath commented 5 years ago

Eventually, we'd want to go for a good RNG anyway. There's the GSL ones which are nice.

willirath commented 5 years ago

but this would mean that we need to pass the seed in the kernels then

Not necessarilty. My C is very rusty, but isn't there a way of specifying global variales that are thread private?

delandmeterp commented 5 years ago

Or we keep it global, but private in the openmp loop? I'll give it a try

Yes, we can

delandmeterp commented 5 years ago

This simple c code works. random_r.c.zip

Now trying to do better with gsl, but installation is not straigthforward on osx (conda's gcc does not work, I was using a different one, but we want to use conda for dependencies and c librariries. Checking how to do it)

willirath commented 5 years ago

Are the OSX issues related to the wrong linker being used? There's a system path hard coded in gcc I think which messed things up when implicitly linking stuff, if I remember correctly.

delandmeterp commented 5 years ago

Yes, basically if I include this at compilation /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/

using this gcc: /anaconda3/envs/py3_parcels_openmp/bin/gcc

before executing, I export also the dynamic library path: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/anaconda3/envs/py3_parcels_openmp/lib/

And all works fine. That's the manual way. Now checking how to do it better (still using conda gcc)

philippemiron commented 4 years ago

I think at that point, before implementing anything just for parcels, we should investigate how to leverage xarrays lazyness and chunking (via Dask).

Hi everyone,

I've been doing some very simple test with xarray.interp() to see if it could replace RegularGridInterpolator for some very simple particle advection using scipy.integrate.solve_ivp. My velocity field is pretty small so it can easily fit in memory (Dimensions: lat: 220 lon: 380 t: 365).

Here, I interpolated one point (lon_i, lat_i) in the middle of my domain at a random time t_i.

where fu_w (fv_w) is defined using RegularGridInterpolator((lon, lat, t), u, method="linear", bounds_error=False).

I'm not sure what is the most efficient way to do this but it will for sure involve loading chunks of the data...

OceanParcels / Parcels