(I've brought this up in #265 as well.) Usually, it is recommended to conduct all experiments that include pseudo-randomness in a way that ensures repeatability, by specifying the initial state of the random number generator that is used (see, e.g., https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285#s7). To be fully deterministic, both the initial state of the pseudo-random-number generator and the sequence of computation need to be fully specified. The latter is often sacrificed for optimizing occupancy of the allocated resources: OpenMP's dynamic scheduling does this, and dask.array.random is also not deterministic (at least if the cluster differs between executions).
I'm (slowly) giving up my resistance against this trend. But I do think that, as soon as one enters the terrain of non-determinism, this has to be communicated very clearly.
Disclaimer: I really dislike the idea of adding the complexity of MPI, because the cost on the development side might be enormous, while only users with access to a (diminishing) subset of the existing scientific-computing infrastructure would benefit. Wall time from idea to final result is hardly ever determined by raw computational efficiency.
There are constraints on the physics and maths side of Parcels that we can leverage to balance load without falling back to explicit domain decomposition: for any given distribution of particles, we can estimate the largest possible deformation after a given number of time steps. When parallelizing Parcels by distributing .execute() for segments of a ParticleSet, this can be used to construct heuristics for how to segment (or cluster) the particles, rather than the physical domain, to balance IO etc. One possible concrete form is sketched below.
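Purely as a hypothetical sketch (the tile size, the displacement bound u_max * dt * n_steps, and all names are illustrative assumptions, not existing Parcels API):

import numpy as np

def segment_particles(lons, lats, tile_deg, u_max, dt, n_steps):
    # Upper bound on how far any particle can move in the next n_steps
    # (u_max in degrees per second here, for simplicity).
    pad = u_max * dt * n_steps
    # Bin particles into spatial tiles of tile_deg degrees.
    tx = np.floor(lons / tile_deg).astype(int)
    ty = np.floor(lats / tile_deg).astype(int)
    segments = {}
    for i, key in enumerate(zip(tx, ty)):
        segments.setdefault(key, []).append(i)
    # The padded bounding box of each segment tells us which part of the
    # field that segment can possibly touch, so IO can be planned per segment.
    boxes = {}
    for key, idx in segments.items():
        lo, la = lons[idx], lats[idx]
        boxes[key] = (lo.min() - pad, lo.max() + pad,
                      la.min() - pad, la.max() + pad)
    return segments, boxes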
This is low-hanging fruit. #265 looks very promising, and it is also largely orthogonal to all other efforts towards distribution of Parcels experiments. When distributing with Dask, we can still use internal parallelization of Parcels experiments with OpenMP. The same is true for MPI, where it would be possible to have hybrid jobs with MPI across nodes and OpenMP on shared-memory portions of the cluster.
(related: #556)
I think it is very important to make it as easy as possible to not use the built-in JIT functionality of Parcels. There are two reasons for this:
Having a fully equivalent Python-only implementation helps with development and debugging.
Separation of concerns: The task of optimizing numerical Python for different environments is tackled in many different places. There are Numpy swap-ins like bottleneck or bohrium, and other existing JIT solutions like numba that are very promising for efficiently using new architectures like GPUs.
Using Dask for distributing Parcels to arbitrary resources is very promising. I would, however, strongly argue against adding any explicit Dask dependency to Parcels, and instead prefer considering Dask support as something that will also pave the way to other approaches to parallelization. A sketch of what that could look like follows below.
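For illustration, a common way to offer Dask support without a hard dependency is a soft import; a minimal sketch (not current Parcels code):

try:
    import dask.array as da
    HAVE_DASK = True
except ImportError:
    HAVE_DASK = False

def as_lazy_or_eager(arr, chunks=None):
    # Return a lazy Dask array if Dask is installed and chunking was
    # requested; otherwise fall back to the plain in-memory array.
    if HAVE_DASK and chunks is not None:
        return da.from_array(arr, chunks=chunks)
    return arr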
Field should be cut into chunks, such that we don't load the full data.
How to do that? Here is one possibility (to be discussed):
We get rid of Field.data.
We still have the full Field.grid, such that search_indices finds the indices (xi, yi, zi).
Then, according to the indices, we load the chunk ci, which is located in Field.data_chunk, a list filled with None and np.arrays.
ci can be determined quite easily, without interpolating a full field that provides ci for every single index set. (A sketch of this lazy lookup follows below.)
=> This requires that the coordinates are fully loaded. Is that a problem? Should we implement something cheaper?
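To make the lookup itself concrete, here is a minimal, hypothetical sketch (2-D for brevity; the class, chunk_shape, and load_chunk are illustrative names, not Parcels API):

import numpy as np

class LazyField:
    def __init__(self, shape, chunk_shape):
        self.shape = shape                    # full (ydim, xdim) of the field
        self.chunk_shape = chunk_shape        # (ydim, xdim) of one chunk
        ny = -(-shape[0] // chunk_shape[0])   # ceil division
        nx = -(-shape[1] // chunk_shape[1])
        self.nchunks = (ny, nx)
        # the proposed Field.data_chunk: None = not loaded, np.ndarray = loaded
        self.data_chunk = [None] * (ny * nx)

    def chunk_index(self, yi, xi):
        # map grid indices from search_indices to a flat chunk index ci
        cy, cx = yi // self.chunk_shape[0], xi // self.chunk_shape[1]
        return cy * self.nchunks[1] + cx

    def value(self, yi, xi):
        ci = self.chunk_index(yi, xi)
        if self.data_chunk[ci] is None:       # load on first access only
            self.data_chunk[ci] = self.load_chunk(ci)
        cy, cx = divmod(ci, self.nchunks[1])
        return self.data_chunk[ci][yi - cy * self.chunk_shape[0],
                                   xi - cx * self.chunk_shape[1]]

    def load_chunk(self, ci):
        # stand-in for reading one chunk from disk (e.g. via netCDF4/xarray)
        return np.zeros(self.chunk_shape)

f = LazyField(shape=(220, 380), chunk_shape=(64, 64))
print(f.value(100, 200))   # triggers loading of exactly one chunk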
Is there a better solution? That said, I'll leave this here for some more discussion, and start with OpenMP.
I think at that point, before implementing anything just for Parcels, we should investigate how to leverage xarray's laziness and chunking (via Dask).
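Something along these lines, for instance (the file name and chunk sizes are placeholders):

import xarray as xr

# Open the velocity field lazily, with Dask-backed chunks: no data is read
# at this point, only metadata.
ds = xr.open_dataset("nemo_u.nc", chunks={"time": 1, "lat": 256, "lon": 256})
u = ds["u"]

# Only the chunks overlapping the requested window are read from disk.
patch = u.isel(time=0, lat=slice(100, 110), lon=slice(200, 210)).load()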
About OpenMP and randomness:
I started from #265 without all the random development; that works (except for the random issues highlighted by @Simnator101 and @willirath).
If I understood correctly, what we want is (1) code that is reproducible, and (2) not the same random sequence for each thread?
I was thinking that we can simply seed the random generator for each particle in the loop, as
srand(init_seed + p.id + pset.exec_call * len(pset))
But I couldn't try this, since I'm actually still struggling with point (2). I do not have the same random sequence for each thread, as I was expecting following @Simnator101's comment (and also here).
In this simple code, I would like to see that every thread generates the same sequence, but they don't. The sequence is global, and randomness comes from which thread calls the sequence first. Wasn't it supposed to be the opposite?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Draw a uniform float in [0, 1] from the global C library RNG. */
static inline float parcels_random()
{
  return (float)rand()/(float)(RAND_MAX);
}

static void parcels_seed(int seed)
{
  srand(seed);
}

int main()
{
  int p, npart = 10;
  float* pset = malloc(sizeof(float)*npart);
  int seed = 1234;
  parcels_seed(seed);
  #pragma omp parallel for private(p)
  for(p=0; p<npart; ++p){
    int local_seed = seed+p;
    //srand(local_seed);
    pset[p] = parcels_random();
    int pid = omp_get_thread_num();
    int pnum = omp_get_num_threads();
    printf("p[%d] random generator %1.8f (seed:%d) [proc %d/%d]\n", p, pset[p], local_seed, pid, pnum);
  }
  printf("\nPSET print:\n");
  for(p=0; p<npart; ++p){
    printf("p[%d]: %1.8f\n", p, pset[p]);
  }
  free(pset);
  return 0;
}
Also, this link (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285#s7) asks for reproducible results. Is it worth working towards results that are reproducible independently of OMP_NUM_THREADS, or simply reproducible for a fixed OMP_NUM_THREADS?
Ideally, we'd have deterministic results irrespective of the details of the parallelization. All that would matter in this ideal case would be velocity fields, initial positions, length of time step, and the initial state of the random number generator.
There are ways of doing this, but they might be expensive and / or complicated and / or inefficient:
Have one RNG per particle. This is the simplest way of doing it, but it's also very expensive. The Mersenne Twister PRNG has a state worth 624 32-bit integers (which is about 2.5 kB). So an experiment with 10^6 particles would carry around an overhead of 2.5 GB just for keeping track of the random states.
Have a global RNG, and ensure a deterministic order when updating particles. I can see how this would work, but doing this right would be a great (and long) project to learn more about async programming, concurrency, etc.
When parallelizing the for p in pset loop across M threads, we could maintain a separate RNG state per thread and ensure static scheduling. This should be easy to do™. The drawbacks I see are (a) that this introduces a dependency on the runtime details, and (b) that static scheduling might give away some performance when the workloads for the particles differ. (I think that, e.g., particles in a more turbulent region are more expensive to evolve with adaptive time stepping?)
Note that only 1. and 2. above would ensure deterministic results irrespective of the details of the parallelization.
Realistically, I'd go for 3. above. My expectation as a user would be that an immediate re-run of an experiment should yield the same results.
Note again that thinking hard about 2 and its implications for the design of Parcels (how to achieve real concurrency?) would be a very interesting exercise. :)
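For what it's worth: for the Python-only execution path, option 1 has become fairly cheap. A sketch with numpy's SeedSequence (assuming numpy >= 1.17; the PCG64 state behind default_rng is a few dozen bytes, not the 2.5 kB of the Mersenne Twister):

import numpy as np

# One independent, deterministically seeded stream per particle.
n_particles = 10
root = np.random.SeedSequence(1234)
streams = [np.random.default_rng(s) for s in root.spawn(n_particles)]

# Each particle draws only from its own stream, so the result does not
# depend on the order in which particles are updated (threads, chunks, ...).
samples = np.array([rng.uniform() for rng in streams])
print(samples)   # identical on every re-run with the same root seed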
In this simple code, I would like to see that every thread generates the same sequence, but they don't. The sequence is global, and randomness comes from which thread calls the sequence first. Wasn't it supposed to be the opposite?
No, I think this code does what it should (at least with //srand(local_seed); commented out): it has a global RNG state, and the order in which particles draw numbers from it is not deterministic. Having the same sequence for each thread would be achieved by using a private RNG and seeding it with the same number.
If I seed srand(1234); before each call, I observe output like this (not always, but sometimes):
PSET print:
p[0]: 0.00965774
p[1]: 0.00965774
p[2]: 0.00965774
p[3]: 0.00965774
p[4]: 0.00965774
p[5]: 0.00965774
p[6]: 0.31763056
p[7]: 0.00965774
p[8]: 0.00965774
p[9]: 0.00965774
So the random generator does not seem to be thread-safe.
There's rand_r(), which keeps track of its own state: see https://linux.die.net/man/3/srand
Here's an example that seems to do what we want: https://stackoverflow.com/a/46763252
Yes, I've seen it. Basically, rand_r(&seed) returns the random number and updates the seed. But it was annoying, since we don't want to carry this seed everywhere. Or do we keep it global, but private in the OpenMP loop? I'll give it a try.
In the stackoverflow example they use unsigned int seed = 42 + omp_get_thread_num(); and rand_r(&seed). With a private seed this should work.
Yes, it would work, but this would mean that we need to pass the seed into the kernels. That can be done, though. One other problem: I've read somewhere yesterday that rand_r is less good (e.g. a shorter sequence), but there are other variants in case we are not happy with it.
https://linux.die.net/man/3/rand_r
The value pointed to by the seedp argument of rand_r() provides only a very small amount of state, so this function will be a weak pseudo-random generator. Try drand48_r(3) instead.
Eventually, we'd want to go for a good RNG anyway. There are the GSL ones, which are nice.
but this would mean that we need to pass the seed in the kernels then
Not necessarily. My C is very rusty, but isn't there a way of specifying global variables that are thread-private?
Or we keep it global, but private in the openmp loop? I'll give it a try
Yes, we can.
This simple C code works: random_r.c.zip
Now trying to do better with GSL, but the installation is not straightforward on OSX (conda's gcc does not work; I was using a different one, but we want to use conda for dependencies and C libraries. Checking how to do it).
Are the OSX issues related to the wrong linker being used? There's a system path hard-coded in gcc, I think, which messed things up when implicitly linking stuff, if I remember correctly.
Yes. Basically, I include this at compilation: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/, using this gcc: /anaconda3/envs/py3_parcels_openmp/bin/gcc. Before executing, I also export the dynamic library path:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/anaconda3/envs/py3_parcels_openmp/lib/
And all works fine. That's the manual way. Now checking how to do it better (still using conda's gcc).
I think at that point, before implementing anything just for parcels, we should investigate how to leverage xarrays lazyness and chunking (via Dask).
Hi everyone,
I've been doing some very simple tests with xarray.interp() to see if it could replace RegularGridInterpolator for some very simple particle advection using scipy.integrate.solve_ivp. My velocity field is pretty small, so it easily fits in memory (dimensions: lat: 220, lon: 380, t: 365).
Here, I interpolated one point (lon_i, lat_i) in the middle of my domain at a random time t_i, where fu_w (and fv_w) is defined using RegularGridInterpolator((lon, lat, t), u, method="linear", bounds_error=False).
I'm not sure what the most efficient way to do this is, but it will for sure involve loading chunks of the data...
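For reference, here is a self-contained version of the comparison, with synthetic data standing in for my real velocity field (the domain and values are made up; the dimensions match the ones quoted above):

import numpy as np
import xarray as xr
from scipy.interpolate import RegularGridInterpolator

lon = np.linspace(-20.0, 15.0, 380)
lat = np.linspace(-10.0, 10.0, 220)
t = np.arange(365.0)
u = np.random.rand(lon.size, lat.size, t.size)

# scipy version, defined as quoted above
fu_w = RegularGridInterpolator((lon, lat, t), u, method="linear",
                               bounds_error=False)

# xarray version of the same field
da = xr.DataArray(u, coords={"lon": lon, "lat": lat, "t": t},
                  dims=("lon", "lat", "t"))

lon_i, lat_i, t_i = 0.5, 1.2, 100.3   # one point in the middle of the domain
print(fu_w((lon_i, lat_i, t_i)))
print(da.interp(lon=lon_i, lat=lat_i, t=t_i).item())   # should agree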
The next main goal in Parcels development is to obtain an efficient parallel version. To do so, let's present here the code structure, its strengths and weaknesses, and how we aim to tackle the different challenges. The point of this issue is to open the discussion, such that we use the latest available libraries and best Python practice in Parcels development. This discussion follows on from previous Parcels developments (PR #265) and discussions (issue #17).
particleSet.execute() structure
Potential bottlenecks
Parcels profile
To understand Parcels' general profile better, different applications will be profiled. This can be done using cProfile from Python (see the sketch below). The first profile is a 2-month simulation of floating MP in the North Sea; the results are available here: nemo_log0083_singlePrec.pdf
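For reference, a minimal way to collect such a profile (run_experiment is a placeholder for a script that builds the FieldSet/ParticleSet and calls pset.execute()):

import cProfile
import pstats

def run_experiment():
    pass   # placeholder: set up FieldSet/ParticleSet and call pset.execute()

cProfile.run("run_experiment()", "parcels.prof")
stats = pstats.Stats("parcels.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 by cumulative time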
Questions?
But this still implies loading contiguous data. So if one particle is located at 170° and another at -170°, would the full -170:170 domain have to be loaded? That would be annoying.
Parallel development
OpenMP option
This option was presented in PR #265. In the particle loop, the particles are distributed over different threads sharing the same memory. The FieldSet is shared by all threads.
Dask option
The Python way to automatically distribute the work between the cores and nodes. See this example from @willirath: https://github.com/willirath/pangeo_parcels
But how to do it efficiently for applications with large input data? Can we use Dask with some control over the particle distribution? If so, is it a good idea? How to efficiently load the input data?
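To illustrate the basic pattern (advect_segment is a stand-in for building a ParticleSet from one segment of release positions and running .execute(); the split into four segments is arbitrary):

import dask
import numpy as np

def advect_segment(lons, lats):
    # stand-in for: build ParticleSet, run pset.execute(), return positions
    return np.stack([lons, lats])

lons = np.random.uniform(-170.0, 170.0, 10000)
lats = np.random.uniform(-60.0, 60.0, 10000)

# One delayed task per particle segment; the scheduler places the tasks.
tasks = [dask.delayed(advect_segment)(lons[i::4], lats[i::4]) for i in range(4)]
results = dask.compute(*tasks)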
MPI option
This option requires the most development, but it gives a large degree of control over the load balance, i.e. how to distribute the particles over the different processors (see a simple proof of concept here: https://github.com/OceanParcels/bare-bones-parcels). The idea is the following:
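(The description ends here. Purely as a generic illustration of distributing particles over ranks, and not necessarily what bare-bones-parcels does, a minimal mpi4py sketch:)

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank takes a contiguous slice of the particle ids.
n_total = 100000
my_ids = np.arange(rank * n_total // size, (rank + 1) * n_total // size)

# ... advect the local particles against the (shared or per-rank) FieldSet ...

# Gather the final state on rank 0 for output.
all_ids = comm.gather(my_ids, root=0)
if rank == 0:
    print("gathered", sum(len(a) for a in all_ids), "particles")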