PrincetonUniversity / athena-public-version

(MOVED) Athena++ GRMHD code and adaptive mesh refinement (AMR) framework. Active repository --->
https://github.com/PrincetonUniversity/athena
BSD 3-Clause "New" or "Revised" License
160 stars 117 forks source link

Pseudorandom number generator ran2() is not thread-safe #28

Closed jlippuner closed 4 years ago

jlippuner commented 4 years ago

I was trying to run the Kelvin-Helmholtz problem with OpenMP (kh.cpp with a modified athinput.kh-slip). The simulation would immediately terminate (Terminating with Terminate Signal, or something like that). Then I compiled with the -debug flag and got a segfault before any other output. I traced the segfault to the random number generator (line 72 in ran2.cpp). I suspect that the random number generator is not thread-safe.

A quick fix is to remove the #pragma omp on line 1342 of mesh.cpp, i.e. the following fixes the issue for me:

    if (res_flag == 0) {
//#pragma omp parallel for num_threads(nthreads)
      for (int i=0; i<nmb; ++i) {
        MeshBlock *pmb = pmb_array[i];
        pmb->ProblemGenerator(pin);
        pmb->pbval->CheckUserBoundaries();
      }
    }
felker commented 4 years ago

Thanks for the report. We are aware that ran2() is not a concurrent PRNG. It can be compatible with MPI e.g. when used as in pgen/turb.cpp, but in that particular case, each rank iterates over the same global sequence produced by the same rseed (a global variable defined on each MPI rank).

We should add an explicit trap to prevent this routine from being called inside a threaded region, and document it in the code and Wiki.

And eventually a thread-safe PRNG (Mersenne twister?) should be added to the code.

beresnyak commented 4 years ago

Std:: already have thread-safe RNGs you can do

std::mt19937 rng;
std::uniform_real_distribution<double> ran2(0.0, 1.0);
rng.seed(something_uint);

and sample numbers as value=ran2(rng)

felker commented 4 years ago

@jlippuner can you post your modified input file and exact configure/run commands?

felker commented 4 years ago

In terms of preventing runtime errors for those who call ran2() within their problem generator, I believe that we could simply make the 3x static variables in the function threadprivate. Or we could put a #pragma omp critical around the body of the function. Thoughts @jmstone ?

jlippuner commented 4 years ago

@felker Thanks for your quick response and I apologize for my delayed one. Here is my config:

./configure.py --prob kh -omp -hdf5 --include /usr/include/hdf5/serial/ --lib_path /usr/lib/x86_64-linux-gnu/hdf5/serial/
Your Athena++ distribution has now been configured with the following options:
  Problem generator:          kh
  Coordinate system:          cartesian
  Equation of state:          adiabatic
  Riemann solver:             hllc
  Magnetic fields:            OFF
  Number of scalars:          0
  Special relativity:         OFF
  General relativity:         OFF
  Frame transformations:      OFF
  Self-Gravity:               OFF
  Super-Time-Stepping:        OFF
  Shearing Box BCs:           OFF
  Debug flags:                OFF
  Code coverage flags:        OFF
  Linker flags:                -L/usr/lib/x86_64-linux-gnu/hdf5/serial/  -lhdf5
  Floating-point precision:   double
  Number of ghost cells:      2
  MPI parallelism:            OFF
  OpenMP parallelism:         ON
  FFT:                        OFF
  HDF5 output:                ON
  HDF5 precision:             single
  Compiler:                   g++
  Compilation command:        g++  -O3 -std=c++11 -fopenmp -I/usr/include/hdf5/serial/

My modified athinput.kh-slip is:

<comment>
problem   = Kelvin-Helmholtz instability
reference =
configure = --prob=kh

<job>
problem_id = kh-slip_8   # problem ID: basename of output filenames

<output1>
file_type  = hdf5       # Binary data dump
variable   = prim      # variables to be output
dt         = 0.1       # time increment between outputs

<time>
cfl_number = 0.4       # The Courant, Friedrichs, & Lewy (CFL) Number
nlim       = 100000    # cycle limit
tlim       = 5.0       # time limit
integrator  = vl2      # time integration algorithm
xorder      = 2        # order of spatial reconstruction
ncycle_out  = 1        # interval for stdout summary info

<mesh>
nx1        = 128       # Number of zones in X1-direction
x1min      = -1.0      # minimum value of X1
x1max      =  1.0      # maximum value of X1
ix1_bc     = periodic  # inner-X1 boundary flag
ox1_bc     = periodic  # inner-X1 boundary flag

nx2        = 64        # Number of zones in X2-direction
x2min      = -0.5      # minimum value of X2
x2max      = 0.5       # maximum value of X2
ix2_bc     = periodic  # inner-X2 boundary flag
ox2_bc     = periodic  # inner-X2 boundary flag

nx3        = 8        # Number of zones in X3-direction
x3min      = -0.125      # minimum value of X3
x3max      = 0.125       # maximum value of X3
ix3_bc     = periodic  # inner-X3 boundary flag
ox3_bc     = periodic  # inner-X3 boundary flag

refinement     = adaptive
numlevel       = 3
deref_count    = 5

num_threads = 48

<meshblock>
nx1 = 8
nx2 = 8
nx3 = 8

<hydro>
iso_sound_speed = 1.0
gamma           = 1.4       # gamma = C_p/C_v

<problem>
iprob = 1
amp   = 0.2
drat  = 2.0
vflow = 0.5
thr = 0.5

Here is the output I get:

./athena -i athinput.kh-slip 
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.

Setup complete, entering main loop...

cycle=0 time=0.0000000000000000e+00 dt=6.3237907014130560e-04
cycle=1 time=6.3237907014130560e-04 dt=6.3204117213331874e-04

Terminating on Terminate signal
time=6.3237907014130560e-04 cycle=1
tlim=5.0000000000000000e+00 nlim=100000

Number of MeshBlocks = 2368; 2240  created, 0 destroyed during this simulation.

zone-cycles = 1212416
cpu time used  = 1.5651121000000000e+01
zone-cycles/cpu_second = 7.7465122146841750e+04

omp wtime used = 3.3088790100009646e-01
zone-cycles/omp_wsecond = 3.6641291396134989e+06

With this small change, it works (this is a naive solution of just turning off OpenMP when calling the problem generator):

diff --git a/src/mesh/mesh.cpp b/src/mesh/mesh.cpp
index c97ce9e..7d0ce0a 100644
--- a/src/mesh/mesh.cpp
+++ b/src/mesh/mesh.cpp
@@ -1339,7 +1339,7 @@ void Mesh::Initialize(int res_flag, ParameterInput *pin) {
     }

     if (res_flag == 0) {
-#pragma omp parallel for num_threads(nthreads)
+//#pragma omp parallel for num_threads(nthreads)
       for (int i=0; i<nmb; ++i) {
         MeshBlock *pmb = pmb_array[i];
         pmb->ProblemGenerator(pin);
felker commented 4 years ago

Fixed upstream in the private repository for the next public release. Here is the patch that you can apply locally:

diff --git a/src/utils/ran2.cpp b/src/utils/ran2.cpp
index defedf1..18b99b6 100644
--- a/src/utils/ran2.cpp
+++ b/src/utils/ran2.cpp
@@ -45,6 +45,7 @@ double ran2(std::int64_t *idum) {
   static std::int64_t idum2=123456789;
   static std::int64_t iy=0;
   static std::int64_t iv[NTAB];
+#pragma omp threadprivate(iy,iv,idum2)
   double temp;

   if (*idum <= 0) { // Initialize
jmstone commented 4 years ago

I like this solution. Thanks Kyle!

On Oct 30, 2019, at 11:24 AM, Kyle Gerard Felker notifications@github.com wrote:

Fixed upstream in the private repository for the next public release. Here is the patch that you can apply locally:

diff --git a/src/utils/ran2.cpp b/src/utils/ran2.cpp index defedf1..18b99b6 100644 --- a/src/utils/ran2.cpp +++ b/src/utils/ran2.cpp @@ -45,6 +45,7 @@ double ran2(std::int64_t *idum) { static std::int64_t idum2=123456789; static std::int64_t iy=0; static std::int64_t iv[NTAB]; +#pragma omp threadprivate(iy,iv,idum2) double temp;

if (*idum <= 0) { // Initialize — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/PrincetonUniversity/athena-public-version/issues/28?email_source=notifications&email_token=ABB3XTUMVSSDMG3PT452YOTQRGRMHA5CNFSM4JEUDBWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECUTYRA#issuecomment-547961924, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABB3XTVHP6XRXSQML5H7FKTQRGRMHANCNFSM4JEUDBWA.

jlippuner commented 4 years ago

This solution works for me. Thanks!

jlippuner commented 4 years ago

But it just occurred to me: are the random number generators in each thread now independent of each other or will they all produce the same sequence of numbers?

felker commented 4 years ago

The PRNG is seeded with std::int64_t iseed = -1 - gid; based on the global index of each MeshBlock in kh.cpp, and the Athena++ threading strategy is over MeshBlocks on a single MPI rank, so each thread should have separate sequences of numbers for each of its MeshBlocks.

jlippuner commented 4 years ago

Great!

beresnyak commented 4 years ago

Actually, using threadprivate is extremely tricky and is normally avoided. For example, in the manual https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf you can find "The address of a threadprivate variable is not an address constant" - the program can create or destroy these variables depending on the number of currently running threads. These are not static variables in an ordinary sense. And if the threads are dynamic it is explicitly prohibited to use threadprivate (page 151 line 26).

Why use this particular RNG implementation in the problem generator? Just copy one from GSL or use standard library.

felker commented 4 years ago

For @jlippuner's particular case, all calls to ran2() occur in the same parallel construct, and the omp parallel for uses the implementation-defined default scheduling type (probably static), so I dont think it is an issue here.

I agree with you that a different PRNG should be used , in general. But we leave that to application users with that specific, more complex need. Here, I just wanted to ensure that the simple problem generator compiles and runs correctly with OpenMP.