Fixed SP test issue - Githubissues

whannah1 commented 6 years ago

ERP tests were failing for SP compsets at the COMPARE_base_rest stage. Adding a new Mersenne Twister RNG module to set the CRM perturbations alleviated the issue.

Some rambling details:

In an unrelated effort for implementing a random walk to make the CRM orientation evolve over time I stumbled onto a concerning discrepancy between the behavior of the fortran intrinsic random number generator (RNG) when compiling with gfortran and pgi. Off-line results from the pgi compiler showed that the random numbers were scaled by the seed value, and so small seed values resulted in near zero random values. The solution is to build in a RNG so that the results are consistent across compilers. Since I was already doing this, I figured I'd see whether the test results would be affected by using this new RNG to set the CRM perturbations in setperturb(). I also changed the condition that decides whether to call setperturb() and based it on the logical "is_first_step" instead of the old method that compared a few arbitrary CRM cells to see if they were equal. These changes resulted in a passing test of COMPARE_base_rest, however I'm lacking a complete explanation of why this worked. In theory, neither change chould have affected this comparison as long as the intrinsic RNG got the same seed and gave the same results, even if those values were near zero. The difference may lie in how the intrinsic RNG was called such that the string of random numbers was "remembered" from the previous seed, in which case changing the processor count may result in a different sequence. Either way, the new RNG is much more robust and reliable. The new RNG uses the Mersenne Twister algorithm. There is already a Mersenne Twister RNG implemented elsewhere in the model, but it is cumbersome to get single values, so I think the current approach is preferable for the CRM.

mrnorman commented 6 years ago

@worleyph , in the past, you provided me with a PRNG for this exact same problem. Is it, by chance, better than the one @whannah1 put in this PR? I think you had grabbed it from a fusion code or something.

worleyph commented 6 years ago

@mrnorman , sorry for the delay in responding - I'm on personal travel at the moment.

I'll have to read up on this issue and also remind myself what I did earlier (and why). I know that the atmosphere group preferred their own random generator, which was also already either in the code or in their branch, because they were comfortable with it and it returned a vector of results?

worleyph commented 6 years ago

From e-mails (November-December 2014):

 For the fusion project I work with (particle in cell code) they have a random number 
 generator  that works well as far as I know, and I modified it to be threadsafe and 
 also "reproducible"  when changing process and thread counts. I can ask the PI 
 whether it is okay to share this if  you are interested in trying something totally 
 different.

then

 I've asked permission from the PI of the EPSi project, but they are using an in-lined 
 open source code any way:

 MODULE Ecuyer_random
 ! L'Ecuyer's 1996 random number generator.
 ! Fortran version by Alan.Miller @ vic.cmis.csiro.au
 ! N.B. This version is compatible with Lahey's ELF90
 ! http://www.ozemail.com.au/~milleraj
 ! Latest revision - 30 March 1999

 though I have made some minor modifications to it for use in a parallel environment.

then

 I did a quick search, and the L'Ecuyer random number generators are considered 
 at least as good as those of Marsaglia (the basis of the fortran instrinsic), so that 
 is reassuring. This code is pretty old though, but then we are not doing 
 cryptography, so our requirements are probably fine for 1990s technology ;-).

then

 Hi Matt,
     From ~worley/ACPI/SVN/ACME/mrnorman/cam-se/perf-sensitivity/ACME/models/atm/cam/src/dynamics/se grab

 inidat.F90
 random_xgc.F90

 then after you create a new case, edit the Macros file to add '-D USE_RANX' , e.g.

 CPPDEFS+= -DFORTRANUNDERSCORE -DNO_SHR_VMATH -DNO_R16   -DLINUX  -DCNL -DCPRPGI -DUSE_RANX

and then Matt's PR #56 (https://github.com/ACME-Climate/ACME/pull/56).

Matt's instructions are a little different than mine (new_random = .true. ?, other namelist arguments?) but random_xgc.F90 is still in the model, so is probably available to be tried.

whannah1 commented 6 years ago

@worleyph, is there an advantage of using Tristen's RNG over the Mersenne Twister algorithm?

worleyph commented 6 years ago

Note that random_xgc.F90 supports reproducbile random numbers (if use the same seed) in a parallel code, including threading. In particular:

  ! if ranx is to be executed in a threaded region,
  ! then each thread must set its own seed, otherwise
  ! each thread generates the identical sequence

Matt implemented this correctly (at the time), but, if this part of the code has changed over time, you should be aware of this.

whannah1 commented 6 years ago

My version seeds based on the global column number, so it should be reproducible across different PE layouts.

worleyph commented 6 years ago

@whannah1 ,

is there an advantage of using Tristen's RNG over the Mersenne Twister algorithm?

No idea. I was just responding to @mrnorman 's request. When we last ran into the problems with the Fortran intrinsic RNG (PGI in particular), I had access to this code, used in a production plasma microturbulence code. I was familiar with it as I had modified it to make it threadsafe. Other than that, I have no special knowledge. It is already in E3SM however, for whatever that is worth.

mrnorman commented 6 years ago

It being threadsafe is pretty nice. I dunno if we'll need that, but you never know.

whannah1 commented 6 years ago

If it doesn't matter to you guys I'd prefer to use this new one. I like how simple the interface is for getting small sets of numbers and I've already been using it for another part of the CRM.

What's involved in making it "thread safe" aside from making sure the seed is the same for each column?

worleyph commented 6 years ago

@whannah1 ,

is there an advantage of using Tristen's RNG over the Mersenne Twister algorithm?

No idea. I was just responding to @mrnorman 's request. When we last ran into the problems with the Fortran intrinsic RNG (PGI in particular), I had access to this code, used in a production plasma microturbulence code. I was familiar with it as I had modified it to make it threadsafe. Other than that, I have no special knowledge. It is already in E3SM however, for whatever that is worth.

mrnorman commented 6 years ago

It shouldn't be too hard to do if we need to. We just need to put locks (i.e., "!$omp critical")around certain parts of the code to make sure two threads don't grab the same PRN or seed, etc.

worleyph commented 6 years ago

My version seeds based on the global column number, so it should be reproducible across different PE layouts.

My connection is going in and out - second time trying to send this.

Just curious, but not enough to read the code :-), how do you generate separate sequences for each column when multiple columns are assigned to a thread (even thread 0)?

worleyph commented 6 years ago

It shouldn't be too hard to do if we need to. We just need to put locks (i.e., "!omp critical")around certain parts of the code to make sure two threads don't grab the same PRN or seed, etc.

I provided thread-specific random number sequences, but this also required making sure that the same thread handles a column (for the fusion code, particles) during each timestep. Not hard to do, but eliminates any dynamics OpenMP scheduling algorithms.

whannah1 commented 6 years ago

Oh I think I get it. So when I set the seed, if there are multiple threads then there's no guarantee that another thread will re-set the seed before I actually get the random sequence I need.

For now that doesn't matter because the CRM has to be run without threading. So I think we're ok.

mrnorman commented 6 years ago

Would dynamic OMP schedules work efficiently for us? We'd have to have significant load imbalance and probably significantly more work for that to become practical, I think. We should be safe either way.

E3SM-Project / ACME-ECP

Fixed SP test issue #30