NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0
compiler bug: broken random number generator with cce on Derecho. #495

Closed mjs2369 closed 9 months ago

mjs2369 commented 1 year ago

Describe the bug

The random number generator code in DART will not compile with cce on Derecho. Edit: the code compiles, but gives incorrect results. More specifically, the subroutines init_ran, ran_unif, ran_gauss, and ran_gamma are all incompatible with cce. I believe this is because they all make use of code from the GNU Scientific Library:

[two images]

  1. List the steps someone needs to take to reproduce the bug.
    Run ./filter with any model with "perturb_from_single_instance = .true." in the namelist OR run ./test_gaussian or ./test_gamma in DART/developer_tests/random_seq/work.

  2. What was the expected outcome? The executables run successfully.

  3. What actually happened?
    A run-time error halts the execution.

Error Message

Please provide any error messages.

ERROR FROM:
  source : random_seq_mod.f90
  routine: ran_gauss
  message: if both x and y are -1, random number generator probably not initialized
  message: ... x, y = -3510081565.7593699, -295496494.86667526

[three images]

The actual mean should be close to 0.50. [image]

Which model(s) are you working with?

All models, also the test_gaussian, test_random, and test_gamma developer tests in DART/developer_tests/random_seq/work.

Version of DART

Which version of DART are you using? You can find the version using git describe --tags

v10.7.3

Have you modified the DART code?

No

Build information

Please describe:

  1. The machine you are running on (e.g. windows laptop, NCAR supercomputer Cheyenne).
  2. The compiler you are using (e.g. gnu, intel).

Derecho, cce

hkershaw-brown commented 1 year ago

@mjs2369 just chatting to Jeff about this. A better test to look at what is going on is to generate the sequence of random numbers from a given seed. So k is what we are interested in:

https://github.com/NCAR/DART/blob/502af7865e258530cba7c7daff6aed4390cbb649/assimilation_code/modules/utilities/random_seq_mod.f90#L415-L419

and this should be the same across compilers.
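A compiler-independent reference for this kind of comparison can be sketched in Python. The sketch assumes that DART's generator follows the standard MT19937 (Mersenne Twister) algorithm as implemented in the GNU Scientific Library, which is consistent with the GSL attribution earlier in this issue. The function name `mt19937_sequence` is hypothetical, and whether its output for a given seed matches DART's k values has not been checked here.

```python
# Minimal MT19937 sketch (assumption: DART's random_seq_mod follows this algorithm).
N, M = 624, 397
UPPER_MASK = 0x80000000
LOWER_MASK = 0x7FFFFFFF
FULL32_MASK = 0xFFFFFFFF
MAGIC = 0x9908B0DF
C1 = 0x9D2C5680
C2 = 0xEFC60000


def mt19937_sequence(seed, count):
    """Return the first `count` 32-bit outputs of standard MT19937."""
    # Seed the state with the standard linear recurrence, keeping 32 bits.
    mt = [seed & FULL32_MASK]
    for i in range(1, N):
        prev = mt[-1]
        mt.append((1812433253 * (prev ^ (prev >> 30)) + i) & FULL32_MASK)

    out = []
    idx = N  # force a state regeneration before the first output
    for _ in range(count):
        if idx >= N:
            # Regenerate all N state words in place.
            for i in range(N):
                y = (mt[i] & UPPER_MASK) | (mt[(i + 1) % N] & LOWER_MASK)
                mt[i] = mt[(i + M) % N] ^ (y >> 1) ^ (MAGIC if y & 1 else 0)
            idx = 0
        k = mt[idx]
        idx += 1
        # Tempering: the masks are 32-bit, so k never exceeds 32 bits.
        k ^= k >> 11
        k ^= (k << 7) & C1
        k ^= (k << 15) & C2
        k ^= k >> 18
        out.append(k)
    return out


if __name__ == "__main__":
    for k in mt19937_sequence(13, 11):
        print("k=", k)
```

Because every mask is a positive 32-bit value, each k produced this way lies in the range 0 to 4294967295, which gives a simple sanity check for any compiler's output.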

hkershaw-brown commented 1 year ago

e.g. https://github.com/NCAR/DART/tree/rand_test

hkershaw-brown commented 1 year ago

For the curious: gfortran, intel on my mac and derecho give this for k. 11 numbers seeded with 13:

k= 3340206418
k= 2608511152
k= 1020231754
k= 3691240976
k= 3540249318
k= 3835331426
k= 4147861236
k= 769458329
k= 4177289964
k= 3258093498
k= 1947549667

cce on derecho gives:

k= -5939786187531372199
k= -7603175559541411156
k= 2499092022097743661
k= -6392185873013553955
k= 1418358412448069790
k= 1601992904522816967
k= 4918056359950545492
k= -7859870468495140367
k= -5366954201424499693
k= 4633693547982415675
k= -5357398119243707470
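A quick way to see how the two sequences differ is to check whether each k fits in an unsigned 32-bit integer. The sketch below is only a check on the numbers quoted above; the helper name `fits_uint32` is hypothetical.

```python
# k values quoted in this comment (gfortran/intel vs. cce on Derecho).
good = [3340206418, 2608511152, 1020231754, 3691240976, 3540249318,
        3835331426, 4147861236, 769458329, 4177289964, 3258093498,
        1947549667]
cce = [-5939786187531372199, -7603175559541411156, 2499092022097743661,
       -6392185873013553955, 1418358412448069790, 1601992904522816967,
       4918056359950545492, -7859870468495140367, -5366954201424499693,
       4633693547982415675, -5357398119243707470]


def fits_uint32(k):
    """True if k is representable as an unsigned 32-bit integer."""
    return 0 <= k <= 0xFFFFFFFF


print("gfortran/intel all within 32 bits:", all(fits_uint32(k) for k in good))
print("cce values within 32 bits:", sum(fits_uint32(k) for k in cce), "of", len(cce))
```

All of the gfortran and intel values fit in 32 bits, while none of the cce values do. That suggests the upper 32 bits of k are not being masked off as intended under cce.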

hkershaw-brown commented 1 year ago

Chatting to Marlee, we think this might be a compiler bug:

hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ module load intel
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ cat boz_dart.f90 
program boz_dart

implicit none

integer, parameter :: i8 = SELECTED_INT_KIND(13)

! hexadecimal constants
integer(i8), parameter :: UPPER_MASK  = int(z'0000000080000000', i8) 
integer(i8), parameter :: LOWER_MASK  = int(z'000000007FFFFFFF', i8) 
integer(i8), parameter :: FULL32_MASK = int(z'00000000FFFFFFFF', i8) 
integer(i8), parameter :: magic       = int(z'000000009908B0DF', i8) 
integer(i8), parameter :: C1          = int(z'000000009D2C5680', i8) 
integer(i8), parameter :: C2          = int(z'00000000EFC60000', i8) 

write(*, '(a, i20, 1x, z16)') "UPPER_MASK  =", UPPER_MASK, UPPER_MASK
write(*, '(a, i20, 1x, z16)') "LOWER_MASK  =", LOWER_MASK, LOWER_MASK
write(*, '(a, i20, 1x, z16)') "FULL32_MASK =", FULL32_MASK, FULL32_MASK
write(*, '(a, i20, 1x, z16)') "magic       =", magic, magic
write(*, '(a, i20, 1x, z16)') "C1          =", C1, C1
write(*, '(a, i20, 1x, z16)') "C2          =", C2, C2

end program boz_dart
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ module load intel
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ftn boz_dart.f90 
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ./a.out 
UPPER_MASK  =          2147483648         80000000
LOWER_MASK  =          2147483647         7FFFFFFF
FULL32_MASK =          4294967295         FFFFFFFF
magic       =          2567483615         9908B0DF
C1          =          2636928640         9D2C5680
C2          =          4022730752         EFC60000
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ module load cce

Lmod is automatically replacing "intel/2023.0.0" with "cce/15.0.1".

Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.25     2) hdf5/1.12.2     3) ncarcompilers/1.0.0     4) netcdf/4.9.2

hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ftn boz_dart.f90 
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ./a.out 
UPPER_MASK  =         -2147483648 FFFFFFFF80000000
LOWER_MASK  =          2147483647         7FFFFFFF
FULL32_MASK =                  -1 FFFFFFFFFFFFFFFF
magic       =         -1727483681 FFFFFFFF9908B0DF
C1          =         -1658038656 FFFFFFFF9D2C5680
C2          =          -272236544 FFFFFFFFEFC60000
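
The cce values above are consistent with each 32-bit hexadecimal constant being treated as a signed 32-bit integer and then sign-extended to 64 bits. This is an inference from the printed output, not a confirmed diagnosis of the compiler's behaviour. The Python sketch below reproduces the cce numbers under that assumption; the helper name `sign_extend_32_to_64` is hypothetical.

```python
# The six constants from boz_dart.f90, as 32-bit patterns.
CONSTANTS = {
    "UPPER_MASK": 0x80000000,
    "LOWER_MASK": 0x7FFFFFFF,
    "FULL32_MASK": 0xFFFFFFFF,
    "magic": 0x9908B0DF,
    "C1": 0x9D2C5680,
    "C2": 0xEFC60000,
}


def sign_extend_32_to_64(bits):
    """Interpret a 32-bit pattern as signed, then widen it to 64 bits.

    Returns (signed value, 64-bit two's-complement bit pattern).
    """
    value = bits - (1 << 32) if bits & 0x80000000 else bits
    return value, value & 0xFFFFFFFFFFFFFFFF


for name, bits in CONSTANTS.items():
    value, pattern = sign_extend_32_to_64(bits)
    print(f"{name:<11} = {value:>20} {pattern:>16X}")
```

Only LOWER_MASK has its top bit (bit 31) clear, which matches it being the one constant that prints identically under both compilers.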

hkershaw-brown commented 1 year ago

@mjs2369 Hi Marlee, did a bug report for this get sent to cray (by you or CISL help)?

mjs2369 commented 1 year ago

@hkershaw-brown I don't believe so. I have a request open with CISL help, currently under "Support wait", where they said they were going to reach out to their contact for any input/fixes, but I haven't heard back from them. I just added another comment to the request to see if there are any updates.

mjs2369 commented 1 year ago

@hkershaw-brown

Update on this issue: CISL Support responded to my request after contacting HPE/Cray.

This bug was patched in the latest release of CCE. CISL IT is working to get this installed once HPE has fixed more bugs that others have reported as well. Once a 16.x.x version has been added to the stack on Derecho, I will revisit this pull request to test and hopefully close it.

In the meantime, we will need to keep using Intel to use the random number generator code, and therefore perturb_from_single_instance, on Derecho.

hkershaw-brown commented 11 months ago

No new version of CCE on Derecho as of Jan 2023. Closing as this is a CCE bug rather than a DART bug.

hkershaw-brown commented 9 months ago

@c-merchant A new 🎉 cce compiler version, cce/16.0.1, is now available on Derecho. Can you give your ran_unif test a spin on Derecho with this new compiler?

For reference, here's your pull request with the random number test: https://github.com/NCAR/DART/pull/549 Let's see if cce/16 has the bug fixed.

hkershaw-brown commented 9 months ago

This bug is fixed in cce/16.0.1, now available on Derecho.