E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

non-bfb results with SMS_D_PMx2_Ln5.ne30_oECv3.F2010 --compiler=gnu (DEBUG plus threading) #5875

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

On GCP12 (using next of Aug10), and also on pm-cpu (see comments below), there is a test showing different results between runs. It seems to happen with GNU, DEBUG, and threading.

SMS_D_Ln5.ne30_oECv3.F2010.gcp12_gnu.C.20230810_184148_icsnhz/run/atm.log.25396.230810-185308.gz: nstep, te        6   0.26327188655393944E+10   0.26327220866024842E+10   0.17809334694391188E-03   0.98542160462986372E+05
SMS_D_Ln5.ne30_oECv3.F2010.gcp12_gnu.G.20230810_171201_yb8tm7/run/atm.log.25376.230810-172936.gz: nstep, te        6   0.26327191183128428E+10   0.26327222939713116E+10   0.17558291039020528E-03   0.98542162828852393E+05
SMS_D_Ln5.ne30_oECv3.F2010.gcp12_gnu.r00/run/atm.log.25545.230811-160939.gz: nstep, te        6   0.26327190694097590E+10   0.26327222492494040E+10   0.17581408918316275E-03   0.98542162509758753E+05
SMS_D_Ln5.ne30_oECv3.F2010.gcp12_gnu.r01/run/atm.log.25543.230811-160939.gz: nstep, te        6   0.26327190279278388E+10   0.26327222114278817E+10   0.17601647382655078E-03   0.98542162303609221E+05
SMS_D_Ln5.ne30_oECv3.F2010.gcp12_gnu.r02/run/atm.log.25544.230811-160939.gz: nstep, te        6   0.26327190697470407E+10   0.26327222474688978E+10   0.17569699498611472E-03   0.98542163185255980E+05
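
Comparing these records by eye is error-prone, so here is a minimal sketch of how the check could be scripted. It assumes the `nstep, te` record has the shape seen in the lines above (a step number followed by whitespace-separated values); the helper names `te_records` and `first_divergence` are hypothetical, not part of E3SM or CIME. Values are compared as printed strings, so any difference in the logged digits counts as non-bfb.

```python
import re

# Assumed record shape, taken from the log lines quoted above:
# "nstep, te <step> <value> <value> ..."
TE_LINE = re.compile(r"nstep, te\s+(\d+)\s+(.*)$")


def te_records(lines):
    """Map each step number to the tuple of value strings printed for it."""
    records = {}
    for line in lines:
        match = TE_LINE.search(line)
        if match:
            records[int(match.group(1))] = tuple(match.group(2).split())
    return records


def first_divergence(lines_a, lines_b):
    """Return the first step whose printed values differ, or None if all match."""
    rec_a, rec_b = te_records(lines_a), te_records(lines_b)
    for step in sorted(rec_a.keys() & rec_b.keys()):
        if rec_a[step] != rec_b[step]:
            return step
    return None


# Step-6 records copied from the first two runs listed above.
run_c = ["nstep, te        6   0.26327188655393944E+10   0.26327220866024842E+10   0.17809334694391188E-03   0.98542160462986372E+05"]
run_g = ["nstep, te        6   0.26327191183128428E+10   0.26327222939713116E+10   0.17558291039020528E-03   0.98542162828852393E+05"]

print(first_divergence(run_c, run_g))
```

For the real compressed logs, the lines could be read with `gzip.open(path, "rt")` and passed in directly.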

The following 2 tests were also giving me trouble when comparing to baselines (resolved with a PR from Wuyin). The nstep values written in the atm.log files are the same between multiple runs; however, cprnc on the resulting eam netcdf files reports DIFFERENT, which I can't explain.

SMS_Lm1.ne4_oQU240.F2010.gcp12_gnu
SMS_Ly1.ne4_oQU240.F2010.gcp12_gnu

Therefore, for these 3 tests, I'm unable to bless the results and we still see 3 baseline fails.

ndkeen commented 1 year ago

When I apply the fix in PR #5886, I do see that the following two tests are now passing compare (where otherwise the eam output was not correct):

SMS_Lm1.ne4_oQU240.F2010.gcp12_gnu
SMS_Ly1.ne4_oQU240.F2010.gcp12_gnu

However, the other test (named in title) still shows non-bfb behavior with consecutive runs.

ndkeen commented 1 year ago

I still see the same non-bfb results with the test named in the title, but I do have a little more info. With an OPT build, it seems OK. Without threading, it also seems OK in both DEBUG and OPT builds. By "seems OK", I just mean two consecutive runs have identical results in atm.log. So it sounds like the issue only appears with DEBUG and threading together.

ndkeen commented 1 year ago

OK, the tests on pm-cpu do not show this issue because they are not using threads. When I ask for threads, I see the same issue as on GCP: non-bfb behavior in DEBUG builds.

SMS_P128x2_D_Ln5.ne30_oECv3.F2010.pm-cpu_gnu

And as an easier single test, this will fail: ERS_D_P128x2_Ln5.ne30_oECv3.F2010.pm-cpu_gnu

ndkeen commented 1 year ago

I can also reproduce on chrysalis; it just needs threading and GNU. These two cases are BFB at step=0 and step=1 (judging only by atm.log), but differ at step=2.

/lcrc/group/e3sm/ac.ndkeen/scratch/chrys/m35-aug22/SMS_D_P128x2_Ln5.ne30_oECv3.F2010.chrysalis_gnu.20230828_200811_wc2zbg
/lcrc/group/e3sm/ac.ndkeen/scratch/chrys/m35-aug22/SMS_D_P128x2_Ln5.ne30_oECv3.F2010.chrysalis_gnu.20230828_200757_1xnput

Oh wait, zdiffing the log files, I see a difference much earlier on. Could this be a clue to someone who knows what is writing these lines?

chrlogin1% zdiff /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/m35-aug22/SMS_D_P128x2_Ln5.ne30_oECv3.F2010.chrysalis_gnu.20230828_200811_wc2zbg/run/atm.log.379557.230828-201620.gz /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/m35-aug22/SMS_D_P128x2_Ln5.ne30_oECv3.F2010.chrysalis_gnu.20230828_200757_1xnput/run/atm.log.379558.230828-201620.gz | mo
152c152
<   ********** CASE = SMS_D_P128x2_Ln5.ne30_oECv3.F2010.chrysalis_gnu.20230828_200811_wc2zbg **********
---
>   ********** CASE = SMS_D_P128x2_Ln5.ne30_oECv3.F2010.chrysalis_gnu.20230828_200757_1xnput **********
8345,8346c8345,8346
<  qv( 56)=   0.2096139035114711E-19  0.1280848678281036E-07  0.1767874819694057E-03
<  qv( 57)=   0.3873632091571628E-18  0.8967851401500604E-09  0.4058933140593721E-04
---
>  qv( 56)=   0.2096139035051049E-19  0.1280848678280896E-07  0.1767874819740639E-03
>  qv( 57)=   0.3873632091571628E-18  0.8967851401500604E-09  0.4058933140596892E-04
8348,8349c8348,8349
<  qv( 59)=   0.1133289316825623E-21  0.1787397675075774E-08  0.1685042579143148E-03
<  qv( 60)=   0.8043324805296895E-23  0.2534140746239502E-07  0.1695395478318163E-03
---
>  qv( 59)=   0.1133289316825623E-21  0.1787397675078643E-08  0.1685042579096250E-03
>  qv( 60)=   0.8043328226409883E-23  0.2534140746707827E-07  0.1695394322629078E-03
...
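
To get a feel for how large these differences are, here is a small sketch that computes per-column relative differences for one pair of `qv(...)` lines from the diff above. The parsing pattern is an assumption based only on the lines shown; what the three columns represent is not known from this log, and the helper names `parse_qv` and `relative_diffs` are hypothetical.

```python
import re

# Assumed line shape, based on the diff output above: "qv( 56)=  a  b  c"
QV_LINE = re.compile(r"qv\(\s*(\d+)\)=\s*(.*)$")


def parse_qv(line):
    """Return (index, [values]) for one qv log line."""
    match = QV_LINE.search(line)
    if match is None:
        raise ValueError(f"not a qv line: {line!r}")
    return int(match.group(1)), [float(tok) for tok in match.group(2).split()]


def relative_diffs(line_a, line_b):
    """Relative difference per column between two qv lines with the same index."""
    idx_a, vals_a = parse_qv(line_a)
    idx_b, vals_b = parse_qv(line_b)
    if idx_a != idx_b or len(vals_a) != len(vals_b):
        raise ValueError("lines do not describe the same qv entry")
    return [
        0.0 if x == y else abs(x - y) / max(abs(x), abs(y))
        for x, y in zip(vals_a, vals_b)
    ]


# The qv(56) pair copied from the zdiff output above.
left = "<  qv( 56)=   0.2096139035114711E-19  0.1280848678281036E-07  0.1767874819694057E-03"
right = ">  qv( 56)=   0.2096139035051049E-19  0.1280848678280896E-07  0.1767874819740639E-03"

for column, diff in enumerate(relative_diffs(left, right)):
    print(f"column {column}: relative difference {diff:.3e}")
```

For this pair, every column differs by a tiny fraction of its value, so the runs have only just started to drift apart at this point in the log.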

And trying the same atm.log zdiff on pm-cpu, there is a difference even before that:

2439c2439
<  chemini: f107,f107a =    1.1492717671597093E-310   2.1194546651052796E-317
---
>  chemini: f107,f107a =    1.1296010226493952E-310   2.1194546651052796E-317

which may be easier to find in the Fortran. Note I also tried the same test with gcc12.2.0 and see the same thing.

       !-----------------------------------------------------------------------
       !        ... initialize the solar parameters module
       !-----------------------------------------------------------------------
       call solar_parms_get( f107_s = f107, f107a_s = f107a )
       if (masterproc) write(iulog,*) 'chemini: f107,f107a = ',f107,f107a

OK, solar_parms_on is FALSE, so the routine simply returns without setting the f107 variables, and whatever happens to be in memory gets printed. But that is not the issue here. I will have to find where qv() is printed.
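
As a quick sanity check on the "never set" reading, the sketch below looks at the bit patterns of the printed f107 values. All three are subnormal doubles (exponent field all zeros), so the stored 64-bit pattern is just a relatively small integer, which is consistent with leftover memory contents rather than a computed quantity. This only inspects the numbers quoted in the log; the helper names `raw_bits` and `is_subnormal` are hypothetical.

```python
import struct
import sys


def raw_bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_subnormal(x):
    """True for nonzero doubles smaller than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


# Values copied from the 'chemini: f107,f107a' lines in the zdiff above.
printed = (
    "1.1492717671597093E-310",
    "1.1296010226493952E-310",
    "2.1194546651052796E-317",
)

for text in printed:
    value = float(text)
    print(f"{text}: subnormal={is_subnormal(value)} bits=0x{raw_bits(value):016x}")
```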

ndkeen commented 10 months ago

Trying again with master of Oct30, I get a floating-point exception instead, which I'm assuming is not related.

  3: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  3: 
  3: Backtrace for this error:
  3: #0  0x14c4d27dddbf in ???
  3: #0  0x14c4d27dddbf in ???
  3: #1  0x15072fa in __radheat_MOD_radheat_tend
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/radheat.F90:116
  3: #1  0x15072fa in __radheat_MOD_radheat_tend
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/radheat.F90:116
  3: #2  0xcdc51b in __radiation_MOD_radiation_tend
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/rrtmg/radiation.F90:1588
  3: #2  0xcdc51b in __radiation_MOD_radiation_tend
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/rrtmg/radiation.F90:1588
  3: #3  0x14a5240 in tphysbc
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:3052
  3: #4  0x14cbd4d in __physpkg_MOD_phys_run1._omp_fn.0
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:1175
  3: #3  0x14a5240 in tphysbc
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:3052
  3: #4  0x14cbd4d in __physpkg_MOD_phys_run1._omp_fn.0
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:1175
  3: #5  0x14c4d2dfe295 in GOMP_parallel
  3:    at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/parallel.c:178
  3: #6  0x14bc5e8 in __physpkg_MOD_phys_run1
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:1154
  3: #7  0x6532cf in __cam_comp_MOD_cam_run1
  3:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/control/cam_comp.F90:268